Introduction to Proteomics &
Bottom-up Proteomics
W. Andy Tao
Purdue University
What is Proteomics?
• Analysis of the entire PROTEin complement expressed by a genOME of a cell or tissue type (Marc Wilkins)
• Proteomics focuses on the state-related expression of proteins in biological samples
• Proteomics is the systematic analysis and documentation of proteins
• Proteomics identifies and quantifies proteins and determines their localization, modifications, interactions, activities, and, ultimately, their functions
Used for MS Short Course at Tsinghua by R. Graham Cooks, Hao Chen, Zheng Ouyang, Andy Tao, Yu Xia and Lingjun Li
Historical Perspective
1. 1970s: 2D gels and separation of hundreds of proteins at once
2. Mid-1990s: huge expansion
• Easy-to-use, high-performance mass spectrometers
  – MALDI, ESI
  – High resolution, sensitivity, accuracy, MS/MS
  – Protein and peptide measurement and peptide fragmentation
• Genome projects
  – Human, bacterial, plant, other animals
  – Huge searchable databases (bioinformatics)
Considerable interest arose once the transcriptome was found to be a poor predictor of the proteome.
3. 21st century: consolidation
• Post-translational modifications
• Relative and absolute quantitation
• Targeted analyses
• Biological applications
It’s all about separation and identification!
• Separation
  – Multi-dimensional
  – Gel-based or gel-free
• Identification
  – Top-down (lectured by Dr. Yu Xia)
    • Analytes are proteins
    • ECD for the fragmentation of proteins
    • Almost exclusively in FT-ICR
  – Bottom-up
    • Analytes are peptides (digested from proteins)
    • CID is the most common method for fragmenting peptides
    • In any mass spectrometer
  – Middle-down
    • Large peptides (5-20 kDa)
    • The approach is similar to top-down
• Why bottom-up, i.e., why analyze peptides?
  – Peptides are smaller and therefore easier to ionize for MS analysis
  – Fragment ion spectra are much easier to interpret
  – In general, much greater sequence coverage is possible using a mixture of peptides that are analyzed individually
  – Peptide separation techniques can reduce a complex protein sample into simpler fractions for analysis
• Enzymatic digestion of proteins
  – Sequence-specific proteases: complete digestion
  – Non-specific proteases: partial digestion
• Chemical treatment of proteins
• The basic scheme for peptide generation and analysis:
  1. Protein isolation (PAGE, chromatography, etc.), optional
  2. Reduction and alkylation
  3. Enzymatic digestion or chemical cleavage
  4. Sample cleanup and/or peptide separation (RP-HPLC)
  5. Mass spec analysis
Protein Identification by Mass Spectrometry
• Mass Fingerprint (MS)
• Tandem Mass Spectrometry
• What is it?
  – A method for identifying an unknown protein based on measurement of the masses of peptides generated from it
  – Requires a mass spectrometer with high mass accuracy, a sequenced genome for the organism the sample is derived from, and a database search algorithm
• How does it work?
  – Just as every person has a unique fingerprint pattern that can be used to identify them, every protein generates a unique set of peptides after site-specific proteolysis
  – The collective masses of these peptides are therefore also unique to the parent protein and can likewise be used to identify it
PMF by MALDI-TOF MS:
1. Gel-separated proteins
2. Extract protein, digest with an enzyme such as trypsin
3. Tryptic peptides (NH2-...-COOH, ending in K or R)
4. MALDI-TOF analysis: peptide mass fingerprint (mass mapping, intensity vs. m/z)
5. Search the list of peptide masses from the MS scan against a sequence database
6. Identified protein
Example
Step 1: Measure the peptide masses
2978.4567 2646.2992 2186.1678 1981.8621 1131.4384 830.4519 780.4978 748.3698 742.4497 646.3228 ….
Step 2: In silico digestion w/ trypsin
>sp|P02666|CASB_BOVIN Beta-casein OS=Bos taurus
MKVLILACLVALALARELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDEL
QDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPK
HKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSL
SQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV
Peptides: MK VLILACLVALALAR ELEELNVPGEIVESLSSSEESITR INKKIEK FQSEEQQQTEDELQDK IHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVK EAMAPK HK EMPFPK YPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSK VLPVPQK AVPYPQR DMPIQAFLLYQEPVLGPVR GPFPIIV
Theoretical masses: 2646.2992 2186.1678 1981.8621 830.4519 780.4978 748.3698 742.4497 646.3228
Step 3: Compare the measured masses with the theoretical masses
Measured masses: 2978.4567 2646.2992 2186.1678 1981.8621 1131.4384 830.4519 780.4978 748.3698 742.4497 646.3228 ….
Match!!
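The three steps of this example can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any particular search engine. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, monoisotopic [M+H]+ masses, and a fixed 0.01 Da tolerance. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide (free N- and C-termini)
PROTON = 1.007276   # singly protonated ion, as typically observed in MALDI


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mh_mass(peptide):
    """Monoisotopic [M+H]+ mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(measured, theoretical, tol_da=0.01):
    """Number of measured masses with a theoretical mass within tol_da."""
    return sum(
        1 for m in measured
        if any(abs(m - t) <= tol_da for t in theoretical)
    )


# Usage: a short stretch of the beta-casein sequence shown above.
peptides = tryptic_digest("EAMAPKHKEMPFPK")
theoretical = [mh_mass(p) for p in peptides]
print(peptides, count_matches([646.3228, 748.3698, 1131.4384], theoretical))
```

A real search would repeat the digestion for every database entry and rank proteins by a score, not a raw match count.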
Online database search
• Mascot: matrixscience.com
• MS-FIT: prospector.ucsf.edu
Protein Identification by Mass Spectrometry
• Mass Fingerprint
• Tandem Mass Spectrometry
• There are numerous methods for inducing peptide fragmentation
• Common methods for peptide fragmentation
  – Post-source decay (PSD), only in MALDI
  – Collision-induced dissociation (CID)
  – Electron capture dissociation (ECD)
  – Electron transfer dissociation (ETD)
• CID is by far the most common and is the only method covered here; ECD and ETD will be discussed by Dr. Xia
CID Data Acquisition
Peptide CID Fragmentation
Fragmenting a Peptide
Sequence and Tandem MS Spectrum
Fragment Ions on a Tandem MS Spectrum
Strategy for Protein Sequencing by Database Searching
[Figure: an experimental MS/MS spectrum (% vs. m/z) is compared with theoretical spectra in a correlative database sequence search; the search returns a peptide sequence (EDACLGAJK) and the identified protein]
Database: theoretical digestion (in silico); theoretical fragmentation (in silico)
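Theoretical fragmentation (in silico) can be illustrated with a short Python sketch that lists the singly charged b- and y-ion m/z values of a peptide. It assumes unmodified residues, monoisotopic masses, and only the b/y series that dominates low-energy CID. The function name is hypothetical and the peptide EACDPLR is used purely as an example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def by_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is b_i (the first i residues); y[i-1] is the complementary
    y ion containing the remaining len(peptide) - i residues.
    """
    b, y = [], []
    for i in range(1, len(peptide)):
        n_term = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        c_term = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        b.append(n_term + PROTON)           # b ion: residues + proton
        y.append(c_term + WATER + PROTON)   # y ion: residues + H2O + proton
    return b, y


# Usage: print the fragment ladder for an example peptide.
b_series, y_series = by_ions("EACDPLR")
for i, (b_mz, y_mz) in enumerate(zip(b_series, y_series), start=1):
    print(f"b{i} = {b_mz:.4f}   y{7 - i} = {y_mz:.4f}")
```

Each complementary b/y pair sums to the neutral peptide mass plus two protons, which is a quick consistency check on any theoretical spectrum.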
Parameters for Database Searching: Sequence Databases
Protein sequence databases:
• NCBI
• EMBL
• Swiss-Prot
• IPI
• UniProt
Genomic and Expressed Sequence Tag (EST) databases:
• Human genome (draft completed in Oct 2004): ~25,000
• Yeast genome (completed in April 1996): ~6,000
• Mouse genome (completed in Dec 2004): ~25,000
• Rice genome (draft completed in April 2002): >45,000
Parameters for Database Searching: Tandem MS
Mass accuracy of precursor ion
http://las.perkinelmer.com/Content/RelatedMaterials/007069_01.pdf
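Precursor mass accuracy is usually expressed in parts per million (ppm), and search engines accept a candidate peptide only if its theoretical mass falls within a user-set tolerance. The following Python sketch shows that arithmetic. The function names and the example tolerance are illustrative assumptions, not settings taken from the slides.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(measured_mz, theoretical_mz, tol_ppm):
    """True if the measured value lies within +/- tol_ppm of theory."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm


# Usage: a 0.01 Th error at m/z 1000 is a 10 ppm error.
print(ppm_error(1000.01, 1000.0))
print(within_tolerance(1000.01, 1000.0, tol_ppm=20))
```

A tighter tolerance shrinks the list of candidate peptides for each spectrum, which is why instrument mass accuracy is a search parameter.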
Parameters for Database Searching: Tandem MS
Fragmentation pattern of tandem MS
[Figure: example MS/MS spectra rated Good, Fair, Bad, and Terrible]
Parameters for Database Searching: Search Algorithm
Search engines:
• Sequest (Thermo)
• Mascot (Matrix Science)
• X!Tandem (free)
• SpectrumMill (Agilent)
References:
• Shadforth et al, Proteomics, 2005, 5, 4082-4095 (review)
• Eng et al, JASMS, 5, 976 (Sequest)
• Perkins et al, Electrophoresis, 20, 3551 (Mascot)
• Kapp et al, Proteomics, 5, 3475 (comparison)
Parameters for Database Searching: Search Algorithm
Sequest™ (patented the technique of using tandem MS and database searching for sequencing)
Preliminary score: Sp = (Σ im) · ni · (1 + β) · (1 + ρ) / nt
• ni: the number of matched ions
• im: abundances of matched ions
• β: matched consecutive ions
• ρ: immonium ions associated with amino acid residues
• nt: total number of predicted sequence ions
Scoring: cross-correlation (Xcorr)
• x(t): signals from the spectrum reconstructed from the amino acid sequence
• y(t): signals from the reconstructed experimental spectrum
• τ: displacement value between the two signals
• Final value: Cn = C(τ = 0) – mean of C(τ) over –75 < τ < 75, after normalization
• ΔCn: difference between the top two hits
Eng et al, JASMS, 5, 976.
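The cross-correlation idea can be sketched in plain Python: correlate the theoretical and experimental spectra at zero displacement, then subtract the average correlation over displacements between -75 and 75 as a background estimate. This is a simplified illustration under stated assumptions (spectra already binned to equal-length intensity lists, no preprocessing or normalization). It is not the Sequest implementation, and the function names are hypothetical.

```python
def correlation_at(x, y, tau):
    """Sum of x[t] * y[t + tau]; bins shifted out of range count as zero."""
    total = 0.0
    for t, xt in enumerate(x):
        j = t + tau
        if 0 <= j < len(y):
            total += xt * y[j]
    return total


def xcorr_like_score(theoretical, experimental, window=75):
    """Correlation at tau = 0 minus the mean over -window < tau < window."""
    at_zero = correlation_at(theoretical, experimental, 0)
    shifts = range(-window + 1, window)  # displacements strictly inside the window
    background = sum(
        correlation_at(theoretical, experimental, s) for s in shifts
    ) / len(shifts)
    return at_zero - background


# Usage: a spectrum scores higher against itself than against a shifted copy.
peak_at_100 = [0.0] * 200
peak_at_100[100] = 1.0
peak_at_150 = [0.0] * 200
peak_at_150[150] = 1.0
print(xcorr_like_score(peak_at_100, peak_at_100))
print(xcorr_like_score(peak_at_100, peak_at_150))
```

Subtracting the background penalizes spectra that correlate equally well at any displacement, which is the signature of a random match.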
Parameters for Database Searching: Search Algorithm
Mascot™
Scoring: MOWSE program & probability-based scoring
MOWSE: MOlecular Weight SEarch. Pappin DJC, Hojrup P, and Bleasby AJ (1993) Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3:327-332
Scoring is based on the peptide frequency distribution from a non-redundant database.
• Takes into account the relative abundance of peptides in the database when calculating scores.
• Protein size is compensated for.
Mascot score S = –10·log10(P), where P is the probability that the observed match is a random event
Probability-based scoring
• The probability that the observed match between experimental data and a protein sequence is a random event is approximately calculated for each protein in the sequence database.
• Probability model details not published.
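Because a probability-based score of this kind is the negative logarithm of P scaled by 10, small probabilities map to large scores. The Python sketch below shows only that conversion. How Mascot computes P itself is not published, so the probability is treated as a given input and the function name is hypothetical.

```python
import math


def score_from_probability(p_random):
    """Convert the probability of a random match into S = -10 * log10(P)."""
    if not 0.0 < p_random <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p_random)


# Usage: each tenfold drop in P adds 10 to the score.
for p in (1e-2, 1e-5, 1e-10):
    print(p, score_from_probability(p))
```

The log scale makes scores additive across independent peptide matches, since multiplying probabilities corresponds to adding scores.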
Comparison Between MS/MS Search Algorithms
Heuristic algorithms: Sequest, Spectrum Mill, X!Tandem
Probabilistic algorithms: MASCOT, PeptideProphet (rescoring algorithm)
Kapp et al, Proteomics, 5, 3475.
Identification of Post-translational Modifications
Static
• Simply change the mass of any residue (or N- or C-terminus).
• Adds no additional computer processing time.
• Preprocessed or pre-calculated peptide masses are no longer valid.
Variable/Differential
• Specified residues need to be considered as unmodified or modified in all combinations.
• Search complexity increases (a lot).
• Search time increases dramatically.
• Preprocessed or pre-calculated peptide masses are no longer directly valid.
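The cost of a variable modification comes from combinatorics: a peptide with n modifiable residues has 2^n possible forms, each of which must be scored. The Python sketch below enumerates them. It uses phosphorylation of S, T, and Y (+79.96633 Da) purely as an assumed example, and the function name is hypothetical.

```python
from itertools import product

PHOSPHO_SHIFT = 79.96633  # monoisotopic mass of HPO3, used here as an example


def variable_mod_forms(peptide, residues="STY", delta=PHOSPHO_SHIFT):
    """List (modified_positions, total_mass_shift) for every combination."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    forms = []
    for states in product((False, True), repeat=len(sites)):
        modified = tuple(pos for pos, on in zip(sites, states) if on)
        forms.append((modified, delta * len(modified)))
    return forms


# Usage: two modifiable residues give four candidate forms.
for positions, shift in variable_mod_forms("ASTK"):
    print(positions, round(shift, 5))
```

A static modification, by contrast, changes one entry in the residue mass table and leaves the number of candidates unchanged.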
Separation
Why do we need separations?
• Concentrate
• Multidimensional selectivity
• Eliminate interfering substances
• Compatibility with MS instruments (time frame; ion suppression; trapping capacity)
Separation Methods Coupling to MS
Separation → Interface → MS
Separation: 2D PAGE (off line); IEF (mainly off line); LC; electrophoresis
MS: ion trap; quadrupole; TOF; FTICR; sector
2D PAGE-MS
Limitations of 2D PAGE
2D-PAGE is still a powerful separation technique but has several disadvantages:
• Restricted to proteins < 10^6 and > 10^4 Da MW
• Cannot detect proteins expressed at low levels
• Typically limited to 600~800 separate spots
• Gel-to-gel reproducibility is poor
• Quantitation is poor, ± 50% or worse
• Dynamic range is limited, < 10X
• Analysis is not directly coupled to separation
• Analysis of membrane proteins is poor
• Time-consuming process
LC Coupled to ESI
Reverse-phase LC
Decrease column size, increase sensitivity
Size        Sensitivity
2.1 cm      ---- (reference)
1.0 cm      4.4 fold
4.6 mm      21 fold
1.0 mm      441 fold
50 μm       176,400 fold
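The fold values in this table are consistent with sensitivity scaling as the inverse square of the column inner diameter, relative to the 2.1 cm column. The Python sketch below reproduces them under that assumption (same injected amount, concentration-sensitive detection such as ESI). The function name is hypothetical.

```python
def sensitivity_gain(reference_id_um, column_id_um):
    """Relative sensitivity gain, assuming gain = (d_reference / d_column)^2."""
    return (reference_id_um / column_id_um) ** 2


# Usage: inner diameters in micrometers, 2.1 cm (21000 um) as the reference.
REFERENCE_UM = 21000
for label, diameter_um in [("1.0 cm", 10000), ("4.6 mm", 4600),
                           ("1.0 mm", 1000), ("50 um", 50)]:
    print(label, round(sensitivity_gain(REFERENCE_UM, diameter_um), 1))
```

The same analyte eluting in a narrower column occupies a smaller volume, so its concentration at the detector rises with the inverse of the cross-sectional area.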
LC Coupled to ESI
Typical system (micro-column) for proteomics research:
Column dimensions (diameter x length): 15-100 μm x 10-20 cm
Packing material: 3-5 μm C18 silica
Flow rate: 100-300 nL/min
Spray needle: <30 μm
Mobile phase:
  Solvent A: 0.1% formic acid (or 0.5% acetic acid)
  Solvent B: acetonitrile with 0.1% formic acid (or 0.5% acetic acid)
Injection system: typically 1-10 μL (direct injection or autosampler injection)
Gradient: most peptides elute at 10-40% B
[Figure: packed C18 tip for μLC-MS; 50-100 μm fused silica capillary packed with C18]
LC-MS/MS Data-dependent Acquisition
[Figure: ion chromatogram; an MS scan at 27.40 min is followed by MS2 scans at 27.41, 27.41, 27.42, and 27.43 min]
Factors Limiting Efficiency of LC-MS/MS Experiments
• Multiple-peptide-per-protein redundancy (avg. 50)
• Specific precursors are selected repeatedly (in spite of dynamic exclusion)
• Only a fraction of sequencing attempts are successful
• Only a fraction of successful sequencing attempts identify differentially regulated proteins
• There is a conflict between sensitivity (small column) and sample capacity (larger column)
Factors Limiting Efficiency of LC-MS/MS Experiments
• Average peak width is 10-30 s for a single peptide on RPLC. For a 60 min gradient, about 200 peptides elute at any given moment from a 500-protein mixture.
• Protein concentrations in cells span more than 7 orders of magnitude. Low-abundance peptides are usually overwhelmed by high-abundance ones.
• Limited MS data acquisition speed; ion suppression; trapping capacity.
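The figure of roughly 200 co-eluting peptides can be reproduced with a back-of-the-envelope estimate. The Python sketch below assumes about 50 peptides per protein (the average redundancy quoted for these experiments), a 30 s peak width, and peptides spread uniformly over the gradient. These are simplifying assumptions, and the function name is hypothetical.

```python
def coeluting_peptides(n_proteins, peptides_per_protein,
                       peak_width_s, gradient_min):
    """Average number of peptides eluting at any instant.

    Assumes elution times are spread uniformly over the gradient.
    """
    total_peptides = n_proteins * peptides_per_protein
    return total_peptides * peak_width_s / (gradient_min * 60)


# Usage: 500 proteins, 60 min gradient, 30 s and 10 s peak widths.
print(coeluting_peptides(500, 50, peak_width_s=30, gradient_min=60))
print(coeluting_peptides(500, 50, peak_width_s=10, gradient_min=60))
```

Even with the narrowest peaks, dozens of peptides compete for each MS/MS acquisition slot, which motivates the additional separation dimensions that follow.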
Orthogonal Separation Methods
1st dimension x 2nd dimension
1st dimension: 1D PAGE; IEF; affinity chromatography; electrophoresis; ion exchange; size exclusion; specific targeting (e.g., chemical derivatization)
2nd dimension: RPLC; ???
Two columns in one
[Figure: a single capillary packed with IEX resin followed by RP resin]
• Principle: two different column packing materials in the same capillary
• 2D chromatography in a single operation
• Increase sample capacity?
MudPIT: Multi-dimensional Protein Identification Technology
Washburn et al, Nat Biotechnol 2001
Two-dimensional Orthogonal HPLC Separation
[Figure: sample, buffers A-D, SCX and RP phases, mass spectrometer]
MudPIT cycle
1) load sample
2) wash
3) salt step (stepwise increase conc.)
4) wash
5) RP gradient
6) re-equilibration
7) go to step (3)
Buffers
A: 5% acetonitrile/0.02% HFBA
B: 80% acetonitrile/0.02% HFBA
C: 250 mM NH4Ac/0.02% HFBA
D: 500 mM NH4Ac/0.02% HFBA
Bottom-up Proteomics II
Covered topics
• Quantitative proteomics
• Post-translational modifications
• Protein complexes
Why quantify proteins?
What do you want to learn from a quantitative proteomics experiment?
[Figure: DNA → mRNA → functional protein → biological functions, with post-translational modifications and degradation shown at the protein level]
The correlation between mRNA levels and protein expression levels is low (correlation coefficient < 0.4 overall)
How to quantify proteins?
• MALDI: ionization hot spots
• Different ionization efficiency for different peptides
• Variable ion transmission
• Competition for charges
• Point of precursor ion selection in chromatographic peak
There is a poor correlation between the amount of a peptide present and the MS and MS/MS signal intensities
Accurate Quantitation Using Isotope Dilution
1. Sample 1: incorporate stable heavy (H) isotope
2. Sample 2 (reference): incorporate stable regular (L) isotope
3. Combine samples
4. Analyze by mass spectrometer
• H/L analytes are chemically identical → identical specific signal in MS
• Ratio of H/L signals indicates ratio of analytes
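Since the heavy and light forms respond identically in the mass spectrometer, relative abundance reduces to a ratio of paired signals. The Python sketch below summarizes one protein from several peptide pairs. Using the median of the per-peptide ratios is an illustrative choice, not a method prescribed here, and the function name and intensities are hypothetical.

```python
from statistics import median


def heavy_to_light_ratio(peptide_pairs):
    """Median H/L ratio over (heavy_intensity, light_intensity) pairs.

    Pairs with a zero light signal are skipped to avoid division by zero.
    """
    ratios = [heavy / light for heavy, light in peptide_pairs if light > 0]
    if not ratios:
        raise ValueError("no usable peptide pairs")
    return median(ratios)


# Usage: three peptide pairs from the same protein (made-up intensities).
pairs = [(200.0, 100.0), (410.0, 200.0), (95.0, 50.0)]
print(heavy_to_light_ratio(pairs))
```

Summarizing over several peptides dampens the effect of a single interfered or poorly integrated peak.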
Stable Isotope Labeling Strategies
• Metabolic stable isotope labeling
• Isotope tagging by chemical reaction
• Stable isotope incorporation via enzyme reaction
[Figure: the three workflows (labeling and digestion steps), each ending in an intensity vs. m/z spectrum]
Stable Isotope Labeling Strategies
Metabolic stable isotope labeling
[Figure: workflow with digestion step and intensity vs. m/z spectrum]
Prototypical applications:
• Zhou et al, RCMS (2002)
• Mann et al, Mol Cell Prot (2003)
• Veenstra et al, JASMS (2000)
Stable Isotope Tagging by Metabolic Labeling
• 15N-enriched media (ammonium sulfate-15N for yeast culture)
• Amino acids (Lys-13C, Arg-13C) for mammalian cell lines
Stable Isotope Tagging by Metabolic Labeling
Strengths
• No chemical reactions
• Potentially all peptides labeled
• Simple labeling protocols
• Quantitative labeling, no side reactions
Weaknesses
• Compatible with selected species and samples only
• No inherent sample enrichment
• Labeling potentially perturbs the biological system
• Label potentially metabolized
• Mass difference between sequence-identical peptides can vary
SILAC: Stable isotope labeling with amino acids in cell culture (Lys-13C, Arg-13C)
Stable Isotope Labeling Strategies
Stable isotope incorporation via enzyme reaction
[Figure: workflow with digestion step and intensity vs. m/z spectrum]
Prototypical applications:
• Stewart et al, Rapid Comm in Mass Spec (2001)
• Reynolds KJ et al, J Prot Res (2002)
• Shevchenko A et al, Rapid Comm in Mass Spec (1997)
• Schnolzer M, Electrophoresis (1996)
Stable Isotope Incorporation via Enzyme Reaction
Enzyme in H₂O or H₂¹⁸O
Strengths
• General
• Compatible with any source of protein
• Constant mass shift
Weaknesses
• Minimal mass difference
• Side reactions
Stable Isotope Labeling Strategies
Isotope tagging by chemical reaction
[Figure: workflow with digestion and labeling steps and intensity vs. m/z spectrum]
Prototypical application:
• Isotope coded affinity tags (Gygi et al, 1999; Zhou et al, 2002)
Isotope Coded Affinity Tags (ICAT)
ICAT reagents: affinity group, labeled linker, reactive group
• Heavy reagent: d8-ICAT (X = deuterium)
• Light reagent: d0-ICAT (X = hydrogen)
Gygi et al, Nat Biotech, 1999
Stable Isotope Tagging by Chemical Reaction
Strengths
• Compatible with any protein source
• Selective tagging reduces sample complexity
• Different specificities can be designed into the reagent
• Constant mass difference
• Potentially multiplexed
Weaknesses
• Cys-specific reagents miss cysteine-free proteins
• Chemical reactions required
• Each specificity requires a different reagent
• Tag might interfere with MS/MS
• Potential for side reactions and incomplete reactions
• Potential chromatographic isotope effect
Fractionations currently used: sub-cellular fractionation; immunoprecipitation; ion-exchange; reversed-phase HPLC; isoelectric focusing
[Figure: ICAT workflow. Mixture 1 (light) and mixture 2 (heavy) are labeled at cysteines, combined and proteolyzed, then fractionated and affinity enriched. MS analysis of the light/heavy pair (m/z 550-580) gives quantification; MS/MS (m/z 200-800) gives identification of NH2-EACDPLR-COOH.]
Human myeloid leukemia (HL-60) cells
• Well-characterized in vitro model for cell differentiation
+/- phorbol 12-myristate 13-acetate (PMA)
• Induces morphological changes
• Cells become more adherent
Expect to see changes in the cell-surface protein profile
Han et al (2001) Nat Biotech 19:946-951
[Figure: chromatogram traces (214 nm, 280 nm, pressure, gradient)]
Sample:
• HL-60 microsomal fraction
• Biotinylated Cys residues
• Combined samples (d0 & d8)
• Tryptic digest
Extensive fractionation:
1. Cation exchange (SCX)
2. Affinity chromatography (avidin)
3. Capillary reverse phase
* ICAT-labeled peptides → MS2 (12 peptides total); circled MS1 peaks are shown on the next slide
Peptide sequencing by MS/MS →
• CD45 identified: transmembrane Tyr phosphatase
• ATC2 identified: calcium pump
Calculate the ratio of light:heavy from MS1: CD45 = 1:0.7 and ATC2 = 1:1.2
LC retention time differs by 4 s
d0:d8 ratio = 1:0.77 +/- 0.05
491 proteins identified from the microsomal fraction of HL-60 cells
• n range: 2-36
• d0:d8 range: 0.05-11.45
• Total analysis time = 50 h
Unchanged abundance: ribosomal proteins, cytoskeletal proteins, metabolic enzymes, cell-surface receptors, channel proteins
Changed abundance: membrane-associated signal transduction proteins, e.g., farnesyl-diphosphate farnesyl transferase (20-fold reduction)
[Figure: proteins down-regulated and up-regulated with PMA treatment]
[Figure: one SCX fraction, 100 min]
1025 proteins
Need 2 peptides from the same protein to confirm
Consequence: Identification of many uninteresting proteins
Quantitative analysis of androgen-regulated microsomal proteins from LNCaP prostate epithelia
[Figure: number of proteins identified vs. log d0/d8]
Up to ~90% of identified proteins show unchanged abundance
1. Isotope tag mixture 1 (light) and mixture 2 (heavy)
2. Combine and digest
3. Affinity purification
4. Fractionation on MALDI sample plate for MS
5. Identify differentially expressed peptides for MS/MS; ignore peptides with unchanged abundance
Quantitative Analysis Based on Tandem Mass Spectrometry
iTRAQ™ (Applied Biosystems)
Ross et al, MCP, 3(12), 1154
[Figure: MS and MS/MS spectra of TPHPALTEAK labeled with the 114/115/116/117 reagents (1:1:1:1)]
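With isobaric tags the differently labeled peptides are indistinguishable in the MS scan, and relative quantitation is read from the reporter ions (114, 115, 116, 117) in the MS/MS spectrum. The Python sketch below expresses each channel relative to a chosen reference channel. The function name and the intensities are hypothetical, and isotope-impurity correction is deliberately left out.

```python
def reporter_ratios(intensities, reference_channel=114):
    """Ratio of each reporter-ion intensity to the reference channel."""
    reference = intensities[reference_channel]
    if reference <= 0:
        raise ValueError("reference channel has no signal")
    return {channel: value / reference
            for channel, value in intensities.items()}


# Usage: a 1:1:1:1 mixture should give ratios close to 1 in every channel.
equal_mix = {114: 1000.0, 115: 1000.0, 116: 1000.0, 117: 1000.0}
print(reporter_ratios(equal_mix))
```

Because every MS/MS spectrum carries its own set of reporter ions, each identified peptide yields a quantitative measurement at the same time.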
Label-free Quantitation
Based on peak intensity:
1. Peak detection
2. Peak matching
3. Peak normalization
4. Area/height measurement
5. Statistical evaluation
Based on spectral counting:
1. Number of spectra correlates with protein abundance
2. Normalization
3. Statistical evaluation
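The spectral-counting route can be sketched in Python as follows. Dividing each protein's count by the total number of identified spectra in the run is one simple normalization, chosen here for illustration. The function name and the counts are hypothetical, and the statistical evaluation step is not shown.

```python
def normalized_spectral_counts(counts):
    """Fraction of all identified spectra assigned to each protein."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no spectra were identified")
    return {protein: n / total for protein, n in counts.items()}


# Usage: compare one protein across two runs with different total counts.
run_1 = normalized_spectral_counts({"protein_A": 30, "protein_B": 10})
run_2 = normalized_spectral_counts({"protein_A": 30, "protein_B": 90})
print(run_1["protein_A"], run_2["protein_A"])
```

Normalization matters because the raw count for a protein depends on how many spectra the run produced overall, not only on the protein's abundance.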
Protein Post-Translational Modifications (PTMs)
Histone modification:
• Highly dynamic
• Low abundance
• Low ionization efficiency
• Poor fragmentation
[Figure: protein ⇄ phosphorylated protein (P), via kinase and phosphatase]
Case Studies with Phosphoproteomics
A typical workflow for phosphoproteomics
[Figure: workflow with the steps sampling, digestion, labeling, fractionation, enrichment, separation, MS analysis, identification, quantitation, and annotation]
Enrichment of phosphopeptides
• Metal oxide: TiO2; ZrO2
• Immobilized metal ion affinity chromatography (IMAC): Fe(III); Ga(III)
• PolyMAC: polymer-based metal ion affinity capture
[Figure: phosphopeptides (P) captured from a mixture of peptides by Ti on solid-phase beads or on PolyMAC]
Identification of Post-translational Modifications
Static
• Simply change the mass of any residue (or N- or C-terminus).
• Adds no additional computer processing time.
• Preprocessed or pre-calculated peptide masses are no longer valid.
Variable/Differential
• Specified residues need to be considered as unmodified or modified in all combinations.
• Search complexity increases (a lot).
• Search time increases dramatically.
• Preprocessed or pre-calculated peptide masses are no longer directly valid.
Identification of protein-protein interactions
• Immunopurified sample → proteolysis and separation → μLC-MS & MS/MS
• Immunopurified control → proteolysis and separation → μLC-MS & MS/MS
[Figure: Venn diagram of proteins identified in control and sample; values shown: 72, 120, 15]
Advantage: One of the most sensitive and universal methods to identify interacting partners
Disadvantage: Difficult to remove contaminants