Protein Identification via Database searching
Attila Kertész-Farkas
Protein Structure and Bioinformatics Group, ICGEB, Trieste
Mass Spectra analysis
Biological sample
Results report
Computational analysis of MS/MS
• Two approaches, plus a hybrid of the two:
  – De novo sequencing
  – Database searching
  – Hybrid
De novo sequencing
• Advantages:
  – Can identify new peptides and proteins
  – Able to discover (new) PTMs
  – Independent of protein databases
• Disadvantages:
  – Requires MS/MS data of good quality
  – No statistics-based validation
Database searching-based MS/MS tandem mass spectra identification
• Pipeline: Input data → Peptide assignment → Validation → Protein inference → Quantitation → Interpretation
• Topics covered for each step:
  – Input data: data formats
  – Peptide identification: database searching
  – Validation: statistical methods for validation
  – Protein inference: protein assembling
Input data
• Mass spectrum:
  – Histogram of the mass-to-charge ratio (m/z) of the observed fragment ions.
  – Spectrum normalization: the intensity is usually scaled to the [0, 100] interval.
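The normalization step could be sketched as follows. This is a minimal illustration that assumes scaling by the most intense peak; the function name and the sample values are hypothetical, and other conventions exist.

```python
def normalize_intensities(intensities, top=100.0):
    """Scale intensities so that the most intense peak equals `top`."""
    peak_max = max(intensities, default=0.0)
    if peak_max <= 0:
        # Empty or all-zero spectrum: nothing to scale.
        return [0.0 for _ in intensities]
    return [top * value / peak_max for value in intensities]


raw = [50.0, 200.0, 20.0]          # hypothetical raw intensities
print(normalize_intensities(raw))  # [25.0, 100.0, 10.0]
```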
• The most common formats are mzXML, MGF and DAT.
MGF file format
.mzXML
Peptide assignment
[Figure: an experimental spectrum (m/z vs. intensity, %) is compared with the theoretical spectrum (b1 to b10 and y1 to y10 ions) of each candidate peptide taken from the protein sequence DB.]
Example entries of the protein sequence DB:
>IPI:IPI00000044.1|SWISS-PROT:P01127
MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA
>IPI:IPI00000045.1|SWISS-PROT:P18510-1
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE
Each candidate peptide is scored against the experimental spectrum, and the list of scores is kept sorted as the search proceeds.
Scores for the first experimental spectrum (candidate. score): 15. 32, 3. 4, 14. 3, 1. 2, 2. 2, 7. 2, 2. 1, 4. 1, 9. 1, 12. 1
Best match: Score: 32, Peptide: SHLITLLLFLFHSETICR
[Figure: the same comparison is repeated for the second experimental spectrum.]
Scores for the second experimental spectrum (candidate. score): 13. 4, 6. 4, 1. 4, 9. 3, 4. 3, 3. 2, 7. 2, 11. 2, 8. 1, 10. 1, 2. 1, 5. 1, 12. 1
Best match: Score: 4, Peptide: AELDLNMTR
Scores for the third experimental spectrum (candidate. score): 11. 3, 6. 3, 9. 3, 3. 3, 1. 3, 4. 2, 7. 2, 13. 2, 1. 1, 10. 1
Best match: Score: 3, Peptide: MEICRGLR
Scores for the fourth experimental spectrum (candidate. score): 13. 15, 6. 4, 1. 4, 9. 3, 4. 3, 3. 2, 7. 2, 11. 2, 8. 1, 10. 1, 2. 1, 5. 1, 12. 1
Best match: Score: 15, Peptide: LLHGDPGEEDK
[Figure: the database search diagram again, with its two components numbered.]
The two components of the search are discussed in turn:
1. Spectra comparison
2. Protein sequence DB
1. Spectra comparison: Shared Peak Count (SPC)
This is the number of peaks in the theoretical spectrum that are matched to peaks in the experimental spectrum.
[Figure: experimental spectrum (intensity axis from 0% to 100%) drawn against the theoretical spectrum (axis from 0 to 1).]
SPC = 7
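A shared peak count along these lines could be sketched as below. The matching tolerance `tol`, the function name and the peak lists are assumptions made for illustration; the slides do not state how close two peaks must be to count as matched.

```python
def shared_peak_count(theoretical_mz, experimental_mz, tol=0.5):
    """Count theoretical peaks with at least one experimental peak within `tol`."""
    return sum(
        any(abs(t - e) <= tol for e in experimental_mz)
        for t in theoretical_mz
    )


theoretical = [100.0, 200.0, 300.0, 400.0]   # hypothetical m/z values
experimental = [100.2, 250.0, 299.7, 410.0]  # hypothetical m/z values
print(shared_peak_count(theoretical, experimental))  # 2
```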
Inner product (I)
This is the sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.
[Figure: experimental spectrum (intensity axis from 0% to 100%) drawn against the theoretical spectrum (axis from 0 to 1).]
I = 3.5
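The inner product as defined here might be computed as in the following sketch. It assumes that experimental peaks are given as (m/z, intensity) pairs with intensities scaled to [0, 1], and that a peak matches when it lies within a hypothetical tolerance `tol` of a theoretical peak.

```python
def inner_product(theoretical_mz, experimental_peaks, tol=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak."""
    return sum(
        intensity
        for mz, intensity in experimental_peaks
        if any(abs(mz - t) <= tol for t in theoretical_mz)
    )


theoretical = [100.0, 300.0]                             # hypothetical m/z values
experimental = [(100.2, 0.5), (250.0, 0.9), (299.7, 1.0)]  # hypothetical peaks
print(inner_product(theoretical, experimental))          # 1.5
```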
Hyperscore: H = I * Nb! * Ny!
- I is the sum of the intensities of the matched peaks
- Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum
- ! is the factorial function
[Figure: the peaks of the theoretical spectrum labeled as b or y ions.]
H = 3.2 * 3! * 4! = 3.2 * 6 * 24 = 460.8
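The worked example can be reproduced with a few lines of code. The function and argument names below are illustrative only; the formula is the one given on the slide.

```python
import math


def hyperscore(matched_intensity, n_b, n_y):
    """H = I * Nb! * Ny! for matched intensity I and matched b/y ion counts."""
    return matched_intensity * math.factorial(n_b) * math.factorial(n_y)


h = hyperscore(3.2, 3, 4)
print(round(h, 1))  # 460.8
```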
Xcorr
$Xcorr(q,t) = I(q,t) - \frac{1}{151} \sum_{i=-75}^{75} I(q,t[i])$
q is the query spectrum, t is the theoretical spectrum, and t[i] is the theoretical spectrum shifted by i.
[Figure: the theoretical spectrum is slid along the query spectrum, one shift at a time.]
Example: I(q,t) = 3.2. The shifted inner products I(q,t[-75]), ..., I(q,t[-32]), ..., I(q,t[0]), ..., I(q,t[32]), ... are computed in the same way, and so on.
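A direct, unoptimized reading of the Xcorr formula is sketched below. It assumes both spectra have already been binned into equal-length intensity vectors (the binning itself is not described on the slides), and that bins shifted outside the vector contribute zero. Function names are hypothetical.

```python
def shifted_inner_product(q, t, shift):
    """Inner product of q with t shifted by `shift` bins; out-of-range bins are 0."""
    total = 0.0
    for j, q_val in enumerate(q):
        k = j - shift
        if 0 <= k < len(t):
            total += q_val * t[k]
    return total


def xcorr(q, t, max_shift=75):
    """I(q,t) minus the mean of I(q,t[i]) over i = -max_shift..max_shift."""
    background = sum(
        shifted_inner_product(q, t, i) for i in range(-max_shift, max_shift + 1)
    )
    return shifted_inner_product(q, t, 0) - background / (2 * max_shift + 1)


q = [0.0, 1.0, 0.0, 0.5, 0.0]  # hypothetical binned query spectrum
t = [0.0, 1.0, 0.0, 1.0, 0.0]  # hypothetical binned theoretical spectrum
print(xcorr(q, t, max_shift=1))
```

With `max_shift=75` the mean is taken over 151 shifts, as in the formula. Evaluating every shift for every candidate is what makes this score slow.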
2. Protein sequence DB: Protein Sequence Databases
– Completeness:
  • Complete
  • Longer searching time
– Redundancy:
  • Sequence variations can be found
  • A redundant database can distort the statistics
– Quality of sequence annotation
• Entrez Protein DB
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
  – Most complete, redundant
• Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL)
  – http://www.ncbi.nlm.nih.gov/RefSeq/
  – http://www.uniprot.org/
  – Well annotated, non-redundant
• International Protein Index (IPI)
  – http://www.ebi.ac.uk/IPI/IPIhelp.html
  – Represents a good balance between redundancy and completeness.
  – Contains cross-references to Ensembl, UniProt and RefSeq.
• Sequences from a single genome
  – Difficult to obtain good statistics on small datasets.
• Taxonomy
  – Allows searches to be limited to entries from particular species or groups of species.
  – Speeds up a search, and ensures that the hit list will only contain entries from the selected species.
  – For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries, and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database, and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.
Run time
• Database search has to enumerate all peptides and compare them to all experimental spectra.
• This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.
Speedup techniques
• Fast database indexing
  – Fast implementation of sequence indexing in the database
• Parent mass check
  – PTMs can be lost
• SEQUEST's preliminary score
• Tag-based filtering (de novo hybrid)
  – Increases the specificity (or sensitivity)
• Advanced database indexing
  – Better implementation of the sequence indexing
  – Better representation of protein sequences
Parent mass check
[Figure: the database search diagram, with a parent mass filter placed before the spectra comparison.]
Candidates are filtered by the parent mass difference $|PM(q) - PM(t)|$, where PM(q) and PM(t) are the parent masses of the query spectrum q and of the candidate peptide t.
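A parent mass filter of this kind could look like the sketch below. It assumes monoisotopic residue masses for unmodified amino acids, a neutral peptide mass (residues plus one water), and a hypothetical tolerance of 1 Da; none of these choices is specified on the slides.

```python
# Monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def parent_mass_filter(query_mass, candidates, tol=1.0):
    """Keep candidates t with |PM(q) - PM(t)| <= tol (tol is an assumed value)."""
    return [p for p in candidates if abs(query_mass - peptide_mass(p)) <= tol]


candidates = ["AELDLNMTR", "LLHGDPGEEDK", "MEICRGLR"]
query_mass = peptide_mass("AELDLNMTR") + 0.3  # hypothetical measured parent mass
print(parent_mass_filter(query_mass, candidates))  # ['AELDLNMTR']
```

Only the surviving candidates need to be scored. A peptide carrying a modification has a shifted mass and is discarded by such a filter, which is why the slides note that PTMs can be lost.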
Fast prescoring (used in SEQUEST)
The so-called Sp score:
$Sp(q,t) = \frac{I(q,t) \cdot SPC(q,t) \cdot (1 + 0.0075 \cdot R(q,t))}{SPC(t,t)}$
R(q,t) is the maximum number of consecutive matched b-y ions.
Sp = 3.2 * 7 * (1 + 0.0075 * 4) / 10 = 2.3072
SEQUEST selects the top 500 peptides ranked by Sp and rescores them using Xcorr.
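The prescoring step can be illustrated as follows. The formula and the cutoff of 500 come from the slide; the function names and the toy candidate list are hypothetical, and this is only a sketch of the idea, not SEQUEST's actual code.

```python
import heapq


def sp_score(matched_intensity, spc, max_consecutive, n_theoretical):
    """Sp = I * SPC * (1 + 0.0075 * R) / SPC(t,t)."""
    return matched_intensity * spc * (1 + 0.0075 * max_consecutive) / n_theoretical


def top_candidates(scored, n=500):
    """Return the n best (peptide, Sp) pairs, highest Sp first."""
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


print(round(sp_score(3.2, 7, 4, 10), 4))  # 2.3072

scored = [("PEPTIDEA", 1.0), ("PEPTIDEB", 3.0), ("PEPTIDEC", 2.0)]  # toy values
print(top_candidates(scored, n=2))
```

Only the peptides returned by `top_candidates` would then be passed to the slower Xcorr scoring.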
Sequence tag based filtering
• Extract short amino acid tags from the experimental spectra.
• Use a spectrum graph, where the nodes are the peaks, and masses which differ by the mass of an amino acid are linked by an edge.
[Figure: spectrum graph whose edges are labeled with amino acids; paths through the graph spell out the tags listed below.]
TAG Prefix Mass
AVG 0.0
WTD 120.2
PET 211.4
• Generates short peptide sequence tags from the spectrum, and uses these tags to filter the protein sequence database.
• Tags make database search much faster, analogous to the way that BLAST's filter speeds up sequence search.
Tag-based filtering
[Figure: the list of candidate peptides before and after filtering by tags.]
MDHPEDESHSEK
QDDEEALARLEEIK
SIEAKLTLR
QNNLNPERPDSAYLR
LKQINEEQREGLR
FVSEAVTAICEAK
SSDIQAAVQICSLLHQR
EFSASLTQGLLK
SAEDLEADK
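The filtering idea can be shown with a deliberately simplified sketch: a candidate peptide is kept only if it contains at least one of the extracted tags. The tags used here are hypothetical, and the sketch ignores the prefix mass that accompanies each tag, which a real filter would also check.

```python
def filter_by_tags(peptides, tags):
    """Keep peptides that contain at least one tag as a substring."""
    return [p for p in peptides if any(tag in p for tag in tags)]


peptides = [
    "MDHPEDESHSEK", "QDDEEALARLEEIK", "SIEAKLTLR",
    "QNNLNPERPDSAYLR", "LKQINEEQREGLR", "FVSEAVTAICEAK",
    "SSDIQAAVQICSLLHQR", "EFSASLTQGLLK", "SAEDLEADK",
]
tags = ["EAV", "LEE"]  # hypothetical tags, not taken from the slides
print(filter_by_tags(peptides, tags))  # ['QDDEEALARLEEIK', 'FVSEAVTAICEAK']
```

Only the two surviving peptides would need a full spectrum comparison, which is where the speedup comes from.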
Summary
• Experimental spectra are compared to a protein sequence database.
• Scoring function
• Protein database
• Speedup techniques
Validation
Peptide assignments produced by the database search, one per experimental spectrum:
Score: 4   Peptide: AELDLNMTR
Score: 32  Peptide: SHLITLLLFLFHSETICR
Score: 3   Peptide: MEICRGLR
Score: 15  Peptide: LLHGDPGEEDK
Score: 4   Peptide: MDHPEDESHSEK
Score: 5   Peptide: SAEDLEADK
Score: 3   Peptide: SIEAKLTLR
How can peptide assignments be approved or rejected automatically?
Why is it necessary to do it automatically?
• Human judgment is biased and can be unreliable.
• Millions of spectra are produced per day.
• Judging a match by looking at the spectrum visually is very difficult.
Two computational approaches:
• Relative score
• Probability-based scoring
Relative score: SEQUEST delta score
$Cn = \frac{s_1 - s_2}{s_1}$
where s_1 is the best score and s_2 the second-best score obtained for a spectrum.
Cn = (32-4)/32 = 0.875 (SHLITLLLFLFHSETICR)
Cn = (4-4)/4 = 0 (AELDLNMTR)
Cn = (3-3)/3 = 0 (MEICRGLR)
Cn = (15-4)/15 = 0.733 (LLHGDPGEEDK)
Keep the peptide assignments whose Cn exceeds a certain limit.
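The delta score examples can be checked with a short sketch. The score lists are the top scores read off the slides; the threshold of 0.1 is a made-up value standing in for the "certain limit", and the function name is hypothetical.

```python
def delta_cn(scores):
    """Relative gap between the best and second-best score: (s1 - s2) / s1."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    s1, s2 = sorted(scores, reverse=True)[:2]
    return (s1 - s2) / s1


top_scores = {
    "SHLITLLLFLFHSETICR": [32, 4, 3, 2],
    "AELDLNMTR": [4, 4, 4, 3],
    "MEICRGLR": [3, 3, 3, 3],
    "LLHGDPGEEDK": [15, 4, 4, 3],
}
LIMIT = 0.1  # hypothetical threshold
for peptide, scores in top_scores.items():
    cn = delta_cn(scores)
    print(peptide, round(cn, 3), "keep" if cn > LIMIT else "reject")
```

Note that an assignment can have a low absolute score and still be rejected only because a second candidate scores equally well, as happens for AELDLNMTR and MEICRGLR.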
Probability-based peptide assignment validation:
Compute the statistical significance of the score. The statistical significance of a score s is the probability of observing a random score x that is higher than or equal to s, formally P(s <= x). This probability is called the p-value.
Three approaches:
1. Using analytical functions.
2. Fitting a distribution to a sample of random scores.
3. Non-parametric approach.
Compute the probability that the peptide assignment with the corresponding score is correct.
Probability-based peptide assignment validation:
The probability-based approach measures, very loosely speaking, how far the score is from random.
A random score is a score obtained by comparing a randomly selected experimental spectrum with a randomly selected theoretical spectrum. The random score has a probability density distribution, which depends on the scoring function and serves as the null hypothesis.
[Figure: probability distribution of random scores (frequency vs. score), marking T, a hit h and the p-value of hit h.]
Random matches are caused by matches with noise.
Probability-based peptide assignment validation:
1. Analytical function. It depends on the scoring function, and its parameters are calculated from the spectra to be compared.
  – In the case of the SPC scoring function, the distribution of the random scores can be modeled with a hypergeometric distribution.
  – In the case of the inner product scoring function, the random scores can be modeled with a normal distribution.
[Figure: probability distribution of random scores (frequency vs. score), marking T, a hit h and the p-value of hit h.]
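For the SPC case, a hypergeometric p-value might be computed as sketched below. The parameterization is an assumption made for illustration: the m/z range is divided into `n_bins` bins, `n_occupied` of them contain an experimental peak, and the `n_theoretical` theoretical peaks are treated as random draws without replacement. The slides do not say how these parameters are derived from the spectra.

```python
from math import comb


def spc_p_value(k, n_theoretical, n_occupied, n_bins):
    """P(X >= k) when X follows a hypergeometric distribution."""
    total = comb(n_bins, n_theoretical)
    upper = min(n_theoretical, n_occupied)
    tail = sum(
        comb(n_occupied, i) * comb(n_bins - n_occupied, n_theoretical - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Hypothetical setting: 7 of 10 theoretical peaks matched,
# 50 of 1000 bins occupied by experimental peaks.
print(spc_p_value(7, 10, 50, 1000))
```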
Probability-based approach:
Build a histogram of the scores that were obtained during the comparison. Fit a known distribution function to it, and use the fitted function to calculate the p-value of the top score.
[Figure: histogram of the match scores (Match, 0 to 25) against frequency (0 to 0.3).]
Probability-based approach: decoy approach
Make a dummy (decoy) dataset, big enough to obtain solid statistics. A decoy dataset can be made by:
1. Random shuffling.
2. Markov-chain generated amino acid sequences.
3. More typically, simply reversing the sequences of the proteins in the database. This is sometimes called a reverse database.
No correct matches are expected from the decoy dataset, so the scores obtained on the decoy dataset give a good estimate of the random score distribution.
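Two of the decoy constructions listed above, reversal and shuffling, can be sketched in a few lines. The function names, the `Decoy_` prefix and the seed are illustrative choices, and the toy entry is only the first few residues of a database sequence.

```python
import random


def reverse_decoys(proteins, prefix="Decoy_"):
    """Build a reverse database: every sequence is written backwards."""
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}


def shuffled_decoys(proteins, prefix="Decoy_", seed=0):
    """Build a decoy database by randomly shuffling the residues of each sequence."""
    rng = random.Random(seed)
    decoys = {}
    for name, seq in proteins.items():
        residues = list(seq)
        rng.shuffle(residues)
        decoys[prefix + name] = "".join(residues)
    return decoys


targets = {"P1": "MEICRGLR"}  # toy entry
print(reverse_decoys(targets))  # {'Decoy_P1': 'RLGRCIEM'}
```

Both constructions preserve the length and amino acid composition of each protein, so the decoy database is comparable in size to the target database.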
[Figure: two mass spectra (intensity (%) vs. m/z); the second spectrum is annotated with the fragment ions b1 to b10 and y1 to y10.]
Spectra comparison:
Protein sequence DB
Input data: Experimental Spectra
>IPI:IPI00000045.1|SWISS-PROT:P18510-1
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE
Decoy Protein sequence DB
[Figure: two mass spectra (intensity (%) vs. m/z); the first spectrum is annotated with the fragment ions b1 to b10 and y1 to y10.]
Spectra comparison:
Protein sequence DB
Input data: Experimental Spectra
>Decoy_protein_sequence_1
EDEQFYFKTVMVGEDPMNTRLSVPQDAEMATCLFWGPCAASEFSTTPGSDSRIFAFRKDQKRNESLDTINVAELQLRTEDGSKVCSLCMKGGHIGLFLAHPEIPVVDIKEELNVNPGQLYGAVLQNNRLYFTKQNVDWIRFAQMKSSKRGSPRCITESHFLFLLLTILHSRLGRCIEM
Decoy Protein sequence DB
Decoy Scores (ID. score): 5. 4, 3. 4, 4. 4, 10. 3, 8. 3, 7. 3, 2. 2, 6. 2, 1. 2, 12. 1, 9. 1, 11. 1
[Figure: histogram of match scores; x axis: match (0 to 25), y axis: frequency (0 to 0.3).]
The decoy approach can provide a more accurate model of the random score distribution, but it doubles the execution time.
It is a frequently applied approach!
Non-parametric approach. Instead of fitting a probability density function to the histogram, calculate the fraction of the scores on the decoy dataset that are equal to or higher than the actual top score:
$p = \frac{\#\{\text{decoy score} \ge 15\}}{\#\{\text{all scores}\}} = 0.0$
[Figure: probability distributions of random scores and of correct scores (frequency vs. score), with the threshold T, the hit score h, the areas A and B, and the p-value of hit h marked.]
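The non-parametric p-value is a simple counting exercise, sketched below with the decoy scores and the top score (15) listed on the slides. The function name is hypothetical.

```python
def empirical_p_value(decoy_scores, top_score):
    """Fraction of decoy scores that are equal to or higher than top_score."""
    n_at_least = sum(1 for s in decoy_scores if s >= top_score)
    return n_at_least / len(decoy_scores)

decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(empirical_p_value(decoy_scores, 15))  # 0.0: no decoy score reaches 15
```

The resolution of this estimate is limited by the number of decoy scores: with 12 decoy scores, the smallest non-zero p-value that can be reported is 1/12.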
False Positive Rate (FPR): the probability of labelling a random score significant (area B in the figure). An FPR of 0.01 means that 1% of the random scores are labelled significant.
E-value: the E-value of a query hit with score h is the expected number of database elements with a random score greater than or equal to h in a database of n entries. For instance, an E-value of 10^-2 means that the score h is expected to occur by chance only once in 100 independent similarity searches over the database. If the E-value is 10, then ten random hits with a score greater than or equal to h are expected within a single similarity search.
[Figure: probability distributions of random scores and of correct scores (frequency vs. score), with the threshold T, the hit score h, the areas A and B, and the p-value of hit h marked.]
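Both quantities can be sketched from a set of random (decoy) scores. The E-value formula used here, database size times p-value, is a common approximation and an assumption of this sketch; the slides do not state a formula. A score is taken to be significant when it is greater than or equal to the threshold. Function names are hypothetical.

```python
def false_positive_rate(random_scores, threshold):
    """Fraction of random scores labelled significant (score >= threshold)."""
    n_significant = sum(1 for s in random_scores if s >= threshold)
    return n_significant / len(random_scores)

def e_value(p_value, db_size):
    """Expected number of random hits at least as good, assuming E = n * p."""
    return db_size * p_value

decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(false_positive_rate(decoy_scores, 4))  # 3 of 12 decoy scores reach 4
print(e_value(1e-4, 100))
```

Unlike the p-value, the E-value grows with the size of the database, so the same score is less convincing when more candidates are searched.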
False Discovery Rate (FDR): the ratio of random scores within the significant scores, formally FDR = B/(A+B), where A and B are the areas above the threshold T under the distributions of correct and random scores, respectively. An FDR of 0.01 means that 1% of the scores labelled significant are actually observed by chance. The FDR is often used to control the ratio of false positives. The threshold T can be set to keep the FDR under a certain level; typical levels are 0.01 or 0.05, i.e. experimenters set thresholds to allow 1% or 5% false positives. The lower the FDR, the more true (non-random) similarity hits are lost. The decoy dataset is used to calculate the FDR.
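A decoy-based FDR estimate can be sketched as below. The estimator (number of decoy scores above T divided by the number of target scores above T) is the usual target-decoy estimate and is an assumption here, since the slides only say that the decoy dataset is used. The scores are the ones listed on the slides; function names are hypothetical.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR at a threshold: decoy hits / target hits (capped at 1)."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0                       # nothing is labelled significant
    return min(1.0, n_decoy / n_target)

def lowest_threshold(target_scores, decoy_scores, max_fdr):
    """Lowest observed target score whose estimated FDR is at most max_fdr."""
    for t in sorted(set(target_scores)):
        if decoy_fdr(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None

target_scores = [15, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
print(lowest_threshold(target_scores, decoy_scores, 0.01))  # 15
```

In this toy example only the top hit survives a 1% FDR, because every lower score is matched just as often in the decoy dataset.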
Summary:
1. Peptide assignment has to be validated.
2. Relative scoring or probability-based scoring can be applied.
3. False positives (false assignments) can be kept under a certain level.
Protein Inference
Take the peptides that passed the validation.
This section is about inferring the proteins that could have produced these peptides. The task is not trivial.
Input data: Experimental Spectra
Score: 32, Peptide: SHLITLLLFLFHSETICR
Score: 15, Peptide: LLHGDPGEEDK
Peptides:
MDHPEDESHSEK
QDDEEALARLEEIK
SIETLR
QNNLNPERPDSAYLR
LKQINEEQREGLR
FVSEAVTAICEAK
SSDIQAAVQICSLLHQR
EFSASLTQGLLK
SAEDLEADK
Proteins:
By Occam’s razor, Protein A should be preferred. Proteins A, B and C can be homologous proteins.
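The Occam's razor argument can be illustrated with a greedy parsimony sketch that keeps picking the protein explaining the most still-unexplained peptides. This is not ProteinProphet or any published method; the peptide-to-protein mapping below is hypothetical (it reuses three peptides from the list above), and the function name is made up for illustration.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: a small set of proteins explaining all peptides."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Sorting the names makes ties resolve deterministically.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Hypothetical mapping: Protein A alone explains every peptide.
mapping = {
    "Protein A": {"SIETLR", "EFSASLTQGLLK", "SAEDLEADK"},
    "Protein B": {"SIETLR", "EFSASLTQGLLK"},
    "Protein C": {"SAEDLEADK"},
}
print(parsimonious_proteins(mapping))  # ['Protein A']
```

A greedy cover is not guaranteed to be the smallest possible set, and it cannot tell apart homologous proteins that share all of their identified peptides.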
Many models have been developed to cope with this problem: statistical, graph theory based and spectral network based. A well-known method is ProteinProphet.
Summary
Input data: Data formats
Peptide identification: Database searching
Validation: Statistical methods for validations
Protein inference: Protein assembling
Quantitation
Interpretation
Database Searching
• Advantages:
– Simple and straightforward
– Has a limited search space.
– Completeness
– Statistical analysis can be carried out.
• Disadvantages:
– Has a limited search space. Limited to the database.
– Enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered. (A modern instrument produces a million spectra per day.)