ch08 massspec

www.bioalgorithms.info An Introduction to Bioinformatics Algorithms Protein Sequencing and Identification by Mass Spectrometry

Upload: bioinformaticsinstitute

Post on 25-Dec-2014


Page 1: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Protein Sequencing and Identification by Mass Spectrometry

Page 2: Ch08 massspec

Outline

• Tandem Mass Spectrometry

• De Novo Peptide Sequencing

• Spectrum Graph

• Protein Identification via Database Search

• Identifying Post Translationally Modified Peptides

• Spectral Convolution

• Spectral Alignment

Page 3: Ch08 massspec

Different Amino Acids Have Different Masses

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

R_{i-1} R_i R_{i+1}

AA residue_{i-1} AA residue_i AA residue_{i+1}

N-terminus C-terminus

Page 4: Ch08 massspec

Peptide Fragmentation

• Peptides tend to fragment along the backbone.

• A mass spectrometer is a sophisticated (and rather expensive!) scale that measures the masses of these fragments.

H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH

R_{i-1} R_i R_{i+1}

H+

Prefix Fragment Suffix Fragment

Collision Induced Dissociation

Page 5: Ch08 massspec

Breaking Protein into Peptides and Peptides into Fragment Ions

• Most mass spectrometers can only measure masses of short peptides (e.g., 20 amino acids or less) rather than masses of entire proteins (usually hundreds of amino acids). That's why:

• Proteases, e.g. trypsin, break the protein into short peptides.

• A Tandem Mass Spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

• The Mass Spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones.

• The Mass Spectrometer measures the mass/charge ratio of an ion.

Page 6: Ch08 massspec


N- and C-terminal Peptides

Page 7: Ch08 massspec


Terminal peptides and ion types

Peptide

Mass (D) 57 + 97 + 147 + 114 = 415

Page 8: Ch08 massspec


Masses of fragment ions

Peptide

Mass (D) 57 + 97 + 147 + 114 = 415

Peptide

Mass (D) 57 + 97 + 147 + 114 – 18 = 397

without

Page 9: Ch08 massspec

N- and C-terminal Peptides

[Figure: fragment masses 415, 486, 301, 154, 57, 71, 185, 332, 429]


Page 11: Ch08 massspec

Theoretical Spectrum

[Figure: fragment masses 415, 486, 301, 154, 57, 71, 185, 332, 429]

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum)

57 71 154 185 301 332 415 429 486
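The mass list above can be reproduced with a short sketch. It assumes integer residue masses and a five-residue peptide: the first four masses (57, 97, 147, 114) come from the slides, while the fifth (71) is an assumption inferred from 486 - 415. The function name `fragment_masses` is hypothetical.

```python
from itertools import accumulate

# Integer residue masses: 57 + 97 + 147 + 114 = 415 is taken from the slides;
# the last residue (71) is an assumption inferred from 486 - 415 = 71.
residues = [57, 97, 147, 114, 71]

def fragment_masses(residues):
    """Return the sorted set of all prefix and suffix masses of a peptide."""
    prefixes = accumulate(residues)            # N-terminal fragments
    suffixes = accumulate(reversed(residues))  # C-terminal fragments
    return sorted(set(prefixes) | set(suffixes))

spectrum = fragment_masses(residues)
print(spectrum)
```

Under these assumptions the printed list matches the nine masses shown on the slide; the reconstruction problem is the inverse task of recovering `residues` from that list.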

Page 12: Ch08 massspec


Reconstructing Peptides

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum)

57 71 154 185 301 332 415 429 486

Page 13: Ch08 massspec

Reconstructing Peptides

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum) 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486

Page 14: Ch08 massspec


Reconstructing Peptides

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum) 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486

Page 15: Ch08 massspec


Peptide Fragmentation

y3

b2

y2 y1

b3a2 a3

HO NH3+

| |

R1 O R2 O R3 O R4

| || | || | || |

H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH

| | | | | | |

H H H H H H H

b2-H2O

y3 -H2O

b3- NH3

y2 - NH3

Page 16: Ch08 massspec

Mass Spectra

[Figure: spectrum of the peptide G V D L K along the mass axis starting at 0; 57 Da = 'G', 99 Da = 'V'; the ladders read G V D L K and K L D V G; additional labels: D, H2O]

• The peaks in the mass spectrum:

• Prefix and Suffix Fragments.

• Fragments with neutral losses (-H2O, -NH3)

• Noise and missing peaks.

Page 17: Ch08 massspec

Protein Identification with MS/MS

[Figure: MS/MS spectra (intensity vs. mass) are passed to peptide identification against a protein database]

Page 18: Ch08 massspec

Tandem Mass Spectrum

• Tandem Mass Spectrometry mainly generates N- and C-terminal fragment ions

• Chemical noise often complicates the spectrum.

• Represented in 2-D: mass/charge axis vs. intensity axis

[Figure: MS/MS spectrum, scan 1708 (RT 54.47, NL 5.27E6), full ms2 of 638.00 over 165.00 to 1925.00; x-axis m/z 200 to 2000, y-axis relative abundance 0 to 100; labeled peaks: 850.3, 687.3, 588.1, 851.4, 425.0, 949.4, 326.0, 524.9, 589.2, 1048.6, 397.1, 226.9, 1049.6, 489.1, 629.0]

Page 19: Ch08 massspec


Tandem Mass-Spectrometry

Page 20: Ch08 massspec

Breaking Proteins into Peptides

[Figure: the protein MPSERGTDIMRPAKID...... is broken into peptides MPSER, GTDIMR, PAKID, ……, which go through HPLC to MS/MS]

Page 21: Ch08 massspec

Mass Spectrometry

Matrix-Assisted Laser Desorption/Ionization (MALDI)

From lectures by Vineet Bafna (UCSD)

Page 22: Ch08 massspec

Tandem Mass Spectrometry

[Figure, three panels plus an instrument schematic. LC: base peak chromatogram, relative abundance vs. time (RT 0.01 to 80.02 min), full ms 300.00 to 2000.00, NL 1.52E8. MS: scan 1707 (RT 54.44, NL 2.41E7), full ms 300.00 to 2000.00, relative abundance vs. m/z, labeled peaks include 638.0, 801.0, 638.9, 1173.8, 872.3, 1275.3. MS/MS: scan 1708 (RT 54.47, NL 5.27E6), full ms2 of 638.00 over 165.00 to 1925.00, labeled peaks include 850.3, 687.3, 588.1, 851.4, 425.0, 949.4. Schematic: ion source, MS-1, collision cell, MS-2]

Page 23: Ch08 massspec

Protein Identification by Tandem Mass Spectrometry (MS/MS)

[Figure: a sequence passes through the MS/MS instrument to produce a spectrum (scan 1708, relative abundance vs. m/z), which is then interpreted by one of two approaches]

database search

• Sequest, Mascot, etc.

de novo interpretation

• Lutefisk, Peaks, etc.

Page 24: Ch08 massspec

De Novo vs. Database Search

[Figure: MS/MS spectrum, scan 1708 (relative abundance vs. m/z), with candidate amino acid labels (W, R, A, C, V, G, E, K, D, L, P, T) placed between peaks]

De Novo

AVGELTK

Database of all peptides = 20^n

AAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE,

AAAAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI,

AVGELTI, AVGELTK , AVGELTL, AVGELTM,

YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY

Database Search

Database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

Mass, Score

Page 25: Ch08 massspec

De Novo vs. Database Search: A Paradox

• The database of all peptides is huge: ≈ 20^n peptides of length n.

• The database of all known peptides is much smaller: ≈ 10^8 peptides.

• However, de novo algorithms can be much faster, even though their search space is much larger!

• A database search scans all peptides in the database of all known peptides to find the best one.

• De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.

Page 26: Ch08 massspec

Three Algorithmic Problems

• Searching for a million words in a text. Suppose it takes 1 sec to find a word in a text. How much time would it take to find 1 million words in the text?

• Searching for a word without even looking at 99.999% of the text. Suppose you search for a word in a text. Would it be possible to ignore 99.999% of the text, scan only the remaining part and guarantee that the word you are looking for will be found?

• Finding spelling errors in a book written in an unknown language. Given a book (in an unknown language) and a misspelled word (with insertions, deletions, and substitutions of letters) correct spelling errors in the word.

Page 27: Ch08 massspec

Three Algorithmic Problems

• Searching for a million words in a text. Suppose it takes 1 sec to find a word in a text. How much time would it take to find 1 million words in the text? 1 million seconds?

• Searching for a word without even looking at 99.999% of the text. Suppose you search for a word in a text. Would it be possible to ignore 99.999% of the text, scan only the remaining part and guarantee that the word you are looking for will be found?

• Finding spelling errors in a book written in an unknown language. Given a book (in an unknown language) and a misspelled word (with insertions, deletions, and substitutions of letters) correct spelling errors in the word.

Page 28: Ch08 massspec

Genomics: Problems Solved

• Searching for a million words in a text.

The Aho-Corasick algorithm takes roughly the same time with a million words as it takes with a single word.

• Searching for a word without even looking at 99.999% of the text.

Filtration algorithms (like FASTA or BLAST) ignore 99.999% of the text.

• Finding spelling errors.

Sequence alignment algorithms (like Smith-Waterman) do it in quadratic time.

Page 29: Ch08 massspec

Proteomics: Three Problems

• Comparing a million spectra against a database. Suppose it takes 1 sec to interpret a spectrum. How much time would it take to interpret 1 million spectra?

• Mass-spectrometry database search without even looking at 99.999% of the database. Suppose you compare a spectrum against a database. Would it be possible to ignore 99.999% of the database, scan only the remaining part and guarantee that you still can identify a peptide of interest?

• Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets.

Page 30: Ch08 massspec

Three Solutions

• Comparing a million spectra against a database.

InsPecT (Tanner et al., Anal. Chem., 2005)

• MS/MS database search without even looking at 99.999% of the database.

InsPecT (Tanner et al., Anal. Chem., 2005)

• Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets.

MS-Alignment (Tsur et al., Nature Biotech., 2005)

Page 31: Ch08 massspec

Filtration: Combining De Novo Sequencing and Database Search in Mass-Spectrometry

• So far de novo and database search were presented as two separate techniques.

• Database search is rather slow: many labs generate more than 100,000 spectra per day. SEQUEST takes approximately 1 minute to compare a single spectrum against SWISS-PROT (54Mb) on a desktop.

• It will take SEQUEST more than 2 months to analyze the MS/MS data produced in a single day.

• Can slow database search be combined with fast de novo analysis?
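The "more than 2 months" figure follows from the two numbers quoted in the bullets. A back-of-the-envelope check, assuming exactly 100,000 spectra and exactly 1 minute per spectrum on a single desktop (the variable names are illustrative):

```python
# Figures quoted on the slide; treating them as exact is an assumption.
spectra_per_day = 100_000
minutes_per_spectrum = 1

# Total search time, converted from minutes to days.
total_days = spectra_per_day * minutes_per_spectrum / (60 * 24)
print(f"{total_days:.1f} days of search per day of data")
```

That is roughly 69 days of computation for one day of acquisition, which is why a faster filtration step is attractive.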

Page 32: Ch08 massspec

De novo Peptide Sequencing

[Figure: MS/MS spectrum, scan 1708 (relative abundance vs. m/z), interpreted directly into a sequence]

Page 33: Ch08 massspec


Building Spectrum Graph

• How to create vertices (from masses)

• How to create edges (from mass differences)

• How to score vertices

• How to score paths

• How to find the best path

Page 34: Ch08 massspec


S E Q U E N C E

b-ions (prefix or N-terminal ions)

Mass/Charge (M/Z)

Page 35: Ch08 massspec


a-ions = b-ions - CO = b-ions - 28

Mass/Charge (M/Z)

S E Q U E N C E

Page 36: Ch08 massspec


S E Q U E N C E

Mass/Charge (M/Z)

Shifting Peaks: a-ions = b-ions - CO = b-ions - 28

Page 37: Ch08 massspec

y-ions (suffix or C-terminal ions)

Mass/Charge (M/Z)

E C N E U Q E S

Page 38: Ch08 massspec

[Figure: spectrum, intensity vs. mass/charge (M/Z)]

Page 39: Ch08 massspec

[Figure: spectrum, intensity vs. mass/charge (M/Z)]

Page 40: Ch08 massspec


noise

Mass/Charge (M/Z)

Page 41: Ch08 massspec

MS/MS Spectrum

[Figure: intensity vs. mass/charge (M/Z)]

Page 42: Ch08 massspec

Some Mass Differences between Peaks Correspond to Amino Acids

[Figure: MS/MS spectrum in which gaps between peaks are labeled with the letters s, e, q, u, e, n, c, e]


Page 44: Ch08 massspec

Ion Types

• Some masses correspond to fragment ions, others are just random noise

• Knowing ion types Δ={δ1, δ2,…, δk} allows one to distinguish fragment ions from noise

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Page 45: Ch08 massspec

Example of Ion Type

• Δ={δ1, δ2,…, δk}

• Ion types {b, b-NH3, b-H2O, b-CO} correspond to Δ={0, 17, 18, 28}

*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity

Page 46: Ch08 massspec

Match between Spectra and the Shared Peak Count

• The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC)

• In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks

• Match between experimental and theoretical spectra is defined similarly

Page 47: Ch08 massspec


Peptide Sequencing Problem

Goal: Find a peptide with maximal match between

an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• Δ: set of possible ion types

Output:

• A peptide whose theoretical spectrum

matches the experimental spectrum the best

Page 48: Ch08 massspec


S E Q U E N C E

Mass/Charge (M/Z)

Shifting Peaks: a-ions = b-ions - CO = b-ions - 28

Page 49: Ch08 massspec


Reverse Shifts

Shift in H2O+NH3

Shift in H2O

Page 50: Ch08 massspec

Vertices of the Spectrum Graph

• Masses of potential N-terminal peptides

• Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk}

• Every N-terminal peptide of mass m can generate up to k ions m-δ1, m-δ2, …, m-δk

• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides

• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
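A minimal sketch of the vertex construction, assuming integer masses and the offsets Δ = {0, 17, 18, 28} from the ion-type example. The function name, the three-peak input spectrum, and the parent mass of 301 are all hypothetical values chosen for illustration.

```python
DELTA = [0, 17, 18, 28]  # offsets for b, b-NH3, b-H2O, b-CO (from the slides)

def spectrum_graph_vertices(spectrum, parent_mass, delta=DELTA):
    """Reverse-shift every peak by every ion-type offset.

    Vertex 0 is the initial vertex and parent_mass is the terminal vertex.
    """
    vertices = {0, parent_mass}
    for s in spectrum:
        for d in delta:
            v = s + d  # candidate mass of an N-terminal peptide
            if 0 < v < parent_mass:
                vertices.add(v)
    return sorted(vertices)

# Hypothetical input: 136 = 154 - 18 could be the b-H2O ion of prefix mass 154.
vertices = spectrum_graph_vertices([57, 136, 154], parent_mass=301)
print(vertices)
```

In this toy input the vertex 154 is generated twice (from peak 154 with offset 0 and from peak 136 with offset 18), which is the kind of supporting evidence a vertex score can reward.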

Page 51: Ch08 massspec

Edges of the Spectrum Graph

• Two vertices with mass difference corresponding to an amino acid A:

• Connect with an edge labeled by A

• Gap edges corresponding to the mass of pairs of amino acids (optional)
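The edge rule can be sketched as follows. This assumes integer residue masses and a deliberately tiny mass table (a real table would cover all 20 amino acids); the vertex list reuses the nine fragment masses from the earlier slides plus an initial vertex 0, and the function name is hypothetical. Gap edges are not included.

```python
# Integer residue masses for a few amino acids only (illustrative subset).
AA_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def spectrum_graph_edges(vertices, aa_mass=AA_MASS):
    """Connect u -> v with label A whenever v - u equals the mass of A."""
    vertex_set = set(vertices)
    edges = []
    for u in sorted(vertex_set):
        for aa, mass in aa_mass.items():
            if u + mass in vertex_set:
                edges.append((u, u + mass, aa))
    return edges

vertices = [0, 57, 71, 154, 185, 301, 332, 415, 429, 486]
edges = spectrum_graph_edges(vertices)
for u, v, aa in edges:
    print(f"{u:>3} -> {v:>3}  {aa}")
```

Because every edge goes from a smaller mass to a larger one, the resulting graph is acyclic, which is what makes path search tractable.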

Page 52: Ch08 massspec

Paths

• Paths in the labeled graph spell out amino acid sequences

• There are many paths; how do we find the correct one?

• We need scoring to evaluate paths
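To illustrate why scoring is needed, the sketch below enumerates every path from the smallest to the largest vertex and spells out its labels. It assumes integer masses, the same small illustrative mass table as a stand-in for the full amino acid table, and the nine-mass spectrum from the earlier slides; `spell_paths` is a hypothetical name.

```python
# Integer residue masses for a few amino acids only (illustrative subset).
AA_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def spell_paths(vertices, aa_mass=AA_MASS):
    """Return the amino acid strings spelled by all source-to-sink paths."""
    vertex_set = set(vertices)
    source, sink = min(vertex_set), max(vertex_set)

    def extend(v, prefix):
        if v == sink:
            yield prefix
            return
        for aa, mass in aa_mass.items():
            if v + mass in vertex_set:
                yield from extend(v + mass, prefix + aa)

    return list(extend(source, ""))

paths = spell_paths([0, 57, 71, 154, 185, 301, 332, 415, 429, 486])
print(paths)
```

Even this noise-free toy spectrum yields two candidates, a peptide and its reversal, because prefix and suffix masses are mixed in one list; real spectra with noise peaks produce many more, so paths have to be ranked.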

Page 53: Ch08 massspec

Path Score

• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}

• p(P, s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

• $p(P,S) = \prod_{s \in S} p(P, s)$

Page 54: Ch08 massspec

Peak Score

• For a position t that represents ion type δj:

$$p(P, s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1 - q_j, & \text{otherwise} \end{cases}$$

Page 55: Ch08 massspec

Peak Score (cont’d)

• For a position t that is not associated with an ion type:

$$p_R(P, s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1 - q_R, & \text{otherwise} \end{cases}$$

• qR = the probability of a noisy peak that does not correspond to any ion type

Page 56: Ch08 massspec

Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

$$p(P', S) = \max_{P} p(P, S)$$

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph
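Since the spectrum graph is acyclic, the best-scoring path can be found by dynamic programming over vertices in increasing mass order. The sketch below assumes additive vertex scores (for instance, logarithms of probability ratios); the score values, the small mass table, and the function name are hypothetical, and the sketch ignores the complication that a prefix and a suffix ion may claim the same peak.

```python
# Integer residue masses for a few amino acids only (illustrative subset).
AA_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def best_path(vertex_score, aa_mass=AA_MASS):
    """Return (score, peptide) of the highest-scoring source-to-sink path.

    vertex_score maps each vertex mass to an additive score.
    """
    order = sorted(vertex_score)
    source, sink = order[0], order[-1]
    best = {source: (vertex_score[source], "")}
    for v in order:  # every predecessor of v has a smaller mass
        if v not in best:
            continue  # v is unreachable from the source
        score, peptide = best[v]
        for aa, mass in aa_mass.items():
            w = v + mass
            if w in vertex_score:
                candidate = score + vertex_score[w]
                if w not in best or candidate > best[w][0]:
                    best[w] = (candidate, peptide + aa)
    return best.get(sink)

# Hypothetical scores: one ladder of vertices is better supported than the other.
scores = {0: 0.0, 57: 2.0, 71: 0.5, 154: 2.0, 185: 0.5,
          301: 2.0, 332: 0.5, 415: 2.0, 429: 0.5, 486: 0.0}
result = best_path(scores)
print(result)
```

The running time is proportional to the number of edges, which is the reason de novo search can be fast even though the space of all peptides is enormous.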

Page 57: Ch08 massspec


Ions and Probabilities

• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

• δi-ions of a partial peptide are produced independently with probabilities qi

Page 58: Ch08 massspec

Ions and Probabilities

• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$

• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$

• A peptide also produces a "random noise" with uniform probability qR in any position.

Page 59: Ch08 massspec

Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P' we only see ions δ1, δ2, δ4. The score is calculated as:

$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{(1 - q_3)}{(1 - q_R)} \cdot \frac{q_4}{q_R}$$
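The k=4 example can be evaluated numerically with a small sketch. The probabilities below are made-up values for illustration only (the slides do not give numbers), and `ratio_score` is a hypothetical name.

```python
import math

def ratio_score(q, q_noise, observed):
    """Ratio-test score of one partial peptide.

    q[j] is the probability of ion type j, q_noise the noise probability qR,
    and observed[j] says whether the ion of type j was seen.
    """
    score = 1.0
    for q_j, seen in zip(q, observed):
        if seen:
            score *= q_j / q_noise              # premium for an observed ion
        else:
            score *= (1 - q_j) / (1 - q_noise)  # penalty for a missing ion
    return score

# Made-up probabilities; ions 1, 2 and 4 observed, ion 3 missing.
q = [0.8, 0.6, 0.5, 0.4]
q_noise = 0.1
score = ratio_score(q, q_noise, [True, True, False, True])
print(score)
```

In practice the logarithm of this ratio is often used instead, so that the scores of partial peptides add up along a path.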

Page 60: Ch08 massspec

Scoring Peptides

• T: set of all positions.

• Ti = {tδ1, tδ2, ..., tδk}: set of positions that represent ions of partial peptide Pi.

• A peak at position tδj is generated with probability qj.

• R = T - ∪ Ti: set of positions that are not associated with any partial peptides (noise).

Page 61: Ch08 massspec

Probabilistic Model

• For a position $t_{\delta_j} \in T_i$, the probability p(t, P, S) that peptide P produces a peak at position t is:

$$p(t, P, S) = \begin{cases} q_j, & \text{if a peak is generated at position } t \\ 1 - q_j, & \text{otherwise} \end{cases}$$

• Similarly, for $t \in R$, the probability that P produces a random noise peak at t is:

$$p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1 - q_R, & \text{otherwise} \end{cases}$$

Page 62: Ch08 massspec

Probabilistic Score

• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

$$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$$

Page 63: Ch08 massspec

De Novo vs. Database Search

[Figure: MS/MS spectrum, scan 1708 (relative abundance vs. m/z), with candidate amino acid labels (W, R, A, C, V, G, E, K, D, L, P, T) placed between peaks]

De Novo

AVGELTK

Database Search

Database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

Page 64: Ch08 massspec

De Novo vs. Database Search: A Paradox

• De novo algorithms are much faster, even though their search space is much larger!

• A database search scans all peptides to find the best one.

• De novo eliminates the need to scan all peptides by modeling the problem as a graph search.

Why not sequence de novo?

Page 65: Ch08 massspec

Why Not Sequence De Novo?

• De novo sequencing is not very accurate:

• Less than 30% of the peptides sequenced were completely correct!

Algorithm         Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk, 1997    0.566                 0.189
SHERENGA, 1999    0.690                 0.289
Peaks, 2003       0.673                 0.246
PepNovo, 2005     0.727                 0.296

Page 66: Ch08 massspec

Pros and Cons of de novo Sequencing

• Advantage:

• Gets the sequences that are not necessarily in the database.

• An additional similarity search step using these sequences may identify the related proteins in the database.

• Disadvantage:

• Requires higher quality spectra to be accurate.

Page 67: Ch08 massspec


Peptide Sequencing Problem

Goal: Find a peptide with maximal match between

an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• Δ: set of possible ion types

Output:

• A peptide whose theoretical spectrum matches the experimental spectrum S the best

Page 68: Ch08 massspec

Peptide Identification Problem

Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• Δ: set of possible ion types

Output:

• A peptide from the database whose theoretical spectrum matches the experimental spectrum S the best
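A minimal sketch of this problem, using the unweighted shared peak count as the match score. It assumes integer residue masses and two ion types only, with offsets +1 for prefix ions and +19 for suffix ions; those offsets are an assumption chosen to reproduce the example spectra of PRTEIN, PRTEYN and PGTEYN that appear later in the deck. The three-peptide database, the noisy experimental spectrum, and the function names are hypothetical.

```python
# Integer residue masses for the amino acids used in the example.
AA_MASS = {"G": 57, "P": 97, "T": 101, "I": 113, "N": 114,
           "E": 129, "R": 156, "Y": 163}

def theoretical_spectrum(peptide):
    """Prefix ions (+1) and suffix ions (+19) for every internal cut."""
    masses = [AA_MASS[aa] for aa in peptide]
    spectrum = set()
    for cut in range(1, len(masses)):
        spectrum.add(sum(masses[:cut]) + 1)   # assumed prefix-ion offset
        spectrum.add(sum(masses[cut:]) + 19)  # assumed suffix-ion offset
    return spectrum

def identify(experimental, database):
    """Return the database peptide sharing the most peaks with the spectrum."""
    return max(database,
               key=lambda p: len(experimental & theoretical_spectrum(p)))

database = ["PRTEIN", "PRTEYN", "PGTEYN"]
# Hypothetical experiment: spectrum of PRTEYN with peak 484 missing, noise at 300.
experimental = {98, 133, 254, 296, 300, 355, 425, 526, 647, 682}
best = identify(experimental, database)
print(best)
```

This scans every database entry, which is exactly the linear cost that filtration approaches try to avoid; a realistic search would also restrict candidates by parent mass first.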

Page 69: Ch08 massspec

MS/MS Database Search

Database search in mass-spectrometry has been successful in identification of already known proteins.

An experimental spectrum can be compared with theoretical spectra of database peptides to find the best fit.

SEQUEST (Yates et al., 1995)

But reliable identification of modified peptides is a much more difficult problem.

Page 70: Ch08 massspec

The dynamic nature of the proteome

• The proteome of the cell is changing

• Various extracellular and other signals activate various protein pathways.

• A key mechanism of protein activation is post-translational modification (PTM)

• These pathways may lead to other genes being switched on or off

• Mass spectrometry is key to probing the proteome and detecting PTMs

Page 71: Ch08 massspec


Post-Translational Modifications

Proteins are involved in cellular signaling and metabolic regulation.

They are subject to a large number of biological modifications.

Most protein sequences are post-translationally modified and 600 types (!) of modifications of amino acid residues are known.

Page 72: Ch08 massspec


Examples of Post-Translational

Modification

Post-translational modifications increase the number of “letters” in

amino acid alphabet and lead to a combinatorial explosion in both

database search and de novo approaches.

Page 73: Ch08 massspec

Peptide Masses

[Figure: fragment masses 415, 486, 301, 154, 57, 71, 185, 332, 429]

Page 74: Ch08 massspec

Peptide Masses: Modification +16 on N

[Figure: fragment masses 415, 486, 301, 154, 57, 71, 185, 332, 429, with +16 marked on six fragments]

Page 75: Ch08 massspec


Reconstructing Modified Peptides

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum)

57 71 154 185 301 332 415 429 486

+16 +16 +16 +16 +16

57 71 154 201 301 348 431 445 502
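The shifted mass list can be reproduced by adding 16 to one residue and recomputing the fragments. As before, this sketch assumes integer residue masses and a five-residue peptide whose last residue (71) is inferred from 486 - 415; the modified residue is the one of mass 114 (N), and `fragment_masses` is a hypothetical name.

```python
from itertools import accumulate

def fragment_masses(residues):
    """Return the sorted set of all prefix and suffix masses of a peptide."""
    prefixes = accumulate(residues)
    suffixes = accumulate(reversed(residues))
    return sorted(set(prefixes) | set(suffixes))

# 57, 97, 147, 114 come from the slides; 71 is an inferred assumption.
residues = [57, 97, 147, 114, 71]

modified = residues.copy()
modified[3] += 16  # +16 on the residue of mass 114 (N)

print(fragment_masses(residues))
print(fragment_masses(modified))
```

Only the fragments that contain the modified residue move, so the two lists still agree on 57, 71, 154 and 301 while the other five masses are shifted by 16.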

Page 76: Ch08 massspec


Reconstructing Modified Peptides

Reconstruct peptide from the set of masses of fragment ions

(mass-spectrum) 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486

57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 348 387 409 415 423 431 445 472 502

Page 77: Ch08 massspec


Peptide Identification Problem Revisited

Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• Δ: set of possible ion types

Output:

• A peptide from the database whose theoretical spectrum matches the experimental spectrum S the best

Page 78: Ch08 massspec

Modified Peptide Identification Problem

Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• Δ: set of possible ion types

• Parameter k (# of mutations/modifications)

Output:

• A peptide that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum matches the experimental spectrum S the best

Page 79: Ch08 massspec

Search for Modified Peptides: Virtual Database Approach

• YFDSTDYNMAK

• 2^5 = 32 possibilities, with 2 types of modifications! Phosphorylation? Oxidation?

Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.

Combinatorial explosion, even for a small set of modification types.

A larger set of spurious matches must be filtered out. It's much more likely that incorrect matches will have high scores.

Problem. Extend the virtual database approach to a large set of modifications.
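The count of 32 can be checked by enumerating the virtual database for this one peptide. The sketch assumes that phosphorylation may occur on S, T or Y and oxidation on M, with nominal mass shifts of +80 and +16; the site rules and the function name are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Nominal mass shifts (Da) per modifiable residue; the choice of sites is an
# illustrative assumption: phosphorylation on S/T/Y, oxidation on M.
MODS = {"S": 80, "T": 80, "Y": 80, "M": 16}

def virtual_database(peptide, mods=MODS):
    """Enumerate every on/off combination of modifications for one peptide.

    Each variant is a dict mapping a residue position to its mass shift.
    """
    sites = [i for i, aa in enumerate(peptide) if aa in mods]
    variants = []
    for choice in product([False, True], repeat=len(sites)):
        shifts = {i: mods[peptide[i]] for i, on in zip(sites, choice) if on}
        variants.append(shifts)
    return variants

variants = virtual_database("YFDSTDYNMAK")
print(len(variants))
```

The number of variants doubles with every modifiable site and grows further with every additional modification type, which is the combinatorial explosion the slide refers to.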

Page 80: Ch08 massspec

Restrictive vs Unrestrictive (Blind) Search for Modified Peptides

• Restrictive search requires the researcher to guess which modification types are present in the sample

• How would you feel if you were allowed to perform only 10 out of 20x19=380 possible point mutations (with all others forbidden) while aligning two proteins?

• This is exactly what the restrictive PTM search algorithms do.

Page 81: Ch08 massspec


Restrictive vs Unrestrictive (Blind) Search

for Modified Peptides

• Restrictive search requires the researcher to guess which modification types are present in the sample

• Blind search performs an unrestrictive search for all

possible modification offsets at once.

Page 82: Ch08 massspec

Database Search: Sequence Analysis vs. MS/MS Analysis

Sequence analysis:

similar peptides (that are a few mutations apart) have similar sequences

MS/MS analysis:

similar peptides (that are a few mutations apart) have dissimilar spectra

Page 83: Ch08 massspec


Peptide Identification Problem: Challenge

Very similar peptides may have very different spectra!

Goal: Define a notion of spectral similarity that correlates well with the sequence similarity.

If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

Page 84: Ch08 massspec


Deficiency of the Shared Peaks Count

Shared peaks count (SPC): intuitive measure of spectral similarity.

Problem: SPC diminishes very quickly as the number of mutations increases.

Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.

Page 85: Ch08 massspec

SPC Diminishes Quickly

S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}

S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}

S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

no mutations: SPC=10

1 mutation: SPC=5

2 mutations: SPC=2
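The three SPC values can be verified directly from the spectra listed above. A minimal sketch; `shared_peak_count` is a hypothetical name, and exact integer matching of peaks is assumed (real data would need a mass tolerance).

```python
# Spectra copied from the slide.
S_PRTEIN = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
S_PRTEYN = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
S_PGTEYN = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

def shared_peak_count(a, b):
    """Number of masses present in both spectra (exact integer match)."""
    return len(set(a) & set(b))

print(shared_peak_count(S_PRTEIN, S_PRTEIN))  # no mutations
print(shared_peak_count(S_PRTEIN, S_PRTEYN))  # 1 mutation
print(shared_peak_count(S_PRTEIN, S_PGTEYN))  # 2 mutations
```

Each mutation shifts every fragment that contains the mutated residue, so roughly half of the shared peaks disappear per mutation.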

Page 86: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Convolution

$$S_2 \ominus S_1 = \{s_2 - s_1 : s_1 \in S_1,\ s_2 \in S_2\}$$

$(S_2 \ominus S_1)(x)$: the number of pairs $s_1 \in S_1$, $s_2 \in S_2$ with $s_2 - s_1 = x$.

The shared peaks count (SPC peak): $(S_2 \ominus S_1)(0)$.

Page 87: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Elements of S2 ⊖ S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.

Page 88: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Convolution: An Example

[Figure: plot of the spectral convolution; horizontal axis x from -150 to 150, vertical axis from 0 to 5]

Page 89: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Comparison: Difficult Case

S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

Which of the spectra

S’ = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}

or

S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

fits the spectrum S the best?

SPC: both S’ and S” have 5 peaks in common with S.

Spectral Convolution: reveals the peaks at 0 and 5.
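A minimal sketch of the spectral convolution for this example, assuming spectra are plain lists of integer masses. The function name `spectral_convolution` is illustrative. The value at offset 0 is the shared peaks count.

```python
from collections import Counter


def spectral_convolution(s2, s1):
    """Count, for every offset x, the pairs (a in s1, b in s2) with b - a == x."""
    return Counter(b - a for a in s1 for b in s2)


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S_prime = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
S_double_prime = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]

for name, other in (("S'", S_prime), ("S''", S_double_prime)):
    conv = spectral_convolution(other, S)
    # conv[0] is the SPC; conv[-5] counts peaks of S that reappear shifted by -5.
    print(name, "multiplicity at 0:", conv[0], "at -5:", conv[-5])
```

With the differences taken as (other spectrum) minus S, the shifted peak sits at -5; taking the differences in the opposite order puts it at +5. Both candidate spectra give the same multiplicities at these two offsets, which is why the convolution alone cannot separate them.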

Page 90: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Comparison: Difficult Case

S S’

S S’’

Page 91: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Limitations of the Spectrum Convolutions

Spectral convolution does not reveal that

spectra S and S’ are similar, while spectra S

and S” are not.

Clumps of shared peaks: the matching

positions in S’ come in clumps while the

matching positions in S” don't.

This important property was not captured by

spectral convolution.

Page 94: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Sequence Alignment=Path in a Grid

    A  R  N  G  A  L  R
A   1  .  .  .  1  .  .
R   .  1  .  .  .  .  1
N   .  .  1  .  .  .  .
G   .  .  .  1  .  .  .
Z   .  .  .  .  .  .  .
A   1  .  .  .  1  .  .
L   .  .  .  .  .  1  .
R   .  1  .  .  .  .  1

Finding similarities between

two peptides

is equivalent to finding an optimal path in a

Manhattan-like grid (sequence alignment).

Every horizontal/vertical segment in this path

corresponds to insertion/deletion of an amino

acid.

Can we find similarities between

a spectrum and a peptide

using a similar approach (spectral alignment)?

Page 95: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Converting Spectra into 0-1 Sequences

• Convert spectrum into a 0-1 string with 1s

corresponding to the positions of the peaks.
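As a small illustration, a spectrum with integer peak positions can be turned into a 0-1 string, and a modification with a positive offset then amounts to inserting a block of 0s. This is a sketch under the assumption that masses are already discretized to integer positions; `spectrum_to_bits` and `insert_zeros` are hypothetical helper names.

```python
def spectrum_to_bits(peaks, length):
    """Return a 0-1 string of the given length with 1s at the peak positions."""
    bits = ["0"] * length
    for position in peaks:
        bits[position] = "1"  # positions are assumed to lie in range(length)
    return "".join(bits)


def insert_zeros(bits, position, delta):
    """Model a modification with positive offset delta at the given position."""
    return bits[:position] + "0" * delta + bits[position:]


original = spectrum_to_bits([3, 5, 8], 10)
modified = insert_zeros(original, 4, 2)  # every peak after position 4 moves by +2
print(original)
print(modified)
```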

Page 96: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Modified peptide

Modifications are modeled as insertions (or deletions)

of blocks of zeroes

000101001010000000110000001001 Spectrum

00010100001-----00010000001001 Peptide

A modification with positive offset - inserting a block of 0s

A modification with negative offset - deleting a block of 0s

Page 97: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectra Comparison vs. String Comparison

• Comparison of theoretical and

experimental spectra (represented as 0-1

strings) corresponds to a (somewhat

unusual) edit distance/alignment

between 0-1 strings where elementary

edit operations are insertions and

deletions of blocks of 0s

• Use sequence alignment algorithms!

Page 98: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment Graph

[Figure: spectral alignment graph with diagonal segments labeled A, B, C, D, E. Horizontal axis: experimental spectrum (0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1). Vertical axis: theoretical spectrum of a peptide.]

Page 99: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment Graph

[Figure: spectral alignment graph with diagonal segments labeled A, B, C, D, E]

As in alignment algorithms, every path in the spectral alignment graph represents a possible interpretation of a spectrum.

A path covering the maximal number of 1s is the “best” interpretation of the spectrum.

Vertical/horizontal segments in the optimal path are modifications.

Page 100: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment vs. Sequence Alignment

• Alignment graph with different alphabet and

scoring.

• Movement can be diagonal (matching

masses) or horizontal/vertical

(insertions/deletions corresponding to PTMs).

• At most k horizontal/vertical moves.

Page 101: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Shifts

A = {a1 < … < an}: an ordered set of natural numbers.

A shift (i, Δ) is characterized by two parameters, the position (i) and the length (Δ).

The shift (i, Δ) transforms

{a1, …, an}

into

{a1, …, ai-1, ai+Δ, …, an+Δ}

Page 102: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Shifts: An Example

The shift (i, Δ) transforms {a1, …, an}

into {a1, …, ai-1, ai+Δ, …, an+Δ}

e.g.

10 20 30 40 50 60 70 80 90

shift (4, -5)

10 20 30 35 45 55 65 75 85

shift (7, -3)

10 20 30 35 45 55 62 72 82
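The example can be checked with a few lines of code. This is a sketch that treats the set as a sorted list and uses 1-based positions as on the slide; `apply_shift` is an illustrative name.

```python
def apply_shift(values, i, delta):
    """Apply the shift (i, delta): add delta to the i-th and all later elements.

    The position i is 1-based, matching the slide's notation.
    """
    return values[: i - 1] + [v + delta for v in values[i - 1:]]


start = [10, 20, 30, 40, 50, 60, 70, 80, 90]
after_first = apply_shift(start, 4, -5)
after_second = apply_shift(after_first, 7, -3)
print(after_first)
print(after_second)
```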

Page 103: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment Problem

• Find a series of k shifts that make the sets

A={a1, …., an} and B={b1,….,bn}

as similar as possible.

• k-similarity between sets

• D(k) - the maximum number of elements in

common between sets after k shifts.

Page 104: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Comparing Spectra=Comparing 0-1 Strings

• A modification with positive offset corresponds to inserting a block of 0s

• A modification with negative offset corresponds to

deleting a block of 0s

• Comparison of theoretical and experimental spectra (represented as 0-1 strings) corresponds to a

(somewhat unusual) edit distance/alignment

problem where elementary edit operations are insertions/deletions of blocks of 0s

• Use sequence alignment algorithms!

Page 105: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Product

A = {a1, …, an} and B = {b1, …, bm}

Spectral product A ⊗ B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai, bj), and the remaining elements being 0s.

[Figure: spectral product matrix; column labels 10 20 30 40 50 55 65 75 85 95]

SPC: the number of 1s on the main diagonal.

δ-shifted SPC: the number of 1s on the diagonal (i, i+δ).

Page 106: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment: k-similarity

k-similarity between spectra: the maximum number

of 1s on a path through this graph that uses at most

k+1 diagonals.

k-optimal spectral

alignment = a path.

The spectral alignment

allows one to detect

more and more subtle

similarities between

spectra by increasing k.

Page 107: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Use of k-Similarity

SPC reveals only D(0)=3 matching peaks.

Spectral alignment reveals more hidden similarities between spectra: D(1)=5 and D(2)=8, and detects the corresponding mutations.

Page 108: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The black line represents the path for k=0

The red lines represent the path for k=1

The blue lines (right) represent the path for k=2

Page 109: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Convolution’s Limitation

The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios.

[Figure: two spectral products of S = {10, 20, …, 100}. Left panel: columns 10 20 30 40 50 55 65 75 85 95, D(1) = 10. Right panel: columns 10 15 30 35 50 55 70 75 90 95, D(1) = 6. Shift function score = 10.]

Page 110: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Dynamic Programming for Spectral Alignment

$D_{ij}(k)$: the maximum number of 1s on a path to $(a_i, b_j)$ that uses at most $k+1$ diagonals.

$(i',j') \sim (i,j)$: the points are located on the same diagonal.

$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\ D_{i'j'}(k-1) + 1, & \text{otherwise} \end{cases}$$

$$D(k) = \max_{ij} D_{ij}(k)$$

Running time: $O(n^4 k)$

Page 111: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Edit Graph for Fast Spectral Alignment

diag(i,j) – the position

of previous 1 on the

same diagonal as (i,j)

Page 112: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Fast Spectral Alignment Algorithm

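$$D_{ij}(k) = \max \begin{cases} D_{diag(i,j)}(k) + 1 \\ M_{i-1,j-1}(k-1) + 1 \end{cases}$$

$$M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k)$$

$$M_{ij}(k) = \max \begin{cases} D_{ij}(k) \\ M_{i-1,j}(k) \\ M_{i,j-1}(k) \end{cases}$$

Running time: $O(n^2 k)$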
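The fast recurrences on this slide can be sketched as follows. The sketch makes several assumptions that the slides do not spell out: spectra are sets of distinct integers, a path may start at any pair (ai, bj), and consecutive points on a path must increase strictly in both coordinates. A dictionary keyed by the offset bj - ai plays the role of diag(i,j). The function name `k_similarity` is illustrative.

```python
def k_similarity(a, b, k_max):
    """Return [D(0), ..., D(k_max)] for two spectra given as integer collections."""
    a, b = sorted(set(a)), sorted(set(b))
    n, m = len(a), len(b)
    results = []
    prev_m = None  # padded table M(k-1); prev_m[i][j] covers indices < i and < j
    for _ in range(k_max + 1):
        cur_m = [[0] * (m + 1) for _ in range(n + 1)]
        last_on_diag = {}  # offset -> D value of the latest point on that diagonal
        for i in range(n):
            for j in range(m):
                best = 1  # a path consisting of this point alone
                offset = b[j] - a[i]
                if offset in last_on_diag:      # continue along the same diagonal
                    best = max(best, last_on_diag[offset] + 1)
                if prev_m is not None:          # spend one shift to change diagonal
                    best = max(best, prev_m[i][j] + 1)
                last_on_diag[offset] = best
                cur_m[i + 1][j + 1] = max(best, cur_m[i][j + 1], cur_m[i + 1][j])
        results.append(cur_m[n][m])
        prev_m = cur_m
    return results


S = list(range(10, 101, 10))
print(k_similarity(S, [10, 20, 30, 40, 50, 55, 65, 75, 85, 95], 1))
print(k_similarity(S, [10, 15, 30, 35, 50, 55, 70, 75, 90, 95], 1))
```

Each value of k fills one n-by-m table, so the work is proportional to nmk. For the two example spectra the sketch gives D(1) = 10 and D(1) = 6, matching the values quoted in the slide on the limitation of spectral convolution.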

Page 113: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment: Complications

Spectra are combinations of an increasing (N-terminal ions) and a decreasing (C-terminal ions) number series.

These series form two diagonals in the spectral product, the main diagonal and the perpendicular diagonal.

The described algorithm deals with the main diagonal only.

Page 114: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Spectral Alignment: Complications

• Simultaneous analysis of N- and C-terminal ions

• Taking into account the intensities and charges

• Analysis of minor ions

Page 115: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MS/MS and East-West Traveling Salesman Problem

Page 116: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

[Figure: spectrum graph on the peaks 87, 129, 216, 244, 317, 345, 432 and 474, with edges labeled by the amino acids S, E, T and D]

Anti-symmetric paths

The anti-symmetric path problem (Tim Chen, SODA 2001) avoids using the same peak twice (as both a b-ion and a y-ion) in the optimal path.

Without the anti-symmetric condition, the correct peptide SETDE would be missed and the algorithm would output a symmetric peptide: ESESE. Why?

De novo problem

Input: Spectrum graph

Output: Anti-symmetric longest path

Page 117: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

PTM Frequency Matrix

Offset A C D E F G H I K L M N P Q R S T V W Y

10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 0

11 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 1

12 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 0

13 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 1

14 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 3

15 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 1

16 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 6

17 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 3

18 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 1

19 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 0

20 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 0

21 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 2

22 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 13

23 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 2

24 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 1

25 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 0

26 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 1

27 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 0

28 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 2

29 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 1

30 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 1

31 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 0

32 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 0

33 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 3

34 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 2

35 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2

50,000 spectra (IKKb sample)

were searched in blind mode,

and identifications with p-value

<0.05 were retained

Shading of the cell (x,y) reflects

the number of annotations with

modification:

(offset x, amino acid y)

Page 118: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

PTM Frequency Matrix

16 on W - Oxidation

16 on M - Oxidation

14 on K - Methylation

22 - Sodium

32 on M - Double oxidation

28 on K - Dimethylation

Page 119: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Filtration in Similarity Searches

Scoring

Protein Query

Sequence Alignment –Smith-Waterman Algorithm

Sequence matches

Filtration

Sequence Alignment: – BLAST

Database

actgcgctagctacggatagctgatcc

agatcgatgccataggtagctgatcc

atgctagcttagacataaagcttgaat

cgatcgggtaacccatagctagctcg

atcgacttagacttcgattcgatcgaat

tcgatctgatctgaatatattaggtccg

atgctagctgtggtagtgatgtaaga

• BLAST filters out very few correct matches and is almost as accurate as the Smith-Waterman algorithm.

Page 120: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Filtration and MS/MS

Scoring

MS/MS spectrum

Peptide Sequencing – SEQUEST / Mascot

Sequence matches

Filtration

Database

MDERHILNMKLQWVCSDLPT

YWASDLENQIKRSACVMTLACHGGEMNGALPQWRTHLLE

RTYKMNVVGGPASSDALITG

MQSDPILLVCATRGHEWAILFGHNLWACVNMLETAIKLEGVF

GSVLRAEKLNKAAPETYIN..

Page 121: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Filtration in MS/MS Sequencing

• Filtration in MS/MS is more difficult than in BLAST.

• Early approaches using Peptide Sequence Tags were not able to replace the complete database search.

• Can we design a filtration-based search that can replace the database search, and is orders of magnitude faster?

Page 122: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Asking the Old Question Again: Why Not Sequence De Novo?

• De novo sequencing is still not very accurate!

Algorithm                              Amino Acid Accuracy    Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                  0.189
SHERENGA (Dancik et al., 1999)         0.690                  0.289
Peaks (Ma et al., 2003)                0.673                  0.246
PepNovo (Frank and Pevzner, 2005)      0.727                  0.296

Page 123: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

So What Can be Done with De Novo?

• Given an MS/MS spectrum:

• Can de novo predict the entire peptide sequence? - No! (accuracy is less than 30%).

• Can de novo predict partial sequences? - No! (accuracy is 50% for GutenTag and 80% for PepNovo)

• Can de novo predict a set of partial sequences that, with high probability, contains at least one correct tag? - Yes! (A Covering Set of Tags)

Page 124: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Peptide Sequence Tags

• A Peptide Sequence Tag is a short substring of a peptide.

Example: G V D L K

Tags:

G V D

V D L

D L K

Page 125: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Filtration with Peptide Sequence Tags

• Peptide sequence tags can be used as filters in database searches.

• The Filtration: Consider only database peptides that contain the tag (in its correct relative mass location).

• First suggested by Mann and Wilm (1994).

• Similar concepts also used by:

• GutenTag - Tabb et al. 2003.

• MultiTag - Sunayev et al. 2003.

• OpenSea - Searle et al. 2004.

Page 126: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Why Filter Database Candidates?

• Filtration makes genomic database searches practical (BLAST).

• Effective filtration can greatly speed up the process, enabling expensive searches involving post-translational modifications.

• Goal: generate a small set of covering tags and use them to filter the database peptides.

Page 127: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Tag Generation - Global Tags

• Parse tags from de novo reconstruction.

• Only a small number of tags can be generated.

• If the de novo sequence is completely incorrect, none of the tags will be correct.

[Figure: spectrum graph; de novo reconstruction: AVGELTK]

TAG    Prefix Mass
AVG    0.0
VGE    71.0
GEL    170.1
ELT    227.1
LTK    356.2
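The tag table for AVGELTK can be regenerated with a short sketch. It assumes standard monoisotopic residue masses (rounded to five decimals) and reports each tag with the mass of the prefix that precedes it. `global_tags` is an illustrative name.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example.
RESIDUE_MASS = {
    "A": 71.03711, "V": 99.06841, "G": 57.02146, "E": 129.04259,
    "L": 113.08406, "T": 101.04768, "K": 128.09496,
}


def global_tags(peptide, tag_length=3):
    """Return (tag, prefix mass) for every substring of the given length."""
    tags = []
    prefix_mass = 0.0
    for start in range(len(peptide) - tag_length + 1):
        tags.append((peptide[start:start + tag_length], round(prefix_mass, 1)))
        prefix_mass += RESIDUE_MASS[peptide[start]]  # mass preceding the next tag
    return tags


for tag, mass in global_tags("AVGELTK"):
    print(tag, mass)
```

A database peptide passes the filter only if it contains the tag at the matching prefix mass, which is why the mass is stored alongside each tag.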

Page 128: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Tag Generation - Local Tags

• Extract the highest-scoring subpaths from the spectrum graph.

• Sometimes gets misled by locally

promising-looking “garden paths”.

[Figure: spectrum graph with local subpaths]

TAG    Prefix Mass
AVG    0.0
WTD    120.2
PET    211.4

Page 129: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Ranking Tags

• Each additional tag used to filter increases the number of database hits and slows down the database search.

• Tags can be ranked according to their scores; however, this ranking is not very accurate.

• It is better to determine for each tag the “probability” that it is correct, and choose the most probable tags.

Page 130: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Reliability of Amino Acids in Tags

• For each amino acid in a tag we want to assign a probability that it is correct.

• Each amino acid, which corresponds to an edge in the spectrum graph, is mapped to a feature space that consists of the features that correlate with reliability of amino acid

prediction, e.g. score reduction due to edge removal

Page 131: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Score Reduction Due to Edge Removal

• The removal of an edge corresponding to a genuine amino acid usually leads to a reduction in the score of the de novo path.

• However, the removal of an edge that does not correspond to a genuine amino acid tends to leave the score unchanged.

[Figure: spectrum graphs before and after edge removal]

Page 132: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Probabilities of Tags

• How do we determine the probability of a

predicted tag ?

• We use the predicted probabilities of its amino

acids and follow the concept:

a chain is only as strong as its weakest link
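One possible reading of the weakest-link idea is to take the minimum of the per-amino-acid probabilities as the tag probability. The slide does not give the exact formula, so the sketch below is an assumption, and the probabilities in it are made-up illustrative numbers.

```python
def tag_probability(amino_acid_probabilities):
    """Weakest-link estimate: a tag is only as reliable as its least reliable residue."""
    return min(amino_acid_probabilities)


def rank_tags(candidates):
    """Sort (tag, per-residue probabilities) pairs by decreasing tag probability."""
    return sorted(candidates, key=lambda item: tag_probability(item[1]), reverse=True)


# Hypothetical per-residue probabilities for three candidate tags.
candidates = [
    ("AVG", [0.9, 0.8, 0.95]),
    ("WTD", [0.99, 0.3, 0.9]),
    ("PET", [0.7, 0.75, 0.72]),
]
for tag, probabilities in rank_tags(candidates):
    print(tag, tag_probability(probabilities))
```

Note how WTD ranks last despite having the single most reliable residue, because one weak residue makes the whole tag unreliable.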

Page 133: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Experimental Results

                     Length 3        Length 4        Length 5
Algorithm \ #tags    1      10       1      10       1      10
GlobalTag            0.80   0.94     0.73   0.87     0.66   0.80
LocalTag+            0.75   0.96     0.70   0.90     0.57   0.80
GutenTag             0.49   0.89     0.41   0.78     0.31   0.64

Page 134: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Tag-based Database Search

[Pipeline diagram with stages: De novo, Tag filter, Tag extension, Score, Significance. Db: 55M peptides. Candidate peptides (700).]

Page 135: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Matching Multiple Tags

• Matching of a sequence tag against a database is fast

• Even matching many tags against a database is fast

• k tags can be matched against a database in time proportional to database size, but independent of the number of tags.

• keyword trees (Aho-Corasick algorithm)

• Scan time can be amortized by combining scans for many spectra all at once.

• build one keyword tree from multiple spectra

Page 136: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Keyword Trees

[Figure: keyword tree built from the tags YFAK, YFNS and FNTA, scanned against the database text …YFRAYFNTA…]
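A compact sketch of keyword-tree matching in the Aho-Corasick style, using the three tags and the text fragment from this slide. It is an illustrative implementation written for this example, not the one used by any particular search tool; `build_keyword_tree` and `find_tags` are hypothetical names.

```python
from collections import deque


def build_keyword_tree(tags):
    """Build the trie, failure links and output lists for a set of tags."""
    children, fail, out = [{}], [0], [[]]
    for tag in tags:
        node = 0
        for ch in tag:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                out.append([])
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        out[node].append(tag)
    queue = deque(children[0].values())  # depth-1 nodes keep failure link 0
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]  # inherit shorter matches
            queue.append(child)
    return children, fail, out


def find_tags(text, tags):
    """Scan the text once and return (start position, tag) for every occurrence."""
    children, fail, out = build_keyword_tree(tags)
    hits, node = [], 0
    for pos, ch in enumerate(text):
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        hits.extend((pos - len(tag) + 1, tag) for tag in out[node])
    return hits


print(find_tags("YFRAYFNTA", ["YFAK", "YFNS", "FNTA"]))
```

The text is scanned once regardless of how many tags the tree holds, which is the property that lets tags from many spectra share a single database pass.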

Page 137: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Tag Extension

[Pipeline diagram with stages: De novo, Filter, Extension, Score, Significance. Db: 55M peptides. Candidate peptides (700).]

Page 138: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Fast Extension

• Given:

• tag with prefix and suffix masses: <mP> xyz <mS>

• match in the database

• Compute whether a suffix and a prefix match with allowable modifications.

• Compute a candidate peptide with the most likely positions of modifications (attachment points).

[Figure: database match xyz aligned with the tag <mP> xyz <mS>]

Page 139: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring Modified Peptides

[Pipeline diagram with stages: De novo, Filter, Extension, Score, Significance. Db: 55M peptides.]

Page 140: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Scoring

• Input:

• Candidate peptide with attached modifications

• Spectrum

• Output:

• Score function that normalizes for length, as

variable modifications can change peptide length.

Page 141: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Assessing Reliability of Identifications

[Pipeline diagram with stages: De novo, Filter, Extension, Score, Significance. Db: 55M peptides.]

Page 142: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Selecting Features for Separating Correct and Incorrect Predictions

• Features:

• Score S: as computed

• Explained Intensity I: fraction of total intensity explained by annotated peaks.

• b-y score B: fraction of b+y ions annotated

• Explained peaks P: fraction of top 25 peaks annotated.

• Each of I,S,B,P features is normalized (subtract mean and divide by s.d.)

• Problem: separate correct and incorrect identifications using I,S,B,P

Page 143: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Separating power of features

Page 144: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Separating power of features

Quality scores:

Q = wI I + wS S + wB B + wP P

The weights are chosen to minimize

the mis-classification error
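A minimal sketch of the quality score: each feature is z-score normalized across all identifications and the normalized values are combined linearly. The feature values and the equal weights below are made up for illustration; the slide says only that the weights are chosen to minimize the misclassification error, which this sketch does not attempt.

```python
from statistics import mean, pstdev


def normalize(values):
    """Subtract the mean and divide by the standard deviation (assumed non-zero)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def quality_scores(features, weights):
    """Return Q = wI*I + wS*S + wB*B + wP*P for every identification.

    `features` maps a feature name to its raw values, one per identification.
    """
    normalized = {name: normalize(values) for name, values in features.items()}
    count = len(next(iter(features.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in features)
        for i in range(count)
    ]


# Hypothetical raw feature values for three identifications.
features = {
    "I": [0.20, 0.50, 0.80],   # explained intensity
    "S": [10.0, 25.0, 40.0],   # score
    "B": [0.30, 0.50, 0.90],   # b-y score
    "P": [0.25, 0.40, 0.70],   # explained peaks
}
weights = {"I": 0.25, "S": 0.25, "B": 0.25, "P": 0.25}  # placeholder weights
print(quality_scores(features, weights))
```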

Page 145: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Distribution of Quality Scores

Page 146: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Results on ISB data-set

• All ISB spectra were searched.

• The top match is valid for 2978 spectra (2765 for Sequest)

• InsPecT-Sequest: 644 spectra (I-S dataset)

• Sequest-InsPecT: 422 spectra (S-I dataset)

• Average explained intensity of I-S = 52%

• Average explained intensity of S-I = 28%

• Average explained intensity of I ∩ S = 58%

• ~70 Met. Oxidations

• Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)

Page 147: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Filtration Results

The search was done against SWISS-PROT (54Mb).

• With 10 tags of length 3:

• The filtration is 1500 times more efficient.

• Less than 4% of spectra are filtered out.

• The search time per spectrum is reduced by two orders of magnitude as compared to SEQUEST.

PTMs               Tag Length    # Tags    Filtration      InsPecT Runtime    SEQUEST Runtime
None               3             1         3.4 × 10^-7     0.17 sec           > 1 minute
                   3             10        1.6 × 10^-6     0.27 sec
Phosphorylation    3             1         5.8 × 10^-7     0.21 sec           > 2 minutes
                   3             10        2.7 × 10^-6     0.38 sec

Page 148: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

SPIDER: Yet Another Application of de novo Sequencing

• Suppose you have a good MS/MS spectrum of an elephant peptide

• Suppose you even have a good de novo reconstruction of this spectrum

• However, until the elephant genome is sequenced, it is hard to verify this de novo reconstruction

• Can you search the de novo reconstruction of a peptide from elephant against a human protein database?

• The SPIDER algorithm addresses this comparative proteomics problem

Slides from Bin Ma, University of Western Ontario

Page 149: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Common de novo sequencing errors

[Figure: spectrum with a GG segment]

N and GG have the same mass

Page 150: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

From de novo Reconstruction to Database

Candidate through Real Sequence

• Given a sequence with errors, search for

the similar sequences in a DB.

(Seq) X: LSCFAV

(Real) Y: SLCFAV

(Match) Z: SLCF-V

sequencing error

(Seq) X: LSCF-AV

(Real) Y: EACF-AV

(Match) Z: DACFKAV

mass(LS)=mass(EA)

Homology mutations
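The mass coincidences behind these sequencing errors can be checked directly. The sketch assumes nominal (integer) residue masses; with exact monoisotopic masses LS and EA differ by about 0.06 Da, so the equality on the slide holds only at low mass resolution.

```python
# Nominal (integer) residue masses of the 20 standard amino acids.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def mass(peptide):
    """Sum of nominal residue masses."""
    return sum(NOMINAL_MASS[aa] for aa in peptide)


for left, right in (("N", "GG"), ("LS", "EA"), ("LS", "SL")):
    print(left, mass(left), right, mass(right), mass(left) == mass(right))
```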

Page 153: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Alignment between de novo Candidate and

Database Candidate

• If real sequence Y is known then:

d(X,Z) = seqError(X,Y) + editDist(Y,Z)

• If real sequence Y is unknown then the distance between de novo candidate X and database candidate Z:

• d(X,Z) = min_Y ( seqError(X,Y) + editDist(Y,Z) )

• Problem: search a database for Z that minimizes d(X,Z)

• The core problem is to compute d(X,Z) for given X and Z.

(Seq) X: LSCF-AV

(Real) Y: EACF-AV

(Match) Z: DACFKAV

Page 154: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Computing seqError(X,Y)

• Align X and Y (according to mass).

• A segment of X can be aligned to a segment of Y only if their mass is the same!

• For each erroneous mass block (Xi,Yi), the cost is f(Xi,Yi)=f(mass(Xi)).

• f(m) depends on how often de novo sequencing makes errors on a segment with mass m.

• seqError(X,Y) is the sum of all f(mass(Xi)).

[Figure: X is related to Y by seqError, and Y to Z by editDist]

(Seq) X: LSCFAV

(Real) Y: EACFAV

Page 155: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Computing d(X,Z)

• Dynamic Programming:

• Let D[i,j]=d(X[1..i], Z[1..j])

• We examine the last block of the alignment of

X[1..i] and Z[1..j].

(Seq) X: LSCF-AV

(Real) Y: EACF-AV

(Match) Z: DACFKAV

Page 156: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Dynamic Programming: Four Cases

• Cases A, B, C: no de novo sequencing errors

D[i,j]=D[i,j-1]+indel

D[i,j]=D[i-1,j]+indel

D[i,j]=D[i-1,j-1]+dist(X[i],Z[j])

• Case D: de novo sequencing error

D[i,j]=D[i’-1,j’-1]+alpha(X[i’..i],Z[j’..j])

• D[i,j] is the minimum of the four cases.

Page 157: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Computing alpha(.,.)

• alpha(X[i’..i],Z[j’..j])

= min_{m(y)=m(X[i’..i])} [seqError(X[i’..i],y) + editDist(y,Z[j’..j])]

= min_{m(y)=m[i’..i]} [f(m[i’..i]) + editDist(y,Z[j’..j])]

= f(m[i’..i]) + min_{m(y)=m[i’..i]} editDist(y,Z[j’..j]),

where m[i’..i] denotes m(X[i’..i]).

• This is like aligning a mass with a string.

• Mass-alignment Problem: Given a mass m and a peptide P, find a peptide of mass m that is most similar to P (among all possible peptides).

Page 158: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Solving Mass-Alignment Problem

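Let $\beta(m, Z[i..j])$ denote the minimum of $editDist(y, Z[i..j])$ over all peptides $y$ of mass $m$. With $y$ ranging over single amino acids in the inner minima:

$$\beta(m, Z[i..j]) = \min \begin{cases} \min_y \beta(m - m(y), Z[i..j]) + indel \\ \beta(m, Z[i..(j-1)]) + indel \\ \min_y \beta(m - m(y), Z[i..(j-1)]) + dist(y, Z[j]) \end{cases}$$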
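The mass-alignment problem can be sketched as a memoized recursion over the remaining mass and a prefix of the peptide. The sketch assumes nominal integer masses, a unit indel cost and a unit substitution cost; none of these costs is given on the slides. `mass_alignment` is an illustrative name.

```python
from functools import lru_cache

NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
INDEL = 1         # assumed cost of inserting or deleting one amino acid
SUBSTITUTION = 1  # assumed cost of a mismatch
INF = float("inf")


def mass_alignment(m, z):
    """Minimum edit distance between z and any peptide of nominal mass m."""

    @lru_cache(maxsize=None)
    def best(mass_left, j):
        # Distance between z[:j] and the closest peptide of mass `mass_left`.
        if mass_left == 0 and j == 0:
            return 0
        result = INF
        if j > 0:
            result = min(result, best(mass_left, j - 1) + INDEL)  # delete z[j-1]
        for y, my in NOMINAL_MASS.items():
            if my <= mass_left:
                result = min(result, best(mass_left - my, j) + INDEL)  # insert y
                if j > 0:
                    cost = 0 if y == z[j - 1] else SUBSTITUTION
                    result = min(result, best(mass_left - my, j - 1) + cost)
        return result

    return best(m, len(z))


print(mass_alignment(200, "LS"))  # LS itself has nominal mass 200
print(mass_alignment(186, "EA"))  # e.g. DA has mass 186 and differs in one residue
```

A mass that no peptide can reach yields infinity. The number of states is the mass times the peptide length, so a practical implementation would discretize real masses and bound the mass of an erroneous block.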

Page 159: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Improving the Efficiency

• Homology Match mode:

• Assumes tagging (only peptides that share a tag of length 3 with de novo reconstruction are considered) and extension of found hits by dynamic programming around the hits.

• Non-gapped homology match mode:

• Sequencing error and homology mutations do not overlap.

• Segment Match mode:

• No homology mutations.

• Exact Match mode:

• No sequencing errors and homology mutations.

Page 160: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Experiment Result

• The correct peptide sequence for each spectrum is known.

• The proteins are all in Swissprot but not in Human

database.

• SPIDER searches 144 spectra against both Swissprot and

human databases

Page 161: Ch08 massspec

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Example

• Using de novo reconstruction X=CCQWDAEACAFNNPGK,

the homolog Z was found in human database. At the same

time, the correct sequence Y, was found in SwissProt

database.

Seq(X): CCQ[W ]DAEAC[AF]<NN><PG>K

Real(Y): CCK AD DAEAC FA VE GP K

Database(Z): CCK[AD]DKETC[FA]<EE><GK>K

sequencing errors

homology mutations