SpectraST: A Spectral Library Searching Tool for Proteomics Henry Lam Day 3 October 18, 2006

Source: tools.proteomecenter.org/course/lectures/0610-Day3.Lam.pdf

SpectraST: A Spectral Library Searching Tool for Proteomics

Henry Lam
Day 3
October 18, 2006


Why Spectral Searching?

• Traditional sequence (database) searching is
  • Very computationally intensive and costly
  • Error prone
  • Unable to capitalize on past data
  • Good for purely discovery-oriented experiments

• Newer approaches to proteomics are often more targeted
  • Know what you are looking for
  • More interest in probing/quantifying/understanding proteome segments that have already been mapped out
  • Repeated sampling of same proteome segments


Spectral Library Searching

• Identifying an unknown peptide MS/MS (CID) spectrum by matching it against a library of known peptide MS/MS spectra

• Premise: One-to-one correspondence between peptide ion (sequence + charge + modifications) and its characteristic MS/MS “fingerprint”

• General and widely practiced method for small molecules

• Problem with proteomics until recently: lack of good spectral libraries*

* Yates, J. R., et al., Anal. Chem. 1998, 70, 3557-3565.


Spectral Libraries

• Impractical to construct spectral libraries using purified peptides; instead, peptide-spectrum correspondence established by sequence searching on complex samples

• Recent developments enabling the construction of peptide spectral libraries
  • Explosion of shotgun proteomics data
  • Emergence of public data repositories
  • Standardization of data formats

• Consensus spectral libraries: Multiple observations of the same peptide ion aggregated to form “consensus spectra”
  • Decrease library size
  • Give higher confidence in peptide-spectrum correspondence
  • Average out experiment-to-experiment variations
  • Reduce noise and other spurious peaks (e.g. from impurities)


Spectral Library Creation

[Figure: spectra from PeptideAtlas (10^6 spectra for yeast; Dataset 1, Dataset 2, Dataset 3), each identified to a peptide, are combined into a library of consensus MS/MS spectra (10^4 spectra) with a precursor m/z index for fast retrieval]

1. Search datasets by 4 different sequence search engines (SEQUEST, Mascot, X!Tandem, OMSSA)

2. Group replicate spectra identified to same peptide ion with high confidence

3. Combine replicates to create consensus spectra

4. Apply quality filters to consensus spectra

5. Build searchable libraries with indexes

* Details described in NIST library documentation
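Step 3 above (combining replicates into a consensus spectrum) can be illustrated with a minimal sketch. This is not the NIST procedure, which the slide defers to the NIST documentation; the function name `consensus`, the majority-vote rule, and the `min_fraction` threshold are all assumptions chosen for illustration. Each replicate is assumed to be already binned as a mapping from unit-m/z bin to intensity.

```python
from collections import defaultdict

def consensus(replicates, min_fraction=0.6):
    """Average replicate spectra of one peptide ion (illustrative only).

    replicates   -- list of dicts {mz_bin: intensity}
    min_fraction -- hypothetical threshold: keep a peak only if it appears
                    in at least this fraction of the replicates
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for spectrum in replicates:
        for mz_bin, intensity in spectrum.items():
            totals[mz_bin] += intensity
            counts[mz_bin] += 1
    needed = min_fraction * len(replicates)
    # Average each retained peak over the replicates in which it was seen.
    return {b: totals[b] / counts[b] for b in sorted(totals) if counts[b] >= needed}

# Three made-up replicates; the peak at bin 250 appears only once.
reps = [
    {136: 300.0, 171: 500.0},
    {136: 400.0, 171: 460.0, 250: 50.0},
    {136: 350.0, 171: 480.0},
]
result = consensus(reps)
```

Dropping peaks seen in only a minority of replicates is one simple way to get the noise reduction the earlier slide attributes to consensus spectra; a real library builder would also weight replicates and apply the quality filters of step 4.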


NIST Libraries

• NIST consensus spectral libraries (as of Sept 2006)
• Available on PeptideAtlas and ProteomeCommons

Organism                # datasets used    # consensus spectra
Yeast                   43                 35,135
Human                   77                 45,377
M. smegmatis            1                  3,569
19 Standard proteins    28                 3,938
D. radiodurans          1                  8,273


NIST Libraries

• In .msp format

  Name: AVYHVALR/2
  MW: 929.550
  Comment: Spec=Consensus Pep=Tryptic Fullname=R.AVYHVALR.N/2 Mods=0 Parent=464.775… Protein="gi|6319673|ref|NP_009755.1|"…Se=4^M18:sc=41.53/3.009,td=25.345/4.607…Sample=13/yeast_comp12vs12sizefrac_cam,2,2/yeast_gygi_cam,1,1/…Nreps=18/29… Probcorr=1
  Num peaks: 99
  136.1 354 "? 18/10 1.0"
  143.1 161 "a2/-0.01 16/10 0.8"
  163.9 115 "? 10/10 0.6"
  170.9 483 "b2/-0.21,y3-18^2/-0.22 18/10 1.9"
  Name: AVYLETIGNPK/1

Callouts on the slide: Name (peptide ID); Sequence search info; Source of spectra; Number of replicates; Peak list (m/z, intensity, annotation)
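To make the record layout concrete, here is a minimal sketch of reading such a record in Python. It is not SpectraST's own reader; the function name `parse_msp` and the dictionary layout are assumptions. The sample record is abbreviated from the slide: only the Comment fields shown in full are kept, and `Num peaks` is set to 4 to match the four peaks shown.

```python
import shlex

def parse_msp(text):
    """Parse .msp-style text into a list of record dicts (illustrative only)."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Name:"):
            # A "Name:" line starts a new record.
            current = {"name": line[5:].strip(), "comment": {}, "peaks": []}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif line.startswith("MW:"):
            current["mw"] = float(line[3:])
        elif line.startswith("Comment:"):
            # Comment is a series of key=value tokens; values may be quoted.
            for token in shlex.split(line[8:]):
                key, _, value = token.partition("=")
                current["comment"][key] = value
        elif line.startswith("Num peaks:"):
            current["num_peaks"] = int(line[10:])
        else:
            # Peak line: m/z, intensity, then a quoted annotation.
            mz, intensity, *rest = line.split(None, 2)
            annotation = rest[0].strip('"') if rest else ""
            current["peaks"].append((float(mz), float(intensity), annotation))
    return records

SAMPLE = '''Name: AVYHVALR/2
MW: 929.550
Comment: Spec=Consensus Pep=Tryptic Fullname=R.AVYHVALR.N/2 Mods=0 Parent=464.775 Nreps=18/29 Probcorr=1
Num peaks: 4
136.1 354 "? 18/10 1.0"
143.1 161 "a2/-0.01 16/10 0.8"
163.9 115 "? 10/10 0.6"
170.9 483 "b2/-0.21,y3-18^2/-0.22 18/10 1.9"
'''

records = parse_msp(SAMPLE)
```

The `Parent` value in the Comment is the precursor m/z, which is the quantity the precursor m/z index is keyed on.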


Spectral Searching

[Figure: a query spectrum with unknown ID (precursor m/z = 790.9) is used to retrieve candidates by precursor m/z, via the precursor m/z index, from the library of consensus MS/MS spectra (10^4 spectra); the candidate spectra with similar precursor m/z (10^2 spectra) are then compared with the query by spectral matching]
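The candidate-retrieval step can be sketched with a sorted list and binary search. This is only an illustration of the idea of a precursor m/z index, not SpectraST's .spidx implementation; the class name `PrecursorIndex`, the 3.0 m/z tolerance, and the library entries (placeholder names with made-up m/z values) are all assumptions.

```python
from bisect import bisect_left, bisect_right

class PrecursorIndex:
    """Sorted precursor m/z index over library entries (illustrative only)."""

    def __init__(self, entries):
        # entries: iterable of (precursor_mz, entry_name)
        self._entries = sorted(entries)
        self._mzs = [mz for mz, _ in self._entries]

    def candidates(self, query_mz, tolerance=3.0):
        """Return names of entries within +/- tolerance of query_mz."""
        lo = bisect_left(self._mzs, query_mz - tolerance)
        hi = bisect_right(self._mzs, query_mz + tolerance)
        return [name for _, name in self._entries[lo:hi]]

# Hypothetical library entries; names and m/z values are placeholders.
index = PrecursorIndex([
    (907.0, "ENTRY_D"),
    (791.0, "ENTRY_A"),
    (792.0, "ENTRY_B"),
    (842.0, "ENTRY_C"),
])
hits = index.candidates(790.9)
```

Because the lookup is two binary searches, retrieval cost grows with the logarithm of the library size, and only the small candidate set then needs full spectral comparison.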


SpectraST

• Open Source (http://sourceforge.net/projects/sashimi under trans_proteomic_pipeline)

• LINUX / Windows versions (http://tools.proteomecenter.org/TPP.php)

• Extensible, modular design

• Fully integrated with Trans-Proteomic Pipeline

• Modest processor and memory requirements


SpectraST Create Mode

[Figure: Raw library (.msp) from NIST → SpectraST (Create mode) → Library (.splib), m/z index (.spidx), Peptide index (.pepidx)]

Coming Attraction: Create your own!


SpectraST Search Mode

[Figure: Query spectra (.mzXML), Library (.splib) and m/z index (.spidx) → SpectraST (Search mode) → Results (.xls or .pepXML) → Trans-Proteomic Pipeline]


SpectraST: Under the Hood

• Most behavior customizable

• Query spectra filtering and processing
  • Ignore spectra with too few peaks
  • Remove tiny (noise) peaks
  • Remove the parent peak and its neutral losses
  • Scale intensities (to deemphasize dominant peaks)
    • Scaled intensity = (intensity)^0.5
  • Assign peaks into unit-m/z bins
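The processing steps listed on this slide can be sketched as a single function. This is a rough illustration, not SpectraST's code: the function name `preprocess`, every threshold (`min_peaks`, `min_intensity`, `window`), the choice of water and ammonia as the neutral losses, and the decision to sum scaled intensities that fall in the same bin are all assumptions.

```python
import math

def preprocess(peaks, precursor_mz, charge,
               min_peaks=6, min_intensity=2.0, window=2.0):
    """Filter, scale and bin a query spectrum (illustrative only).

    peaks -- list of (mz, intensity) pairs
    Returns {unit_mz_bin: scaled_intensity}, or None if the spectrum
    has too few peaks to be worth searching.
    """
    # Ignore spectra with too few peaks.
    if len(peaks) < min_peaks:
        return None
    # Parent peak plus assumed neutral losses (water, ammonia) at this charge.
    excluded = [precursor_mz] + [precursor_mz - loss / charge
                                 for loss in (18.011, 17.027)]
    bins = {}
    for mz, intensity in peaks:
        if intensity < min_intensity:
            continue  # remove tiny (noise) peaks
        if any(abs(mz - x) <= window for x in excluded):
            continue  # remove the parent peak and its neutral losses
        # Square-root scaling deemphasizes dominant peaks; unit-m/z binning.
        b = int(round(mz))
        bins[b] = bins.get(b, 0.0) + math.sqrt(intensity)
    return bins

# Peaks from the .msp example plus a made-up parent peak and noise peak.
query = [(136.1, 354), (143.1, 161), (163.9, 115), (170.9, 483),
         (464.8, 900), (100.2, 1.0)]
binned = preprocess(query, precursor_mz=464.775, charge=2, min_peaks=4)
```

With square-root scaling, a peak 100 times more intense than another counts only 10 times as much, so a single dominant fragment is less able to decide the match on its own.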


SpectraST: Under the Hood

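• Similarity scoring
  • Dot product (j = bin number, I = normalized scaled intensity)

    \[ \mathrm{Dot} = \sum_{j=1}^{n} I_{\mathrm{library}}(j)\, I_{\mathrm{query}}(j) \]

  • Delta Dot (Dot_(1) and Dot_(2) are the dot products of the top and second-ranked hits)

    \[ \Delta\mathrm{Dot} = \frac{\mathrm{Dot}_{(1)} - \mathrm{Dot}_{(2)}}{\mathrm{Dot}_{(1)}} \]

  • Dot Bias

    \[ \mathrm{DotBias} = \frac{1}{\mathrm{Dot}} \sqrt{\sum_{j=1}^{n} I_{\mathrm{library}}(j)^{2}\, I_{\mathrm{query}}(j)^{2}} \]

    1.0 = one bin accounts for entire dot product
    ~0.0 = all bins contribute equally

• Discriminant F function
  • Combination of the scores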
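The three scores on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than SpectraST's implementation: the function names are made up, spectra are assumed to be dictionaries from bin number to scaled intensity, "normalized" is taken to mean scaled to unit length, and the dot bias uses the square-root form that makes a single shared bin score exactly 1.0. The discriminant F function is omitted because the slide does not give its form.

```python
import math

def normalize(bins):
    """Scale a binned spectrum to unit length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in bins.values()))
    return {j: v / norm for j, v in bins.items()}

def dot(query, library):
    """Dot product over bins shared by two normalized spectra."""
    return sum(v * library.get(j, 0.0) for j, v in query.items())

def dot_bias(query, library):
    """Near 1.0 when one bin dominates the dot product, lower when spread out."""
    d = dot(query, library)
    if d == 0.0:
        return 0.0
    squares = sum((v * library.get(j, 0.0)) ** 2 for j, v in query.items())
    return math.sqrt(squares) / d

def delta_dot(dots):
    """Relative gap between the best and second-best dot products.

    Assumes at least two candidates and a positive top score.
    """
    ranked = sorted(dots, reverse=True)
    return (ranked[0] - ranked[1]) / ranked[0]

flat = normalize({1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0})   # four equal bins
single = normalize({100: 1.0})                        # one bin only
other = normalize({100: 2.0, 300: 1.0})
```

For n equally contributing bins the dot bias works out to 1/sqrt(n), which is why it approaches 0.0 only as the number of contributing bins grows.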


SpectraST: Test Drive

• Test datasets
  • 2 yeast datasets
    • LCQ - ICAT (23,000 spectra; 1,500 p>0.99 IDs)
    • LCQ (2,700 spectra; 990 p>0.99 IDs) <- Tutorial
  • 1 human plasma dataset
    • Bruker Esquire (2.4 million spectra; 430,000 p>0.99 IDs)

• Speed
  • ~0.01s per query spectrum on P4 3.4GHz, 2GB RAM machine (compared to ~5s for SEQUEST)
  • ~500x improvement in speed!
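As a back-of-the-envelope check on the speed figures, the per-spectrum times quoted above can be scaled to the three test datasets. This is only rough arithmetic: it assumes the quoted per-spectrum times hold for every dataset and that the search runs serially on one machine, neither of which the slide states.

```python
SPECTRAST_S_PER_SPECTRUM = 0.01  # from the slide (approximate)
SEQUEST_S_PER_SPECTRUM = 5.0     # from the slide (approximate)

DATASETS = {
    "yeast LCQ-ICAT": 23_000,
    "yeast LCQ": 2_700,
    "human plasma": 2_400_000,
}

def hours(n_spectra, seconds_per_spectrum):
    """Total serial search time in hours."""
    return n_spectra * seconds_per_spectrum / 3600.0

speedup = SEQUEST_S_PER_SPECTRUM / SPECTRAST_S_PER_SPECTRUM

for name, n in DATASETS.items():
    print(f"{name}: SpectraST ~{hours(n, SPECTRAST_S_PER_SPECTRUM):.2f} h, "
          f"SEQUEST ~{hours(n, SEQUEST_S_PER_SPECTRUM):.0f} h")
```

Under these assumptions the 2.4 million plasma spectra take under 7 hours with spectral searching, against several months of serial sequence searching.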


PeptideProphet Analysis

• Yeast dataset 1: LCQ – ICAT (23,000 spectra)

[Figure: PeptideProphet analysis plots; panels: SpectraST, SEQUEST]


PeptideProphet Analysis

• Yeast dataset 1: LCQ – ICAT (23,000 spectra)

[Figure: PeptideProphet analysis plots; panels: SpectraST, SEQUEST]


Comparison of IDs

                                         SpectraST            SEQUEST
Positive hits (P > 0.99 IDs)             1,551 (26% more)     1,230
Intersection (matched positive hits)     913 (59%)            913 (74%)
Extra hits (not in intersection):
  Matched lower-confidence hits          376 (24%)            78 (6%)
  Missed by SEQUEST
    Manually determined to be correct    252 (16%)
    Manually determined to be incorrect  7 (0.5%)
  Missed by SpectraST
    Not in spectral library                                   236 (19%)
    In spectral library                                       3 (0.2%)


Comparison of IDs

• Lessons
  • SpectraST confidently identifies significantly more spectra than SEQUEST.
  • For IDs found by both engines, SpectraST is more confident about them.
  • When SpectraST misses a good ID, it is almost always because the peptide is not in the library.
  • SpectraST is rarely wrong with its confident IDs.


PeptideProphet Analysis

• Human plasma dataset: Bruker Esquire (2.4 million spectra)

[Figure: PeptideProphet analysis plots; panels: SpectraST, SEQUEST]


PeptideProphet Analysis

• Human plasma dataset: Bruker Esquire (2.4 million spectra)

[Figure: PeptideProphet analysis plots; panels: SpectraST, SEQUEST]


Comparison of IDs

                                                   SpectraST             SEQUEST
Positive hits (P > 0.99 IDs)                       427,056 (22% more)    349,795
Intersection (matched positive hits)               177,551 (41%)         177,551 (51%)
Extra hits (not in intersection):
  Matched lower-confidence hits                    82,437 (19%)          15,733 (4%)
  Not matched, but presence of peptide confirmed   167,557 (39%)         97,918 (28%)
  Missed by SEQUEST
    Manually determined to be correct              3,836 (1%)
    Manually determined to be incorrect            39 (0.01%)
  Missed by SpectraST
    Not in spectral library                                              58,316 (17%)
    In spectral library                                                  277 (0.08%)


Advantages of Spectral Searching (1)

• Smaller search space
  • Only searching against peptides that are known to occur in shotgun proteomics – not putative proteolytic peptides of the entire sequence, most of which will never be observed experimentally
  • No need to search multiple charge states

which leads to…
• Huge savings in time
• More confidence in peptide ID if made multiple times in the past


Search Space

Criterion                                                                                Search space
Sequence characteristics                                         Precursor m/z range    Sequence Search    Spectral Search
Tryptic, with no missed internal cleavage, +2/+3                 300 – 2000             3.5 x 10^5         1.1 x 10^4
                                                                 800 – 803              7.2 x 10^2         52
At least semi-tryptic, with at most 1 missed internal
cleavage, +2/+3                                                  300 – 2000             1.5 x 10^7         1.7 x 10^4
                                                                 800 – 803              3.6 x 10^4         77
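The size of the reduction is easier to see as a ratio. The short sketch below divides the sequence-search space by the spectral-search space for each row of the table; the numbers are those on the slide, while the row labels and variable names are shorthand introduced here.

```python
# (label, sequence-search space, spectral-search space), from the slide
ROWS = [
    ("tryptic, m/z 300-2000", 3.5e5, 1.1e4),
    ("tryptic, m/z 800-803", 7.2e2, 52),
    ("semi-tryptic, m/z 300-2000", 1.5e7, 1.7e4),
    ("semi-tryptic, m/z 800-803", 3.6e4, 77),
]

# How many times smaller the spectral search space is in each case.
ratios = {label: seq / spec for label, seq, spec in ROWS}

for label, ratio in ratios.items():
    print(f"{label}: sequence search space is ~{ratio:.0f}x larger")
```

The gap widens sharply once semi-tryptic peptides are allowed, because the sequence search space grows by enumeration while the library grows only with what has actually been observed.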


Advantages of Spectral Searching (2)

• More precise similarity scoring
  • Makes use of global similarity of spectra
  • Peak intensities are accounted for naturally
  • All consistently observed peaks – attributable to common ions or not – are used

which leads to…
• Simpler similarity scoring function, greater speed
• Greater separation between good and bad matches
• Superior sensitivity and error rates
• Increased ability to pick up more obscure matches


Spectra Viewer

Click to open viewer


Spectral Match

[Figure: spectrum panels: Library, Query]


Spectral Match


Theoretical Spectrum

[Figure: spectrum panels: Theoretical (SEQUEST), Query]


Noisy Query Spectrum

[Figure: spectrum panels: Library, Query]


Noisy Query Spectrum


Advantages of Spectral Searching (3)

• Implicitly searching multiple sequence search engines
  • Sequence search engines have their own idiosyncrasies, and will yield largely overlapping (~70% typical), but not identical, sets of peptide identifications
  • NIST spectral libraries are compiled by combining the confident hits of 4 different search engines
  • Spectral searching takes advantage of the strength of each of the engines without the additional time and effort

which leads to…
• More identifications
• Additional confidence for identifications found by multiple search engines


Potential Pitfalls

• Library coverage
  • Will only find peptides that are previously observed and represented in the library
  • OK for targeted proteomics: you know what you are looking for
  • Should improve over time with more data

• Library quality
  • A poorly constructed library will lead to false positives and negatives
    • Mis-identified library spectra
    • Noisy or impure library spectra
    • Similar spectra mapping to distinct peptides
  • Need stringent confidence thresholds and quality filters


Future Outlook

• Libraries
  • More organisms (Drosophila, mouse)
  • Coverage will improve with more data (Please contribute!)
    • More sampling conditions
    • More advanced instruments
    • More kinds of modifications
  • Custom library building functionalities in SpectraST future release – build your own libraries for your favorite organism/tissue/proteins

• Searching
  • Web interface on PeptideAtlas
  • Improvement in SpectraST workflow and scoring


Conclusions

• SpectraST is an open-source tool for spectral library searching of peptide MS/MS spectra
  • Fast
  • Improved sensitivity and error rate
  • More identifications
  • Integrated into TPP

• Spectral library searching is here and will get better
  • Applicable for non-discovery type experiments
  • Will eventually replace sequence searching in most typical shotgun proteomics workflows


Tutorial