Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Synopsis

• MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.

• Key concepts:
  • Spectrum acquisition is unbiased
  • Direct observation of amino-acid sequence
  • Sensitive to small sequence variations

• Applications:
  • Cancer biomarkers
  • Genome annotation

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Mass Spectrometer

Sample → Ionizer → Mass Analyzer → Detector

• Ionizer: MALDI, Electro-Spray Ionization (ESI)
• Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap
• Detector: Electron Multiplier (EM)

High Bandwidth

[Figure: spectrum plot; axis ticks at 0, 250, 500, 750, 1000]

Mass is fundamental!

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?
• Ionization methods
  • MALDI, Electrospray
• Protein chemistry & automation
  • Chromatography, Gels, Computers
• Protein / genome sequences
  • A reference for comparison

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

Single Stage MS

Tandem Mass Spectrometry (MS/MS)

Precursor selection

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from (any) sequence database
  • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...

• Automated, high-throughput peptide identification in complex mixtures
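
The three-step loop above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine scores spectra: it assumes singly charged b- and y-ions only, monoisotopic residue masses, a fixed m/z tolerance, and a score that simply counts matched fragments. The names `fragment_masses`, `match_score`, and `identify` are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)          # b-ion: first i residues
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y-ion: remaining residues
    return ions


def match_score(peptide, peaks, tolerance=0.5):
    """Step 2: count theoretical fragments that have an observed peak nearby."""
    return sum(
        1 for m in fragment_masses(peptide)
        if any(abs(m - p) <= tolerance for p in peaks)
    )


def identify(peaks, candidates, tolerance=0.5, min_score=1):
    """Step 3: retain candidates that match well, best first."""
    scored = [(match_score(c, peaks, tolerance), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= min_score), reverse=True)


# Toy example: an idealized "spectrum" made of MVMAR's own fragment m/z values.
spectrum = fragment_masses("MVMAR")
print(identify(spectrum, ["MARNR", "MVMAR"]))
```

In this toy case the true peptide matches all eight of its fragments, while the competing candidate shares only its b1 and y1 ions. Real spectra also carry noise peaks, missing fragments, and intensities, all of which this count ignores.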

Peptide Identification

...can provide direct experimental evidence for the amino-acid sequence of functional proteins.

Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames

Why is this useful for ... genome annotation?

• Evidence for SNPs and alternative splicing stops at the transcript level; it does not show that a variant is translated.

• No genomic or transcript evidence for translation start-site.

• Conservation doesn’t stop at coding bases!

• Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.

Why is this useful for ... cancer biomarkers?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
  • Some splicing is believed to be silencing
  • Lots of splicing in cancer

• Proteins have clinical implications
  • Statistical biomarker discovery
  • Putative malfunctioning proteins

What can be observed?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Microexons (non-canonical splice-sites)

• Alternative translation start-sites (codons)

• Alternative translation frames

• “Dark” open-reading-frames

Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

Splice Isoform

Novel Splice Isoform

Novel Mutation

• HUPO Plasma Proteome Project
  • Pooled samples from 10 male & 10 female healthy Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)

• TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
    • late-onset, dominant inheritance

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy

Novel Mutation

Translation Start-Site

• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.

• THOC2 gene:
  • Part of the heteromultimeric THO/TREX complex.

• Initially believed to be a “novel” ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no SwissProt entry
  • Genbank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • ~ 100,000 bases upstream of other isoforms

Translation Start-Site

Easily distinguish minor sequence variations

Two B. anthracis Sterne α/β SASP annotations

• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)

• Intact proteins differ by 230 Da
  • 7441 Da vs 7211 Da

• N-terminal tryptic peptides:
  • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
  • MARNR (646.3 Da)
  • Very different MS/MS spectra
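
The peptide masses quoted above can be reproduced from monoisotopic residue masses plus one water per peptide. The sketch below is only an arithmetic check; `peptide_mass` is a hypothetical helper name, and only the residues needed for these peptides are tabulated.

```python
# Monoisotopic residue masses (Da) for the residues used on this slide.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056  # added once per peptide (free N- and C-termini)


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


for pep in ("MVMAR", "MVMARNR", "MARNR"):
    print(f"{pep}: {peptide_mass(pep):.1f} Da")

# The two annotations differ by the N-terminal residues M and V.
print(f"M + V: {RESIDUE_MASS['M'] + RESIDUE_MASS['V']:.1f} Da")
```

The computed values agree with the slide: 606.3, 876.4, and 646.3 Da for the peptides, and about 230 Da for the extra Met-Val, which matches the difference between the two intact-protein masses.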

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

Stop codon

• Find all the open-reading-frames...

...courtesy of Art Delcher

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

Stop codon

…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…

Shifted stop

Stop codon

Reverse strand

• Find all the open-reading-frames...

...but they overlap – which ones are correct?
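
A minimal six-frame ORF scan makes the overlap problem concrete. This sketch assumes an ORF runs from the first ATG after a stop codon to the next in-frame stop, and it reports only ORFs that are complete within the given fragment. Real gene-finders such as Glimmer3 also consider alternative start codons and score each candidate; none of that is modeled here. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """ATG-to-stop ORFs (stop codon included) in one reading frame."""
    orfs = []
    start = None
    for pos in range(frame, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon == "ATG" and start is None:
            start = pos                       # first ATG since the last stop
        elif codon in STOP_CODONS:
            if start is not None:
                orfs.append(seq[start:pos + 3])
            start = None
    return orfs


def six_frame_orfs(seq):
    """(strand, frame, ORF sequence) for all three frames on both strands."""
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame):
                found.append((strand, frame, orf))
    return found


# The fragment shown on the slide (forward strand).
dna = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for hit in six_frame_orfs(dna):
    print(hit)
```

On the slide's fragment this finds one short ORF on the forward strand (ending at the shifted stop) and one on the reverse strand, and the two cover some of the same bases. That is exactly the ambiguity a coding-sequence score has to resolve.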

Coding-Sequence “Score”

Glimmer3 Performance

Genome                                                     Glimmer3 Predictions
Organism                           Length   GC%   # Genes  Matches       %  Correct Starts      %  Extra
Archaeoglobus fulgidus             2.18 Mb  48.6     1165     1162   99.7%             875  75.1%   1305
Bacillus anthracis                 5.23 Mb  35.4     3132     3129   99.9%            2768  88.4%   2340
Bacillus subtilis                  4.21 Mb  43.5     1576     1567   99.4%            1429  90.7%   2879
Campylobacter jejuni               1.78 Mb  30.3     1233     1233  100.0%            1149  93.2%    668
Carboxydothermus hydrogenoformans  2.40 Mb  42.0     1753     1752   99.9%            1590  90.7%    865
Caulobacter crescentus             4.02 Mb  67.2     2192     2187   99.8%            1552  70.8%   1559
Chlorobium tepidum                 2.15 Mb  56.5     1292     1289   99.8%             949  73.5%    765
Clostridium perfringens            3.03 Mb  28.6     1504     1503   99.9%            1385  92.1%   1178
Colwellia psychrerythraea          5.37 Mb  38.0     3063     3060   99.9%            2663  86.9%   1714
Dehalococcoides ethenogenes        1.47 Mb  48.9     1069     1059   99.1%             929  86.9%    483
Escherichia coli                   4.64 Mb  50.8     3603     3553   98.6%            3150  87.4%    913
Geobacter sulfurreducens           3.81 Mb  60.9     2351     2340   99.5%            1974  84.0%   1091
Haemophilus influenzae             1.83 Mb  38.1     1170     1170  100.0%            1054  90.1%    639
Helicobacter pylori                1.67 Mb  38.9      915      914   99.9%             805  88.0%    765
Listeria monocytogenes             2.91 Mb  38.0     1966     1965   99.9%            1797  91.4%    845
Methylococcus capsulatus           3.30 Mb  63.6     2015     2005   99.5%            1542  76.5%   1231
Mycobacterium tuberculosis         4.40 Mb  65.6     2217     2205   99.5%            1493  67.3%   2104
Neisseria meningitidis             2.27 Mb  51.5     1232     1217   98.8%            1042  84.6%   1329
Porphyromonas gingivalis           2.34 Mb  48.3     1200     1198   99.8%             933  77.8%    887
Pseudomonas fluorescens            7.07 Mb  63.3     4535     4503   99.3%            3577  78.9%   1871
Pseudomonas putida                 6.18 Mb  61.5     3633     3596   99.0%            2825  77.8%   1916
Ralstonia solanacearum             3.72 Mb  67.0     2512     2487   99.0%            2061  82.0%   1077
Staphylococcus epidermidis         2.62 Mb  32.1     1650     1649   99.9%            1511  91.6%    771
Streptococcus agalactiae           2.16 Mb  35.6     1441     1438   99.8%            1336  92.7%    683
Streptococcus pneumoniae           2.16 Mb  39.7     1359     1355   99.7%            1214  89.3%    780
Thermotoga maritima                1.86 Mb  46.2     1092     1090   99.8%             892  81.7%    804
Treponema denticola                2.84 Mb  37.9     1463     1463  100.0%            1332  91.0%   1210
Treponema pallidum                 1.14 Mb  52.8      575      572   99.5%             425  73.9%    557
Ureaplasma parvum                  0.75 Mb  25.5      327      327  100.0%             300  91.7%    293
Wolbachia endosymbiont             1.08 Mb  34.2      628      627   99.8%             528  84.1%    537
Averages:                                                            99.6%                  84.3%

• Glimmer3 trained & compared to RefSeq genes with annotated function

• Correct STOP: 99.6%

• Correct START: 84.3%

• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”

N-terminal peptides

• (Protein) N-terminal peptides establish
  • start-site of known & unexpected ORFs

Use:
• Directly to annotate genomes
• Evaluate and improve algorithms
• Map cross-species

N-terminal peptide workflows

• Typical proteomics workflows sample peptides from the proteome “randomly”

• Caulobacter crescentus (70%)
  • 3733 Proteins (RefSeq Genome annot.)
  • 66K tryptic peptides (600 Da to 3000 Da)
  • 2085 N-terminal tryptic peptides (3%)
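
Counts like these can be estimated with an in-silico tryptic digest. The sketch below assumes the usual trypsin rule (cleave after K or R, but not before P) and an optional number of missed cleavages; the slide does not say exactly how its figures were computed, so the function name, parameters, and rule are illustrative. Each peptide is returned with its start position, so N-terminal peptides are those starting at 0.

```python
def tryptic_peptides(protein, missed_cleavages=0):
    """Return (start, peptide) pairs from an in-silico tryptic digest."""
    # Cleavage sites: after K or R, unless the next residue is P.
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for i in range(len(sites) - 1):
        # Join up to `missed_cleavages` extra adjacent fragments.
        last = min(i + 2 + missed_cleavages, len(sites))
        for j in range(i + 1, last):
            peptides.append((sites[i], protein[sites[i]:sites[j]]))
    return peptides


# Toy input: the N-terminal prefix of the RefSeq SASP annotation.
digest = tryptic_peptides("MVMARNR", missed_cleavages=1)
n_terminal = [pep for start, pep in digest if start == 0]
print(digest)
print(n_terminal)
```

Applied to every protein in a proteome and filtered by mass (for example 600 Da to 3000 Da), the ratio of N-terminal peptides to all peptides gives the small fraction that a "random" sampling workflow would be expected to hit.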

N-terminal peptide workflow

• Protect protein N-terminus

• Digest to peptides

• Chemically modify free peptide N-term

• Use chem. mod. to capture unwanted peptides

Nat Biotech, Vol. 21, pp. 566-569, 2003.

Increasing N-terminal peptide coverage

• Multiple (digest) enzymes:
  • trypsin-R: 60% (80%)
  • acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
• Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation

Anal Chem, Vol. 76, pp. 4193-4201, 2004.

Proteomics Informatics

• Search spectra against:
  • Entire bacterial genome;
  • All Met initiated peptides; or
  • Statistically likely Met initiated peptides.

• Easily consider initial Met loss PTM, too

• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
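
One way to build the "all Met initiated peptides" search space is to translate from every ATG and emit each N-terminal peptide both with and without its initial Met. The sketch below is a simplified assumption of how that could look: forward strand only, ATG as the only start codon, translation stopping at the first K or R (ignoring the proline rule and missed cleavages) or at a stop codon. `n_terminal_candidates` is a hypothetical name; the resulting peptides would then be written to a sequence file for an off-the-shelf search engine.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def n_terminal_candidates(dna):
    """Candidate N-terminal peptides for every ATG on the forward strand."""
    candidates = set()
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        peptide = ""
        for pos in range(start, len(dna) - 2, 3):
            aa = CODON_TABLE[dna[pos:pos + 3]]
            if aa == "*":          # stop codon ends translation
                break
            peptide += aa
            if aa in "KR":         # first tryptic cleavage site ends the peptide
                break
        candidates.add(peptide)            # initial Met retained
        if len(peptide) > 1:
            candidates.add(peptide[1:])    # initial Met removed
    return sorted(candidates)


# Toy input: the DNA fragment from the gene-finding slides.
print(n_terminal_candidates("TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"))
```

The toy fragment yields peptides far too short to identify by MS/MS; on a real genome the same enumeration would be followed by a mass filter and, for the third option on the slide, a filter on how likely each ATG is to be a true start.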

Other Practical Issues

• Suitable for commonly available instrumentation
  • Only the sample prep. is (somewhat) novel.

• Need living organism
  • Stage of life-cycle?

• Bang for buck?
  • N-terminal peptides / $$$$

• In discussions with JCVI (ex TIGR)
  • Possible pilot project?

Other Research Projects

• Improving peptide identification by MS/MS
  • Spectral matching using HMMs
  • Combining search engine results
  • Spectral matching for detection and quantitation

• Microorganism identification using MS
  • Live public web-site and database

• (Inexact) uniqueness guarantees
  • Primer/Probe oligo design
  • Pathogen detection (DNA & Peptide)
  • Significant false-positive peptide identifications

Spectral Matching

• Detection vs. identification
  • Increased sensitivity
  • No novel peptides

• NIST GC/MS Spectral Library
  • Identifies small molecules
  • 100,000’s of (consensus) spectra
  • Bundled/Sold with many instruments
  • “Dot-product” spectral comparison
  • Current project: Peptide MS/MS
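
A "dot-product" comparison treats each spectrum as a vector of binned peak intensities and scores the cosine of the angle between two such vectors. The sketch below is a plain normalized dot product with a fixed bin width; it is an assumption for illustration and does not reproduce the intensity or mass weighting that a production library search may apply. The function names and the `bin_width` parameter are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two peak lists."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy peak lists of (m/z, intensity); values are made up for illustration.
library = [(175.1, 80.0), (246.2, 40.0), (377.2, 100.0)]
query = [(175.1, 40.0), (246.2, 20.0), (377.2, 50.0)]   # same shape, half intensity
other = [(132.0, 70.0), (203.1, 30.0)]                  # no shared bins

print(dot_product_score(library, query))
print(dot_product_score(library, other))
```

Because the score is normalized, a query with the same relative intensities as the library spectrum scores 1 regardless of absolute signal, and spectra with no shared bins score 0. This also shows why spectral matching cannot report novel peptides: a query can only be matched to something already in the library.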

Peptide DLATVYVDVLK

Hidden Markov Models for Spectral Matching

• Capture statistical variation and consensus in peak intensity

• Capture semantics of peaks
  • Extrapolate model to other peptides

• Good specificity with superior sensitivity for peptide detection
  • Assign 1000’s of additional spectra (w/ p-value < 10^-5)

www.RMIDb.org

Statistics:
• 16.7 × 10^6 (6.4 × 10^6) protein sequences
• ~ 40,000 organisms, ~ 19,700 species
• 557 (415) complete genomes

Sources:
• TIGR’s CMR, SwissProt, TrEMBL, Genbank Proteins, RefSeq Proteins & Genomes
• Inclusive Glimmer3 predictions on Genomes
• Pfam and GO assignments using BOINC grid

www.RMIDb.org

Accessed from all over the world...

Uniqueness guarantees

• 20-mer oligo signatures for B. anthracis
  • In all available strains as exact match
  • No (inexact) match to other Bacillus species

Specificity   # Signatures   % of genome
Exact              2035086         39.4%
k = 1               866787         16.8%
k = 2                75795          1.5%
k = 3                  174        0.003%

Uniqueness guarantees

• Human genome primer design problem

• “4-unique” DNA 20-mers:
  • Edit-distance ≥ 5 to any non-specific hybridization site
  • No such valid loci on Chr. 22!
  • Currently analyzing entire genome

• “3-unique” DNA 20-mers:
  • Initial experiments suggest ~ 0.01% valid
  • Approx. 1 valid oligo every 10,000 bases
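
The k-unique condition can be stated as a small dynamic program: an oligo is k-unique if its minimum edit distance to every substring of the background sequence exceeds k. The brute-force sketch below uses the standard approximate-matching recurrence, in which a match may begin anywhere in the text at no cost. It only defines the property; it is quadratic in time and is not the indexing approach a genome-scale analysis would need. The function names are hypothetical, and `background` is assumed to exclude the oligo's own intended locus.

```python
def min_edit_distance_to_substring(oligo, text):
    """Smallest edit distance between `oligo` and any substring of `text`."""
    # Row 0 is all zeros: a match may start at any position in the text.
    prev = [0] * (len(text) + 1)
    for i, base in enumerate(oligo, start=1):
        curr = [i]  # aligning i oligo bases against an empty substring
        for j, t in enumerate(text, start=1):
            curr.append(min(
                prev[j - 1] + (base != t),  # match or substitution
                prev[j] + 1,                # oligo base left unaligned
                curr[j - 1] + 1,            # extra base in the text
            ))
        prev = curr
    return min(prev)  # a match may end at any position in the text


def is_k_unique(oligo, background, k):
    """True if no site in `background` is within edit distance k of `oligo`."""
    return min_edit_distance_to_substring(oligo, background) > k


# Toy sequences, far shorter than real 20-mers, for illustration only.
print(min_edit_distance_to_substring("ACGT", "TTACGTTT"))  # exact site present
print(min_edit_distance_to_substring("ACGT", "TTAGGTTT"))  # one substitution away
print(is_k_unique("AAAA", "CCCC", 3))
```

Under this definition a “4-unique” 20-mer is one for which `is_k_unique(oligo, background, 4)` holds, which is the same as requiring edit distance of at least 5 to every non-specific site.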

Future Research Plans

• Cancer biomarkers:

• Optimize proteomics workflow for protein sequence coverage

• Improve informatics infrastructure to make interpretation easier

• Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples

• Genome Annotation

• Collect evidence for functional alternative splicing in public datasets into dbPEP.

• Conduct pilot project for bacterial genome annotation with JCVI.

• Improve informatics infrastructure to make interpretation easier.

• Peptide Identification

• Expand library of HMM models for high-confidence spectral matching

• Spectral matching for biomarkers and quantitation (with Calibrant).

• Specificity metric for peptides identified using MS/MS

• Microorganism identification by mass spectrometry

• Specificity of tandem mass spectra

• Revamp RMIDb prototype

• Incorporate spectral matching, top-down.

• Oligonucleotide Design

• Uniqueness oracle for inexact match in human

• Integration with Primer3

• Tiling, multiplexing, pooling, & tag arrays

Acknowledgements

• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Cheng Lee, Brian Balgley
  • Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: NIH/NCI, USDA/ARS
