Novel Peptide Identification using ESTs and Sequence Database Compression
![Page 1: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/1.jpg)
Novel Peptide Identification using ESTs and Sequence Database Compression
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
![Page 2: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/2.jpg)
2
What is missing from protein sequence databases?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
![Page 3: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/3.jpg)
3
Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
![Page 4: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/4.jpg)
4
Novel Splice Isoform
![Page 5: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/5.jpg)
5
Novel Splice Isoform
![Page 6: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/6.jpg)
6
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
![Page 7: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/7.jpg)
7
Novel Mutation
![Page 8: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/8.jpg)
8
Searching ESTs
• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.
• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret
• Make EST searching feasible for routine searching to discover novel peptides.
![Page 9: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/9.jpg)
9
Searching Expressed Sequence Tags (ESTs)
Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples
Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%
![Page 10: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/10.jpg)
10
Other Search Strategies
• Genome Corrected ESTs
  • Large (2Gb)
  • Controls for nucleotide error rate
  • Polymorphism lost, potential errors introduced
• Genome Clustered ESTs
  • Small, Gene model
  • Convergence to well-understood isoforms
  • Controls nucleotide error rate
• Full-Length mRNAs
  • Incomplete gene coverage, “most” are already in IPI
![Page 11: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/11.jpg)
11
Other Search Strategies
• Genome
  • Large (6Gb), lots of non-coding DNA
  • Find novel ORFs, no sampling bias
  • Miss spliced peptide sequences.
• Genscan Exons
  • Small, find novel ORFs.
  • Miss spliced peptide sequences.
• How should we interpret peptide identifications with no mRNA evidence?
![Page 12: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/12.jpg)
12
Compressed EST Peptide Sequence Database
• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
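The first two construction steps (six-frame translation, then dropping ORFs shorter than 30 amino acids) can be sketched as follows. This is a minimal illustration, not the pipeline used for the database: the function name `six_frame_orfs` is hypothetical, an "ORF" is assumed to mean a stop-to-stop stretch, the standard genetic code is assumed, and codons containing ambiguous bases are assumed to translate to `X`.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(est, min_len=30):
    """Yield stop-free stretches of at least min_len residues from all six frames."""
    est = est.upper()
    reverse = est.translate(COMPLEMENT)[::-1]
    for strand in (est, reverse):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Assumption: any codon with a non-ACGT base becomes "X".
            protein = "".join(CODON.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf
```

The 30-mer counting and C2 compression steps would then operate on the peptide sequences this generator yields.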
![Page 14: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/14.jpg)
14
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
![Page 15: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/15.jpg)
15
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
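A small sketch of how the compressed SBH-graph for these three sequences might be built. It assumes k = 4 (the slides do not state k for this example), treats a node as "trivial" when it has exactly one predecessor and one successor, and uses hypothetical function names. Cycles made only of trivial nodes are not handled.

```python
from collections import defaultdict


def sbh_graph(seqs, k):
    """Nodes are (k-1)-mers; each distinct k-mer is an edge from prefix to suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            u, v = s[i:i + k - 1], s[i + 1:i + k]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def compressed_edges(seqs, k):
    """Collapse chains of trivial nodes; return one label per compressed edge."""
    succ, pred = sbh_graph(seqs, k)
    nodes = set(succ) | set(pred)
    trivial = {n for n in nodes if len(succ[n]) == 1 and len(pred[n]) == 1}
    edges = []
    for u in sorted(nodes - trivial):
        for v in sorted(succ[u]):
            label = u + v[-1]
            while v in trivial:            # follow the chain to the next non-trivial node
                v = next(iter(succ[v]))
                label += v[-1]
            edges.append(label)
    return edges


print(compressed_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4))
```

Under these assumptions the example collapses to five edges between the non-trivial nodes ACD, DEF, EFG, ACG, and FGI.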
![Page 16: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/16.jpg)
16
Sequence Databases & CSBH-graphs
• Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
![Page 17: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/17.jpg)
17
Sequence Databases & CSBH-graphs
• All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 2, 2, 1, 2, 1]
![Page 18: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/18.jpg)
18
CSBH-graphs
• Quickly determine which k-mers occur at least twice
[Figure: CSBH-graph with edge counts 2, 2, 1, 2]
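For comparison, the same question can be answered by brute force with a plain k-mer counter. This is only a baseline sketch, not the graph-based method the slide refers to; the function name is hypothetical and k = 4 is assumed for the running example.

```python
from collections import Counter


def repeated_kmers(seqs, k, min_count=2):
    """Return the k-mers that occur at least min_count times across all sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


print(sorted(repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)))
```

In the CSBH-graph the counts are stored once per edge rather than once per k-mer, which is what makes the check quick on a large database.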
![Page 19: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/19.jpg)
19
de Bruijn Sequences
A de Bruijn sequence contains every word of length k over some alphabet A exactly once as a substring.
A = {0,1}, k = 3: s = 0001110100
A = {0,1}, k = 4: s = 0000111101011001000
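Both example sequences can be checked mechanically. The helper below (a hypothetical name) tests whether a linear string contains every word of length k over the alphabet exactly once.

```python
from itertools import product


def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over alphabet occurs exactly once in s."""
    windows = [s[i:i + k] for i in range(len(s) - k + 1)]
    words = {"".join(w) for w in product(alphabet, repeat=k)}
    return len(windows) == len(words) and set(windows) == words


print(is_de_bruijn("0001110100", "01", 3))
print(is_de_bruijn("0000111101011001000", "01", 4))
```

A linear sequence with this property has length |A|^k + k - 1, which gives 10 characters for k = 3 and 19 for k = 4.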
![Page 20: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/20.jpg)
20
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph with nodes 000, 001, 010, 011, 100, 101, 110, 111 and sixteen edges, each labeled 0 or 1]
![Page 21: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/21.jpg)
21
Correct, Complete, Compact (C3) Enumeration
• Set of paths that use each edge exactly once
ACDEFGEFGI, DEFACG
![Page 22: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/22.jpg)
22
Correct, Complete (C2) Enumeration
• Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
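The two definitions can be tested directly on k-mer sets. The sketch below assumes k = 4 for the running example and uses a hypothetical helper, `classify`, which reports whether a set of paths is a C3 enumeration (each k-mer exactly once), a C2 enumeration (each k-mer at least once, none invented), or neither.

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify(paths, originals, k):
    """Return "C3", "C2", or None for a candidate enumeration of the originals' k-mers."""
    target = {km for s in originals for km in kmers(s, k)}
    used = Counter(km for p in paths for km in kmers(p, k))
    if set(used) != target:
        return None                      # a k-mer is missing (incomplete) or extra (incorrect)
    return "C3" if max(used.values()) == 1 else "C2"


originals = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(classify(["ACDEFGEFGI", "DEFACG"], originals, 4))
print(classify(originals, originals, 4))
```

The original sequences are themselves complete and correct but repeat several 4-mers, so they qualify only as C2. The two-path enumeration covers the same 4-mers in 16 characters instead of 23.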
![Page 23: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/23.jpg)
23
Patching the CSBH-graph
• Use artificial edges to fix unbalanced nodes
![Page 24: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/24.jpg)
24
Patching the CSBH-graph
• Use matching-style formulations to choose artificial edges
  • Optimal C2/C3 enumeration in polynomial time.
• Chinese Postman Problem
  • Edmonds and Johnson, ’73
• l-tuple DNA sequencing
  • Pevzner, ’89
• Shortest (Common) Superstring
  • MAX-SNP-hard, 2.5 approx algorithm
![Page 25: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/25.jpg)
25
C3 Enumeration
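[Figure: two columns of nodes labeled with #in-#out. One column has values 1, 3, 2, 1, 3 and the other has values -2, -1, -4, -1, -2; artificial edges between the columns have cost k.]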
![Page 26: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/26.jpg)
26
C3 Enumeration
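[Figure: two columns of nodes labeled with #in-#out. One column has values 1, 3, 2, 1, 3 and the other has values -2, -1, -4, -1, -2; artificial edges between the columns have cost k. Two added nodes labeled 0 are attached by edges of cost 0.]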
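A rough sketch of C3 enumeration on a compressed graph: balance the nodes with artificial edges, walk Eulerian circuits, and cut each circuit at its artificial edges. The pairing of unbalanced nodes here is arbitrary, which is an assumption and not the matching formulation from the slides, so the number of paths is not guaranteed to be minimal on a disconnected graph. The function name and the edge-label input format are hypothetical.

```python
from collections import defaultdict


def c3_paths(edge_labels, k):
    """Cover every labeled edge exactly once with paths; return the path strings."""
    out = defaultdict(list)       # node -> [(next_node, label)]; label None = artificial
    balance = defaultdict(int)    # out-degree minus in-degree
    for lab in edge_labels:
        u, v = lab[:k - 1], lab[-(k - 1):]
        out[u].append((v, lab))
        balance[u] += 1
        balance[v] -= 1
    # Artificial edges: from nodes lacking out-edges to nodes lacking in-edges.
    need_out = [n for n, b in balance.items() for _ in range(-b)]
    need_in = [n for n, b in balance.items() for _ in range(b)]
    for s, t in zip(need_out, need_in):
        out[s].append((t, None))

    paths = []
    for start in list(out):
        if not out[start]:
            continue
        # Hierholzer's algorithm: extend until stuck, record edges while backtracking.
        stack, circuit = [(start, None)], []
        while stack:
            node = stack[-1][0]
            if out[node]:
                stack.append(out[node].pop())
            else:
                circuit.append(stack.pop())
        edges = [lab for _, lab in reversed(circuit[:-1])]
        if None in edges:
            i = edges.index(None)
            edges = edges[i + 1:] + edges[:i + 1]   # rotate so an artificial edge is last
        else:
            edges.append(None)                      # closed cycle: cut it at the start
        walk = []
        for lab in edges:
            if lab is not None:
                walk.append(lab)
            elif walk:
                paths.append(walk[0] + "".join(e[k - 1:] for e in walk[1:]))
                walk = []
    return paths


print(c3_paths(["ACDEF", "DEFACG", "DEFG", "EFGEFG", "EFGI"], 4))
```

The input in the usage line is the set of compressed edges for ACDEFGI, ACDEFACG, DEFGEFGI with k = 4 assumed. Two artificial edges are needed, so the result is two paths.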
![Page 27: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/27.jpg)
27
Reusing Edges
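[Figure: CSBH-graph with nodes ACD and HAC; edges EHAC, FHAC, and GHAC run from ACD to HAC, and edge D runs from HAC back to ACD]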
• ACDEHAC, ACDFHAC, ACDGHACD
![Page 28: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/28.jpg)
28
Reusing Edges
[Figure: CSBH-graph with nodes ACD and HAC; edges EHAC, FHAC, GHAC, and D, plus an artificial edge $ACD]
• C3: ACDEHACDFHAC, ACDGHACD
![Page 29: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/29.jpg)
29
Reusing Edges
[Figure: CSBH-graph with nodes ACD and HAC; edges EHAC, FHAC, and GHAC, with the edge D drawn twice]
• C2: ACDEHACDFHACDGHAC
![Page 30: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/30.jpg)
30
C2 Enumeration
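[Figure: two columns of nodes labeled with #in-#out. One column has values 1, 3, 2, 1, 3 and the other has values -2, -1, -4, -1, -2; “shortcut paths” between them are labeled 4, 7, and 10.]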
![Page 31: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/31.jpg)
31
Implementation
• CSBH-graph construction
  • Determine non-trivial nodes directly
  • Consecutive non-trivial nodes determine edges
• C3/C2 enumeration
  • C3: Trivial “assignment” of artificial edges
  • C2: Depth-first search & Goldberg’s CS2 min cost flow code
  • Eulerian path algorithm
• Can be applied to entire EST database
  • Condor grid and PBS cluster for CSBH-graph construction
  • Large memory machine for C3/C2 enumeration
![Page 32: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/32.jpg)
32
Conclusions
• Peptides identify more than just proteins
• Compressed peptide sequence databases make routine EST searching feasible
  • Currently available for download
  • Can include other sources of peptide sequence at little additional cost.
• CSBH-graph + edge counts + C2/C3 enumeration algorithms
  • Minimal FASTA representation of k-mer sets
![Page 33: Novel Peptide Identification using ESTs and Sequence Database Compression](https://reader035.vdocuments.net/reader035/viewer/2022062423/56814cf9550346895dba0c40/html5/thumbnails/33.jpg)
33
Acknowledgements
• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
• Catherine Fenselau, Crystal Harvey
  • UMCP Biochemistry
• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: National Cancer Institute