Novel Peptide Identification using ESTs and Sequence Database Compression

Nathan Edwards

Center for Bioinformatics and Computational Biology

University of Maryland, College Park

2

What is missing from protein sequence databases?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

3

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

4

Novel Splice Isoform

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

5

Novel Splice Isoform

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr20:61839670-61839828&hgsid=68112320&est=pack

6

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300002887.18.xml&uid=202568&label=AAAKEPZA&homolog=AAAKEPZA&id=1838.1.1&proex=-1

7

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr18:27426944-27426971&hgsid=68063647

8

Searching ESTs

• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.

• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret

• Make EST searching feasible for routine use, to discover novel peptides.

9

Searching Expressed Sequence Tags (ESTs)

Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples

Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%

10

Other Search Strategies

• Genome Corrected ESTs
  • Large (2Gb)
  • Controls for nucleotide error rate
  • Polymorphism lost, potential errors introduced

• Genome Clustered ESTs
  • Small, Gene model
  • Convergence to well-understood isoforms
  • Controls nucleotide error rate

• Full-Length mRNAs
  • Incomplete gene coverage, “most” are already in IPI

11

Other Search Strategies

• Genome
  • Large (6Gb), lots of non-coding DNA
  • Find novel ORFs, no sampling bias
  • Miss spliced peptide sequences.

• Genscan Exons
  • Small, find novel ORFs.
  • Miss spliced peptide sequences.

• How should we interpret peptide identifications with no mRNA evidence?

12

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
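The first two steps above (six-frame translation, then dropping ORFs shorter than 30 amino acids) can be sketched as follows. This is a minimal illustration, not the pipeline used for the talk: it assumes the standard genetic code, treats an ORF as any stretch between stop codons (no start codon required), and maps codons containing ambiguous bases to `X`. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(est, min_len=30):
    """Return stop-to-stop ORFs of at least min_len residues from all six frames."""
    est = est.upper()
    reverse_complement = est.translate(COMPLEMENT)[::-1]
    orfs = []
    for strand in (est, reverse_complement):
        for offset in range(3):
            protein = translate(strand[offset:])
            # Stop codons ('*') split the frame into candidate ORFs.
            orfs.extend(p for p in protein.split("*") if len(p) >= min_len)
    return orfs
```

For example, `six_frame_orfs("GCT" * 10)` returns an empty list, because no frame of a 30-nucleotide EST can yield 30 residues.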

14

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
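As a rough sketch of how such a graph can be built from the three example sequences: each (k-1)-mer becomes a node and each k-mer an edge from its prefix to its suffix. The choice k = 4 is an assumption (the slide text does not state k), and `sbh_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer is an edge from its prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

With these inputs, node `DEF` branches to `EFG` and `EFA`, and node `EFG` branches to `FGI` and `FGE`.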

15

Compressed SBH-graph


16

Sequence Databases & CSBH-graphs

• Original sequences correspond to paths


17

Sequence Databases & CSBH-graphs

• All k-mers represented by an edge have the same count

[Figure: CSBH-graph with edge counts 2, 2, 1, 2, 1]

18

CSBH-graphs

• Quickly determine which k-mers occur at least twice

[Figure: CSBH-graph showing edge counts 2, 2, 1, 2]
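The question the edge counts answer can be illustrated with a plain counter over k-mers. This sketch deliberately skips the CSBH-graph and counts every k-mer occurrence directly, so it shows the result, not the fast method; k = 4 and the helper name are assumptions.

```python
from collections import Counter


def repeated_kmers(sequences, k):
    """Return the k-mers that occur at least twice across all sequences."""
    counts = Counter(seq[i:i + k]
                     for seq in sequences
                     for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= 2}


repeated = repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
# repeated == {"ACDE", "CDEF", "DEFG", "EFGI"}
```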

19

de Bruijn Sequences

de Bruijn sequences represent all words of length k from some alphabet A.

A = {0,1}, k = 3: s = 0001110100

A = {0,1}, k = 4: s = 0000111101011001000
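Both example strings can be checked mechanically. The sketch below treats a de Bruijn sequence as a linear string in which every length-k word over the alphabet appears exactly once; `is_de_bruijn` is a hypothetical helper.

```python
from itertools import product


def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over alphabet occurs exactly once in s."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(w) for w in product(alphabet, repeat=k)}
    # Same number of windows as words, and the same set, means each occurs once.
    return len(words) == len(expected) and set(words) == expected


print(is_de_bruijn("0001110100", "01", 3))           # True
print(is_de_bruijn("0000111101011001000", "01", 4))  # True
```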

20

de Bruijn Graph: A = {0,1}, k = 4

[Figure: de Bruijn graph with nodes 000, 001, 010, 011, 100, 101, 110, 111; each of the 16 edges is labeled 0 or 1]
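A de Bruijn sequence can be read off an Eulerian circuit of this graph, since each edge corresponds to one word of length k. The sketch below uses an iterative form of Hierholzer's algorithm; it is one standard way to do this and is not taken from the talk. It assumes k >= 2, and the sequence it produces need not match the one shown on the earlier slide.

```python
from itertools import product


def de_bruijn_sequence(alphabet, k):
    """Spell a linear de Bruijn sequence by walking an Eulerian circuit (k >= 2)."""
    # Every (k-1)-mer node has one unused outgoing edge per alphabet symbol.
    unused = {"".join(n): list(alphabet)
              for n in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if unused[node]:
            symbol = unused[node].pop()
            stack.append((node + symbol)[1:])  # follow the edge
        else:
            circuit.append(stack.pop())        # dead end: emit node
    circuit.reverse()
    # The first node contributes k-1 symbols, each later node one symbol.
    return start + "".join(node[-1] for node in circuit[1:])


s = de_bruijn_sequence("01", 4)
```

For A = {0,1} and k = 4 the result has length 2^4 + 3 = 19 and contains each of the 16 binary 4-mers once.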

21

Correct, Complete, Compact (C3) Enumeration

• Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG

22

Correct, Complete (C2) Enumeration

• Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
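The three properties can be checked directly on the example. In this sketch, "complete" means every k-mer of the original sequences appears in the enumeration, "correct" means no new k-mer is introduced, and "compact" means no k-mer is repeated. The value k = 4 is an assumption that happens to be consistent with the example, and the helper names are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across the given sequences."""
    return Counter(seq[i:i + k]
                   for seq in sequences
                   for i in range(len(seq) - k + 1))


def classify_enumeration(original, enumeration, k):
    """Report whether enumeration is complete, correct, and compact for k-mers."""
    source = set(kmer_counts(original, k))
    counts = kmer_counts(enumeration, k)
    return {
        "complete": source <= set(counts),                # nothing lost
        "correct": set(counts) <= source,                 # nothing invented
        "compact": all(n == 1 for n in counts.values()),  # nothing repeated
    }


result = classify_enumeration(
    ["ACDEFGI", "ACDEFACG", "DEFGEFGI"],
    ["ACDEFGEFGI", "DEFACG"],
    k=4,
)
# result == {"complete": True, "correct": True, "compact": True}
```

Here the enumeration satisfies all three properties, so it is a C3 enumeration and therefore also a C2 enumeration: 16 residues in place of the original 23.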

23

Patching the CSBH-graph

• Use artificial edges to fix unbalanced nodes

24

Patching the CSBH-graph

• Use matching-style formulations to choose artificial edges
  • Optimal C2/C3 enumeration in polynomial time.

• Chinese Postman Problem
  • Edmonds and Johnson, ’73

• l-tuple DNA sequencing
  • Pevzner, ’89

• Shortest (Common) Superstring
  • MAX-SNP-hard, 2.5 approx algorithm
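To make "unbalanced nodes" concrete, the sketch below computes #in - #out for each (k-1)-mer node, counting each distinct k-mer as one edge. It only identifies the nodes that need artificial edges; it does not solve the matching problem. The value k = 4 and the helper name are assumptions.

```python
from collections import Counter


def node_imbalance(sequences, k):
    """Return #in - #out for every unbalanced (k-1)-mer node."""
    kmers = {seq[i:i + k]
             for seq in sequences
             for i in range(len(seq) - k + 1)}
    balance = Counter()
    for kmer in kmers:
        balance[kmer[1:]] += 1    # the edge enters its suffix node
        balance[kmer[:-1]] -= 1   # the edge leaves its prefix node
    return {node: b for node, b in balance.items() if b != 0}


imbalance = node_imbalance(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
min_paths = sum(b for b in imbalance.values() if b > 0)
# imbalance == {"ACD": -1, "DEF": -1, "FGI": 1, "ACG": 1}; min_paths == 2
```

When every edge is used exactly once, each unit of surplus in-degree forces a path to end at that node, so the sum of the positive values is a lower bound on the number of paths. For this example the bound is 2.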

25

C3 Enumeration

[Figure: two columns of unbalanced nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2), joined by artificial edges of cost k]

26

C3 Enumeration

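[Figure: two columns of unbalanced nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2), joined by artificial edges of cost k, plus two added nodes labeled 0 attached by edges of cost 0]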

27

Reusing Edges

[Figure: CSBH-graph with nodes ACD and HAC; edges EHAC, FHAC, and GHAC from ACD to HAC, and edge D from HAC to ACD]

• ACDEHAC, ACDFHAC, ACDGHACD

28

Reusing Edges

• C3: ACDEHACDFHAC, ACDGHACD

[Figure: nodes ACD and HAC; edges EHAC, FHAC, GHAC, and D; artificial edge $ACD]

29

Reusing Edges

• C2: ACDEHACDFHACDGHAC

[Figure: nodes ACD and HAC; edges EHAC, FHAC, and GHAC; edge D appears twice]
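The effect of reusing an edge can be checked on the three listings from these slides. The sketch assumes k = 4 and compares only k-mer sets and total residue counts; it ignores FASTA header overhead, and `kmers` is a hypothetical helper.

```python
def kmers(sequences, k):
    """Return the set of distinct k-mers in the given sequences."""
    return {seq[i:i + k]
            for seq in sequences
            for i in range(len(seq) - k + 1)}


K = 4  # assumed; not stated on the slides
listings = {
    "original": ["ACDEHAC", "ACDFHAC", "ACDGHACD"],
    "C3": ["ACDEHACDFHAC", "ACDGHACD"],
    "C2": ["ACDEHACDFHACDGHAC"],
}
same_kmers = len({frozenset(kmers(seqs, K)) for seqs in listings.values()}) == 1
residues = {name: sum(map(len, seqs)) for name, seqs in listings.items()}
# same_kmers is True; residues == {"original": 22, "C3": 20, "C2": 17}
```

All three listings represent the same 13 4-mers, and the C2 listing is the shortest because it passes through edge D twice.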

30

C2 Enumeration

[Figure: two columns of unbalanced nodes labeled by #in-#out (1, 3, 2, 1, 3 and -2, -1, -4, -1, -2), joined by “shortcut paths” labeled 4, 7, and 10]

31

Implementation

• CSBH-graph construction
  • Determine non-trivial nodes directly
  • Consecutive non-trivial nodes determine edges

• C3/C2 enumeration
  • C3: Trivial “assignment” of artificial edges
  • C2: Depth-first search & Goldberg’s CS2 min cost flow code
  • Eulerian path algorithm

• Can be applied to entire EST database
  • Condor grid and PBS cluster for CSBH-graph construction
  • Large memory machine for C3/C2 enumeration
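The construction step can be sketched in memory as follows. A node is treated as non-trivial if it does not have exactly one predecessor and one successor; as an assumption of this sketch, the first and last (k-1)-mer of every sequence are also treated as non-trivial so that each sequence splits cleanly into whole edges. It requires k >= 2, makes no attempt at the scale of the full EST database, and `csbh_edges` is a hypothetical name.

```python
from collections import Counter, defaultdict


def csbh_edges(sequences, k):
    """Return compressed edges (spelled out as strings) with occurrence counts."""
    preds, succs = defaultdict(set), defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            succs[kmer[:-1]].add(kmer[1:])
            preds[kmer[1:]].add(kmer[:-1])
    # Branching nodes, sources, and sinks are non-trivial.
    nontrivial = {n for n in set(preds) | set(succs)
                  if len(preds.get(n, ())) != 1 or len(succs.get(n, ())) != 1}
    # Assumption of this sketch: sequence ends are non-trivial as well.
    for seq in sequences:
        if len(seq) >= k:
            nontrivial.update((seq[:k - 1], seq[-(k - 1):]))
    edges = Counter()
    for seq in sequences:
        start = 0
        for i in range(1, len(seq) - k + 2):
            if seq[i:i + k - 1] in nontrivial:
                # Consecutive non-trivial nodes delimit one compressed edge.
                edges[seq[start:i + k - 1]] += 1
                start = i
    return edges


edges = csbh_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

With k = 4 (an assumed value), the three example sequences give five edges: `ACDEF`, `DEFG`, and `EFGI` with count 2, and `DEFACG` and `EFGEFG` with count 1.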

32

Conclusions

• Peptides identify more than just proteins

• Compressed peptide sequence databases make routine EST searching feasible
  • Currently available for download
  • Can include other sources of peptide sequence at little additional cost.

• CSBH-graph + edge counts + C2/C3 enumeration algorithms
  • Minimal FASTA representation of k-mer sets

33

Acknowledgements

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Catherine Fenselau, Crystal Harvey
  • UMCP Biochemistry

• Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: National Cancer Institute
