gemoda

Download Gemoda

If you can't read please download the document

Upload: kyle-jensen

Post on 19-Jun-2015

837 views

Category:

Documents


1 download

TRANSCRIPT

  • 1. A generic motif discovery algorithm for diverse biomolecular data Kyle Jensen Gregory Stephanopoulos Department of Chemical Engineering Massachusetts Institute of Technology

2. Motif discovery is the automated search for similar regions in streams of data

  • Un-sequential data
  • No ordering

Sequential data

  • A natural ordering of the data
  • Nucleotide and amino acid sequences

3. Stock prices, protein structures MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA A motif is just a collection of mutually similar regions in the data stream 4. There are two classes of motif discovery tools commonly used for sequence analysis

  • Exhaustive regular-expression based tools
  • Teiresias

5. Pratt Descriptive position weight matrix-based tools

  • Gibbs sampler

6. MEME 7. Consensus TGCTGTATATACTCACAGCA AACTGTATATACACCCAGGG TACTGTATGAGCATACAGTA ACCTGAATGAATATACAGTA TACTGTACATCCATACAGTA TACTGTATATTCATTCAGGT AACTGTTTTTTTATCCAGTA ATCTGTATATATACCCAGCT TACTGTATATAAAAACAGTA CT[AT].[GT]....A..CAG 8. Gemoda was designed to be exhaustive and have descriptive power

  • Gemoda exhaustively returns maximal motifs
  • Uses convolution of Teiresias
  • Way of stiching together smaller patterns combinatorially

Gets descriptiveness from similarity metric

  • Generic, context dependent definition of similarity

MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA F(w 1 , w 2 ) = square error F(w 1 , w 2 ) = aa scoring matrix 9. Gemoda proceeds in three steps: comparison, clustering, and convolution Jensen, K., Styczynski,M., Rigoutsos,I. and Stephanopoulos,G. (2005) A generic motif discovery algorithm for sequential data.Bioinformatics, in press 10. The comparison stage is used to map the pairwise similarities between all windows in the data streams

  • Creates an distance matrix
  • Does an all-by-all comparison of windows in the data

11. Comparison function is context-specific F(w 1 , w 2 ) 12. The clustering phase is used to find groups of mutually similar windows

  • Different clustering functions have different uses
  • Clique-finding is provably exhaustive

13. K-means and other methods are faster Output clusters become elementary motifs which are convolved to make longer, maximal motifs 14. The convolution phase is used to stitch together the clusters into maximal motifs

  • The motifs should be as long as possible, without decreasing the support

elementary motifs (clusters) window ordering 15. Here we show a few representative ways in which Gemoda can be used Motif discovery in...

  • Protein sequences
  • (ppGpp)ase enzymes & finding known domains

DNA sequences

  • The LD-motif challenge problem

Protein structures

  • Conserved structures without conserved sequences

16. Gemoda can be applied to amino acid sequences as well

  • Example: (ppGpp)ase family from ENZYME database
  • Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes
  • EC 3.1.7.2

17. Ave. length ~700 amino acids 18. 8 sequences from 8 species Searched using Gemoda

  • Minimum length = 50 amino acids

19. Minimum Blosum62 bit score = 50 bits 20. Minimum support = 100% (8/8 sequences) 21. Clustering method = clique finding Can Gemoda find this known motif? How sensitive is Gemoda to noise? 22. (ppGpp)ase example: the comparison phase shows many regions of local similarity Dots indicate 50aa windows that are pairwise similar Streaks indicate regions that will probably be convolved into a maximal motif 23. (ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences 24. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database Maximal motif (one of three, ~100 aa in length) This particular cluster represents the first set of 8 50aa windows in the above motif. Results are insensitive to noise 25. The LD-motif problem models the subtle binding site discovery problem GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Pevzner & Sze, Proc. ISMB, 2000 26. Gemoda can solve both the LD-motif problem and a more generalized version of the same GG GACTCGATAGCGACG CCG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Total motif length ? Styczynski,M., Jensen,K., Rigoutsos,I. and Stephanopoulos,G. (2004) An extension and novel solution to the Motif Challenge Problem. Genome Informatics, 15 (2). 27. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG X All sequences ? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... 28. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG Number of mutations ? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... 29. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTA TATCTGGTTCGACTT AGCTATCTATTCGAC GACTCG TGG GCG G CG ...... Sequence #m: ATGCTAC TATCTTATTCGACTG AGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATGACTAGTGACT... Number of unique motifs ? 30. Gemoda can also be applied to protein structures

  • Treat protein structure as alpha-carbon trace
  • Series of x,y,z coordinates
  • Use a clustering function that compares x,y,z windows
  • Root mean square deviation (RMSD)

31. unit-RMSD x 1 y 1 z 1 x 2 y 2 z 2 x 3 y 3 z 3 ........................... x M y M z M 32. Protein structure example: human FIT vs. uridylyltransferase 33. Questions? 34. The Gemoda algorithm has guarantees of maximality and exhaustiveness

  • Maximality
  • Motifs are as long as possible

35. Motifs are as specific as possible 36. Motifs are not missing an occurrences

  • Exhaustiveness
  • All maximal motifs are found

37. No non-maximal motifs are found = motif1 = motif2 38. Gemoda may be used with nucleotide sequences to find regulatory motifs

  • The LD-motif problem: an example for the board

>AACTG >AATTA >AATTG Look for motifs of at least 3 nucletides with a Hamming distance between any window of 3 of 1 or less given: 1 = AAC 2 = ACT 3 = CTG 4 = AAT 5 = ATT 6 = TTA 7 = AAT 8 = ATT 9 = TTG We get the following windows: 39. A simple natural language example

  • Choosing a window length of L=4 gives 7 unique windows in the three sequences

Seq 1: motif Seq 2: motor Seq 3: potion 40. Here we show the comparison phase using two different similarity metrics

  • X's and dotted lines
  • Identify matrix:

O's and solid lines

  • Consonant/vowel matrix:

Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Input sequences Seq 1: motif Seq 2: motor Seq 3: potion Similarity graph 41. The clustered windows (elementary motifs) are different depending on the similarity function

  • Clustering phase
  • Clique-finding

42. Support >= 2 Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Cluster 1 1: moti 3: moto 5: poti Cluster 2 2: otif 4: otor 6: otio Cluster 1 1: moti 3: moto Cluster 2 1: moti 5: poti Cluster 3 2: otif 6: otio Solid lines (vowel/cons): Dotted lines (identity): 43. Likewise, the final, convolved motifs depend on the similarity function choice Motif 1 motif motor potio Seq 1: motif Seq 2: motor Seq 3: potion Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Vowel/cons: Motif 1 motif potio Motif 2 moti moto Identity: Cluster 1 1: moti 3: moto 5: poti Cluster 2 2: otif 4: otor 6: otio Cluster 1 1: moti 3: moto Cluster 2 1: moti 5: poti Cluster 3 2: otif 6: otio Vowel/cons: Identity: