![Page 1: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/1.jpg)
An Implementation of The Teiresias Algorithm
Na Zhao, Chengjun Zhan
![Page 2: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/2.jpg)
Outline
- Introduction to Pattern Discovery
- Basic Definitions
- Teiresias Algorithm & Implementation
  - Scan Phase
  - Convolution Phase
- Results & Conclusion
- Q & A
![Page 3: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/3.jpg)
What is Pattern Discovery?
- Pattern:
  - A region or portion of a protein sequence. It may have a specific structure and be functionally significant.
  - Proteins in the same family should share similar patterns, so the family can be characterized by them.
- Pattern Discovery:
  - Detect patterns in known protein sequences.
  - The resulting patterns can then be used to characterize unknown protein sequences.
![Page 4: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/4.jpg)
Why is Pattern Discovery Useful?
- For proteins that share biological properties, such as structural or functional features:
  - Group these protein sequences together
  - Discover a set of common sub-sequences
  - Study and observe these sub-sequences
- The detected patterns may help to classify a protein
![Page 5: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/5.jpg)
Basic Definitions
- Σ (basic alphabet set):
  - The amino acids, abbreviated by their one-letter symbols and listed in alphabetical order
  - Σ = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
- Character/equivalency group:
  - A character corresponding to a subset of Σ. Ex: A-[CJ]-G
- . (wild card or don't care):
  - A special kind of ambiguous character that matches any character in Σ. Ex: X in protein sequences and N in nucleotide sequences
![Page 6: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/6.jpg)
Basic Definitions (Cont.)
- A pattern P is an (L, W) pattern iff:
  - P is a string of characters from Σ and wild cards '.'.
  - P starts and ends with a character from Σ. Characters in Σ are called residues.
  - Any sub-pattern of P (i.e., a substring starting and ending with a character from Σ) that contains exactly L non-wildcard characters (residues) has a length of at most W.
- Ex.: for L=3 and W=5, "CD..E" is a (3, 5) pattern.
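The (L, W) condition can be checked mechanically. The following is a minimal Python sketch, not the authors' Visual C++ code. The function name `is_lw_pattern` is hypothetical, and the sketch assumes that any character other than '.' is a residue and that a pattern must contain at least L residues (the slide does not state the latter explicitly).

```python
def is_lw_pattern(pattern, L, W, wildcard="."):
    """Return True if `pattern` satisfies the (L, W) condition (sketch)."""
    # A pattern must start and end with a residue.
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False
    # Positions of the residues (non-wildcard characters).
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    # Assumption: a pattern needs at least L residues to qualify.
    if len(residues) < L:
        return False
    # Every run of L consecutive residues must span at most W characters.
    for first, last in zip(residues, residues[L - 1:]):
        if last - first + 1 > W:
            return False
    return True


print(is_lw_pattern("CD..E", 3, 5))    # True: 3 residues spanning 5 characters
print(is_lw_pattern("C...D.E", 3, 5))  # False: 3 residues spanning 7 characters
```

Checking only windows of consecutive residues is enough, because any sub-pattern with exactly L residues starts and ends at residues that are L-1 residues apart.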
![Page 7: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/7.jpg)
TEIRESIAS Algorithm
- Target:
  - Search for patterns consisting of characters of the alphabet Σ and wild cards '.'
- Basic Idea:
  - If a pattern P is an (L, W) pattern occurring in at least K sequences (K >= 2), then its sub-patterns are also (L, W) patterns occurring in at least K sequences.
  - Ex: pattern "A.BC" is more specific than "A..C", so every occurrence of "A.BC" is also an occurrence of "A..C"
- Designed for non-aligned sequences
![Page 8: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/8.jpg)
Two Phases
- Scan Phase:
  - Scan the input sequences
  - Identify all the elementary patterns that satisfy the minimum support
- Convolution Phase:
  - Sort the elementary patterns and extract useful prefixes and suffixes
  - Convolve the elementary patterns until the patterns reach their maximal length and composition
![Page 9: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/9.jpg)
Scan Phase
- Input: non-aligned sequences
- Parameters: K, L, W
- Output: a set of "Elementary Patterns", which
  - are (L, W) patterns
  - occur in at least K sequences
  - contain exactly L non-wildcards
![Page 10: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/10.jpg)
Elementary Pattern Examples
- Sequences:
  SDFBASTSLP
  LFCASTSCP
  FDASTSNP
  TGLPACTTCPTFCP
- Parameters: K=2, L=3, W=5
- Elementary patterns: AST, T.LP
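The support of the two example patterns can be checked with a short Python sketch (illustrative only, not the authors' program). The helper names `matches_at` and `support` are hypothetical; support is counted as the number of sequences that contain at least one match.

```python
def matches_at(pattern, seq, start, wildcard="."):
    """True if `pattern` matches `seq` beginning at index `start`."""
    if start + len(pattern) > len(seq):
        return False
    return all(p == wildcard or p == s for p, s in zip(pattern, seq[start:]))


def support(pattern, sequences):
    """Number of sequences containing at least one match of `pattern`."""
    return sum(
        any(matches_at(pattern, seq, i) for i in range(len(seq)))
        for seq in sequences
    )


sequences = ["SDFBASTSLP", "LFCASTSCP", "FDASTSNP", "TGLPACTTCPTFCP"]
print(support("AST", sequences))   # 3 (first three sequences)
print(support("T.LP", sequences))  # 2 (first and last sequence)
```

Both counts reach K=2, and both patterns have exactly L=3 residues within W=5 characters, which is why they qualify as elementary patterns here.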
![Page 11: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/11.jpg)
Scan Phase (Cont.)
- Empty the stack of elementary patterns (EP).
- For each letter in the alphabet, count how many sequences contain this letter.
  - If fewer than K sequences contain this letter, ignore it.
  - Otherwise, extend it until it is either ignored or accepted as an elementary pattern.
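The extend-or-ignore loop above can be sketched in Python as a recursive search. This is a sketch under assumptions, not the authors' Visual C++ code: the function names are hypothetical, the alphabet is taken to be the letters that actually appear in the input, and support is recomputed by rescanning the sequences rather than by tracking match offsets.

```python
def support(pattern, sequences, wildcard="."):
    """Number of sequences containing at least one match of `pattern`."""
    def occurs(seq):
        return any(
            all(c == wildcard or c == seq[off + i] for i, c in enumerate(pattern))
            for off in range(len(seq) - len(pattern) + 1)
        )
    return sum(occurs(seq) for seq in sequences)


def scan(sequences, K, L, W, wildcard="."):
    """Enumerate elementary patterns: exactly L residues, length <= W, support >= K."""
    alphabet = sorted(set("".join(sequences)))  # assumption: letters seen in the input
    elementary = []

    def extend(pattern, residues):
        if support(pattern, sequences) < K:
            return  # ignored: fewer than K sequences contain it
        if residues == L:
            elementary.append(pattern)  # accepted
            return
        # Append 0 or more wild cards plus one residue, keeping the length <= W.
        for gap in range(W - len(pattern)):
            for letter in alphabet:
                extend(pattern + wildcard * gap + letter, residues + 1)

    for letter in alphabet:
        extend(letter, 1)
    return elementary


eps = scan(["SDFEASTS", "LFCASTS", "FDASTSNP"], K=2, L=3, W=5)
print(sorted(eps))
```

Run on the three sequences of the example scenario later in these slides, the sketch yields the same seven elementary patterns listed there. Pruning is safe because a pattern can never have more support than the shorter pattern it was extended from.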
![Page 12: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/12.jpg)
Convolution Phase (Overall)
There is a sub-phase named pre-convolution before the convolution phase.
The goal of the convolution phase is to extend an elementary pattern with other elementary patterns.
![Page 13: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/13.jpg)
Pre-convolution Phase
- The prefix of a pattern p is the unique substring containing the first L-1 residues of p.
- The suffix of a pattern p is the unique substring containing the last L-1 residues of p.
- Example: for pattern a..c.e (with L = 3), the prefix is a..c and the suffix is c.e
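A minimal Python sketch of the two definitions (illustrative, with hypothetical function names; it assumes L >= 2 and that the pattern has at least L-1 residues):

```python
def prefix(pattern, L, wildcard="."):
    """Substring from the start of `pattern` through its (L-1)-th residue."""
    seen = 0
    for i, ch in enumerate(pattern):
        if ch != wildcard:
            seen += 1
            if seen == L - 1:
                return pattern[: i + 1]
    raise ValueError("pattern has fewer than L-1 residues")


def suffix(pattern, L, wildcard="."):
    """Substring from the (L-1)-th residue counted from the end to the end."""
    # The suffix is the mirror image of the prefix of the reversed pattern.
    return prefix(pattern[::-1], L, wildcard)[::-1]


print(prefix("a..c.e", 3))  # a..c
print(suffix("a..c.e", 3))  # c.e
```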
![Page 14: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/14.jpg)
Pre-convolution Phase (cont.)
- Given two patterns p and q, the convolution of p with q is empty if the suffix of p is not the prefix of q.
- Otherwise, it is the pattern p followed by the part of q that remains when its prefix is removed.
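The convolution operation can be sketched in a few lines of Python (illustrative, not the authors' code). The names `prefix`, `suffix` and `convolve` are hypothetical, and the empty result is represented by an empty string.

```python
def prefix(pattern, L, wildcard="."):
    """Substring covering the first L-1 residues (assumes L >= 2)."""
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    return pattern[: residues[L - 2] + 1]


def suffix(pattern, L, wildcard="."):
    """Substring covering the last L-1 residues (assumes L >= 2)."""
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    return pattern[residues[-(L - 1)]:]


def convolve(p, q, L):
    """p convolved with q, or "" if the suffix of p is not the prefix of q."""
    pre = prefix(q, L)
    if suffix(p, L) != pre:
        return ""
    return p + q[len(pre):]  # p followed by what remains of q


print(convolve("F.AS", "AST", 3))  # F.AST
print(convolve("AST", "F.AS", 3))  # empty string: "ST" is not "F.A"
```

With L = 3, the suffix of F.AS is AS, which equals the prefix of AST, so the two patterns overlap on AS and the result gains only the final T.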
![Page 15: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/15.jpg)
Pre-convolution Phase (cont.)
- The Left vector of a pattern q contains all the patterns p that can be convolved with it on the left.
- The Right vector of a pattern p contains all the patterns q that can be convolved with it on the right.
- Example (L = 3): Left[a.c.d]: ea.c, b.a.c; Right[a.c.d]: c.df
![Page 16: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/16.jpg)
Pre-convolution Phase (cont.)
- Sort the elementary patterns by the prefix-wise < order
- Construct the Left and Right vectors for each elementary pattern
![Page 17: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/17.jpg)
Prefix-wise < Examples
Example (patterns numbered in sorted order):
0 abc      5 bc..f
1 bcd      6 a.cd
2 ab.d     7 b.de
3 cd.f     8 a.c.e
4 ab..e    9 a..de
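The slides do not spell out the prefix-wise < relation. One ordering that reproduces the example above is sketched below in Python, under two stated assumptions: patterns are compared position by position with every residue ranking before the wild card, and all residues are treated as equal, so ties keep their input order. The function names are hypothetical.

```python
def prefix_wise_key(pattern, wildcard="."):
    """Sort key: 0 for a residue, 1 for a wild card, position by position."""
    return [1 if ch == wildcard else 0 for ch in pattern]


def sort_prefix_wise(patterns):
    # sorted() is stable, so patterns with the same shape keep their input order.
    return sorted(patterns, key=prefix_wise_key)


scrambled = ["a..de", "a.c.e", "a.cd", "b.de", "ab..e",
             "bc..f", "ab.d", "cd.f", "abc", "bcd"]
print(sort_prefix_wise(scrambled))
# ['abc', 'bcd', 'ab.d', 'cd.f', 'ab..e', 'bc..f', 'a.cd', 'b.de', 'a.c.e', 'a..de']
```

Under this ordering, patterns whose wild cards appear later come first, so the most densely specified prefixes are processed before looser ones.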
![Page 18: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/18.jpg)
Convolution Phase
- For each pattern t of the sorted EPs, first extend it to the left by convolving it with every member of Left[t].
- If the result r doesn't have sufficient support or is redundant, discard it.
- If r has the same number of occurrences as t, then t is non-maximal and is discarded.
![Page 19: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/19.jpg)
Convolution Phase (cont.)
- If the result r has adequate support and is not redundant, r is placed on the stack and becomes the new top.
- When the pattern at the top of the stack cannot be extended to the left any more, extend it to the right.
- When all the extensions have been done, the current top is popped from the stack. If it is maximal, it is put into the output set.
![Page 20: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/20.jpg)
An Example Scenario
Suppose there are three sequences:
  SDFEASTS
  LFCASTS
  FDASTSNP
Our goal: to find a maximal pattern, given K = 2, L = 3, W = 5
![Page 21: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/21.jpg)
An Example Scenario (cont.)
- First, the scan phase is run. The elementary patterns (EP) found are:
  AST AS.S A.TS F.AS F.A.T F..ST STS
- Then these EPs are sorted. The result is:
  AST STS AS.S A.TS F.AS F.A.T F..ST
![Page 22: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/22.jpg)
An Example Scenario (cont.)
Then, the Left and Right vectors are constructed:
Pattern   Left          Right
AST       F.AS          STS
STS       AST, F..ST    (empty)
AS.S      F.AS          (empty)
A.TS      F.A.T         (empty)
F.AS      (empty)       AST, AS.S
F.A.T     (empty)       A.TS
F..ST     (empty)       STS
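The Left and Right vectors of this example can be reproduced with a short Python sketch (illustrative only; `build_vectors` and the helper names are hypothetical, and the quadratic all-pairs comparison is chosen for clarity rather than speed).

```python
def prefix(pattern, L, wildcard="."):
    """Substring covering the first L-1 residues (assumes L >= 2)."""
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    return pattern[: residues[L - 2] + 1]


def suffix(pattern, L, wildcard="."):
    """Substring covering the last L-1 residues (assumes L >= 2)."""
    residues = [i for i, ch in enumerate(pattern) if ch != wildcard]
    return pattern[residues[-(L - 1)]:]


def build_vectors(patterns, L):
    """Left[q]: patterns whose suffix is q's prefix. Right[p]: patterns whose prefix is p's suffix."""
    left = {q: [p for p in patterns if suffix(p, L) == prefix(q, L)] for q in patterns}
    right = {p: [q for q in patterns if prefix(q, L) == suffix(p, L)] for p in patterns}
    return left, right


sorted_eps = ["AST", "STS", "AS.S", "A.TS", "F.AS", "F.A.T", "F..ST"]
left, right = build_vectors(sorted_eps, L=3)
print(left["STS"])    # ['AST', 'F..ST']
print(right["F.AS"])  # ['AST', 'AS.S']
```

Because the input list is already sorted, each vector comes out in the same prefix-wise order as the elementary patterns themselves.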
![Page 23: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/23.jpg)
An Example Scenario (cont.)
Now the convolution phase is executed, and the result is as follows:
F.ASTS
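This result can be cross-checked with a simplified Python sketch. It is deliberately not the stack-based traversal described on the earlier slides: it repeatedly convolves every discovered pattern with every elementary pattern on both sides, keeps results with enough support, and then applies a brute-force maximality filter. All function names are hypothetical.

```python
WILDCARD = "."

def residue_positions(pattern):
    return [i for i, ch in enumerate(pattern) if ch != WILDCARD]

def prefix(pattern, L):
    return pattern[: residue_positions(pattern)[L - 2] + 1]

def suffix(pattern, L):
    return pattern[residue_positions(pattern)[-(L - 1)]:]

def convolve(p, q, L):
    pre = prefix(q, L)
    return p + q[len(pre):] if suffix(p, L) == pre else ""

def occurrences(pattern, sequences):
    """All (sequence index, offset) pairs where `pattern` matches."""
    hits = []
    for s, seq in enumerate(sequences):
        for off in range(len(seq) - len(pattern) + 1):
            if all(c == WILDCARD or c == seq[off + i] for i, c in enumerate(pattern)):
                hits.append((s, off))
    return hits

def support(pattern, sequences):
    return len({s for s, _ in occurrences(pattern, sequences)})

def is_subpattern(x, y):
    """True if x fits inside y at some offset, matching every residue of x."""
    return any(
        all(c == WILDCARD or c == y[off + i] for i, c in enumerate(x))
        for off in range(len(y) - len(x) + 1)
    )

def maximal_patterns(elementary, sequences, K, L):
    found = set(elementary)
    stack = list(elementary)
    while stack:
        t = stack.pop()
        for e in elementary:
            # Try extending t to the left (e before t) and to the right (t before e).
            for r in (convolve(e, t, L), convolve(t, e, L)):
                if r and r not in found and support(r, sequences) >= K:
                    found.add(r)
                    stack.append(r)
    occ = {p: len(occurrences(p, sequences)) for p in found}
    # Drop p if a different, more specific pattern has the same number of occurrences.
    return sorted(
        p for p in found
        if not any(q != p and occ[q] == occ[p] and is_subpattern(p, q) for q in found)
    )

seqs = ["SDFEASTS", "LFCASTS", "FDASTSNP"]
eps = ["AST", "STS", "AS.S", "A.TS", "F.AS", "F.A.T", "F..ST"]
print(maximal_patterns(eps, seqs, K=2, L=3))  # ['F.ASTS']
```

The exhaustive closure does far more work than the ordered stack traversal, which is the point of sorting and of the Left and Right vectors, but on this small input both arrive at the same maximal pattern.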
![Page 24: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/24.jpg)
About Our Implementation
- The program is written in Visual C++ 6.0
- Command-line arguments let users specify the input data file and the K, L, and W values (which default to 2, 3, and 5 if not specified):
  >Teiresias <Filename> [<K>] [<L>] [<W>]
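The same command-line shape, one required file name followed by three optional positional integers, can be mimicked in Python with `argparse`. This is only an illustration of the interface, not the authors' Visual C++ argument handling; the function name `parse_args` and the file name `seqs.txt` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse: Teiresias <Filename> [<K>] [<L>] [<W>] with defaults K=2, L=3, W=5."""
    parser = argparse.ArgumentParser(prog="Teiresias")
    parser.add_argument("filename", help="input data file with the sequences")
    # nargs="?" makes each positional optional; they are filled in order.
    parser.add_argument("K", nargs="?", type=int, default=2,
                        help="minimum number of sequences a pattern must occur in")
    parser.add_argument("L", nargs="?", type=int, default=3,
                        help="number of residues in an elementary pattern")
    parser.add_argument("W", nargs="?", type=int, default=5,
                        help="maximum length of a window with L residues")
    return parser.parse_args(argv)


args = parse_args(["seqs.txt"])  # hypothetical file name
print(args.filename, args.K, args.L, args.W)  # seqs.txt 2 3 5
```

Because the optional values are positional, a user who wants to change W must also supply K and L, just as in the usage line above.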
![Page 25: An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan](https://reader035.vdocuments.net/reader035/viewer/2022062221/56649ee95503460f94bfac56/html5/thumbnails/25.jpg)
Questions?