repeats and palindromes: an overview · 2015-02-17 · outline introductory concepts tandem repeats...

Outline
Introductory Concepts
Tandem Repeats
Palindromes
Research Directions

Repeats and Palindromes: an Overview

Sara H. Geizhals
Advisor: Dina Sokol

The Graduate Center

30 July 2014

Outline

Introductory Concepts
    KMP
    Suffix Tree
Tandem Repeats
    Exact Tandem Repeat
    Approximate Tandem Repeat
        Group 1: neighboring - k-repetition
        Group 1: neighboring - k-run
        Group 1: neighboring - k-edit repeat
        Group 2: consensus - consensus model
        Group 2: consensus - k-MAR
    Related Research
Palindromes
    Linear-time Algorithm #1 for (Exact) Palindromes
    Linear-time Algorithm #2 for (Exact) Palindromes
    Variations of Palindromes
    Approximate Palindromes in RLE-compressed Data
Research Directions

KMP

Pattern matching problem:

- Let Σ be an alphabet.
- INPUT:
    - string text, of length n: T = t_1 t_2 ... t_n
    - string pattern, of length m: P = p_1 p_2 ... p_m
    - t_i, p_i ∈ Σ
- OUTPUT: all positions i in T where there is an occurrence of P, i.e., T[i + k − 1] = P[k] for 1 ≤ k ≤ m
- Example: P = ababc in T = cababccdddddababccc
- Example: P = aba in T = cabaabadddababac

- Naive solution, O(nm): since each text position is a possible pattern start, compare the pattern against the text at every position.
- Better solution: KMP (Knuth, Morris, and Pratt, 1977)
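As a baseline for comparison, the naive solution can be sketched in a few lines of Python. The function name `naive_search` and the choice to report 1-based start positions are assumptions made for this illustration; the two calls reuse the example strings from the problem statement.

```python
def naive_search(text, pattern):
    """Return the 1-based start positions of pattern in text.

    Every text position is tried as a possible start, so the worst
    case is O(n * m) character comparisons.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        # Compare the pattern against the text window at this start.
        if text[start:start + m] == pattern:
            hits.append(start + 1)
    return hits


print(naive_search("cababccdddddababccc", "ababc"))  # [2, 13]
print(naive_search("cabaabadddababac", "aba"))       # [2, 5, 11, 13]
```

The second example shows overlapping occurrences (positions 11 and 13), which the naive scan finds only by restarting the comparison at every position.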

Stage 1: O(m) to construct KMP automaton for P

Figure: KMP automaton for P of ababc

- Each of the m + 1 nodes has a success arc (to the character on the right) and a failure arc (to the border)
- Border of X = longest substring that is both a proper prefix and a proper suffix of X
- Example: the border of abab is ab, the border of ababa is aba, and the border of ababc is the empty string
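The failure arcs can be stored as an array of border lengths, one per prefix of P. The sketch below is one common way to compute that array in O(m); the name `border_lengths` is an assumption of this example, not notation from the slides.

```python
def border_lengths(pattern):
    """fail[i] = length of the border of pattern[:i + 1].

    The border is the longest string that is both a proper prefix and
    a proper suffix. Each border is built from the border of the
    previous prefix, which keeps the total work linear in len(pattern).
    """
    fail = [0] * len(pattern)
    k = 0  # length of the border of the prefix processed so far
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


print(border_lengths("ababc"))  # [0, 0, 1, 2, 0]
print(border_lengths("ababa"))  # [0, 0, 1, 2, 3]
```

The entries for abab (2), ababa (3), and ababc (0) match the borders ab, aba, and the empty string given in the example above.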

Stage 2: O(n) to go through characters of T to look for P

- If a character in T matches the first character of P, the algorithm starts at the first node of the automaton
- For each character match, the automaton follows the success arc
- For each character mismatch, the automaton follows the failure arc
- If the automaton reaches its last node, then P is in T
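The scan over T can be sketched as follows, with the automaton represented as an array of border lengths rather than explicit nodes and arcs. This array form, the name `kmp_search`, and the 1-based output positions are choices made for this illustration; `state` plays the role of the current automaton node.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text in O(n + m)."""
    m = len(pattern)
    if m == 0:
        return []

    # Stage 1: failure arcs, stored as border lengths of each prefix.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    # Stage 2: one pass over the text; state = number of matched characters.
    hits = []
    state = 0
    for pos, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]      # mismatch: follow the failure arc
        if ch == pattern[state]:
            state += 1                   # match: follow the success arc
        if state == m:                   # last node reached: occurrence found
            hits.append(pos - m + 2)
            state = fail[state - 1]      # keep going to find overlaps
    return hits


print(kmp_search("cababccdddddababccc", "ababc"))  # [2, 13]
print(kmp_search("cabaabadddababac", "aba"))       # [2, 5, 11, 13]
```

The text pointer never moves backwards, which is where the O(n) bound for this stage comes from.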

Example: P = ababc, T = cabab...

If T's next character is:

- c: success arc to node 5, and P is in T
- a: failure arc to node 2 (border of abab)
- anything else

Suffix Tree

Indexing problem:

- Same as the pattern matching problem, except that any number of pattern strings are searched for in one text string
- Therefore, queries should be answered in time proportional to the (short) patterns P, rather than the (long) text T

Figure: ST for T of xabxac

Stage 1 of solution for indexing problem: construct the suffix tree (ST), which is a compacted trie of all suffixes of T

- O(n) construction, when the alphabet size is constant, as per Weiner (1973), McCreight (1976), and Ukkonen (1995)

Stage 2 of solution for indexing problem: O(m) to decide whether P is in T:

- Start at the root
- Use the characters of P to follow a path in the ST
- If the walk gets stuck, then P is not in T; if it does not get stuck, then P is in T
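The query walk can be illustrated with a deliberately simple stand-in for the suffix tree: an uncompacted suffix trie built from nested dictionaries. This is an assumption made to keep the sketch short. Its construction takes quadratic time and space, unlike the linear-time constructions cited above, but the query procedure is the same root-to-node walk. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Walk from the root along the characters of pattern: O(m) steps."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False   # the walk got stuck
        node = node[ch]
    return True            # the whole pattern was consumed


trie = build_suffix_trie("xabxac")
print(contains(trie, "xab"))  # True
print(contains(trie, "xaa"))  # False
```

Once the index is built, each query costs time proportional to the pattern length only, regardless of how long the text is.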

T = xabxac

- P = xab is in T
- P = xaa is not in T

Figure: ST for T of xabxac

- Use a Generalized Suffix Tree (GST) to index several texts
- Numbers at a leaf = text, starting position of the suffix in that text

Figure: GST for T1 = xabxa$ and T2 = babxba$

- LCP(v, w) = length of the longest common prefix of strings v and w
- LCA (lowest common ancestor) of tree nodes v and w = lowest node with v and w as descendants
- Harel and Tarjan (1984): preprocess a tree with n nodes in O(n) time; following that, LCA queries can be answered in constant time
- The solution to the LCP problem on strings is the same as the solution to the LCA problem on the ST (to be shown)
- Therefore, the ST solves LCP with the same runtime

The solution to the LCP problem on strings is the same as the solution to the LCA problem on the ST:

- LCA(6, 9) = 10; LCP(babba$, bba$) = b
- LCA(1, 2) = 3; LCP(ababba$, abba$) = ab
- LCA(5, 9) = root; LCP(a$, bba$) = empty string

Figure: ST for T of ababba$
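The three LCP values above can be checked by direct character comparison. The sketch below is only a verification aid: it takes time proportional to the length of the answer, whereas the suffix tree with LCA preprocessing answers each query in constant time. The function name `lcp` is an assumption, and it returns the prefix itself so that the output can be compared with the examples.

```python
def lcp(v, w):
    """Return the longest common prefix of v and w by direct comparison."""
    length = 0
    for a, b in zip(v, w):
        if a != b:
            break
        length += 1
    return v[:length]


text = "ababba$"
print(repr(lcp(text[2:], text[4:])))  # 'b'   (babba$ vs bba$)
print(repr(lcp(text[0:], text[3:])))  # 'ab'  (ababba$ vs abba$)
print(repr(lcp(text[5:], text[4:])))  # ''    (a$ vs bba$)
```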

Tandem Repeats

- DNA contains consecutive copies of a pattern, e.g. cagcagcag
- Known as a tandem repeat (TR)
- Identification of TRs is useful
    - Linked to over thirty hereditary disorders in humans
    - Used as genetic markers in DNA fingerprinting, mapping genes, comparative genomics, and evolution studies
    - Used in population studies and conservation biology

Exact Tandem Repeat

- A string r = a_1 ... a_n is periodic if there exists some p such that a_i = a_{i+p} for all i with 1 ≤ i and i + p ≤ n
- r = u^h u′, where u′ is a prefix of u
- u or |u| = period of r
- abc abc abc abc vs. abcabc abcabc
    - h = 4, u = abc, |u| = 3 vs. h = 2, u = abcabc, |u| = 6

- Primitive
    - If not periodic: e.g., abcd
    - If periodic: last period is partial, e.g., abcdabcdab
- The period = the primitive period
    - abc abc abc abc, not abcabc abcabc
    - h = 4, u = abc, |u| = 3, not h = 2, u = abcabc, |u| = 6
- A periodic string is also called a repetition or exact tandem repeat (ETR), or a repeat for short.
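The primitive (smallest) period can be computed from the border of the whole string: if the longest border of r has length b, then the smallest p with a_i = a_{i+p} is |r| − b. The sketch below relies on that standard fact; the function name `smallest_period` is an assumption of this example.

```python
def smallest_period(s):
    """Smallest p such that s[i] == s[i + p] wherever both indices exist."""
    if not s:
        return 0
    # Border lengths of every prefix, as in the KMP failure function.
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return len(s) - fail[-1]


for r in ("abcabcabcabc", "abcdabcdab", "abcd"):
    p = smallest_period(r)
    print(r, "period", p, "u =", r[:p], "h =", len(r) // p)
```

For abcabcabcabc this reports period 3 with h = 4, not period 6 with h = 2. For abcd the period equals the string length, so there is no repetition.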

- r = caaaad or cababababad
    - Periodic in all characters but the first and last
- Search r for repeats, in particular for maximal repeats = the period cannot be extended right or left by even one character
- r = caaaad:
    - repeat aaaa is maximal
    - repeats aa and aaa are not maximal
- The repeat = the maximal repeat
- Algorithms to find repeats in a string r of length n:
    - O(n log n): Crochemore (1981), Apostolico (1983), Main and Lorentz (1984)
    - O(n): KK = Kolpakov and Kucherov (1999)
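To make the notion of a maximal repeat concrete, here is a brute-force sketch. It is not any of the cited algorithms and runs far slower than O(n log n); it simply tries every period p, finds maximal stretches where r[i] = r[i + p], and keeps those that span at least two periods and whose smallest period is p. The function names and the (start, end, period) output with 0-based half-open indices are assumptions of this example.

```python
def smallest_period(s):
    """Smallest period of s, via the border of the whole string."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return len(s) - fail[-1] if s else 0


def maximal_repeats(r):
    """Brute-force list of maximal repeats as (start, end, period)."""
    n = len(r)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if r[i] != r[i + p]:
                i += 1
                continue
            # Extend the stretch of matches r[j] == r[j + p] to the right.
            j = i
            while j + p < n and r[j] == r[j + p]:
                j += 1
            start, end = i, j + p          # r[start:end] has period p
            if end - start >= 2 * p and smallest_period(r[start:end]) == p:
                found.append((start, end, p))
            i = j + 1
    return sorted(found)


for r in ("caaaad", "cababababad"):
    for start, end, p in maximal_repeats(r):
        print(r, "->", r[start:end], "period", p)
```

For caaaad only aaaa is reported. The shorter repeats aa and aaa are skipped because they can be extended.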

Approximate Tandem Repeat

- DNA has mutations, translocations, reversals, etc.
- Approximate tandem repeat, or ATR
- Metrics include:
    - Hamming distance, counted in mismatches
        - one character substituted by another character
        - 1 mismatch between abc and abd
    - Edit distance, counted in errors
        - one character substituted by another character, or one character inserted or deleted
        - 2 errors between abc and abdd
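Both metrics are short to implement. The sketch below uses the textbook dynamic program for edit distance with unit costs for substitution, insertion, and deletion; the function names are assumptions of this example.

```python
def hamming_distance(a, b):
    """Number of mismatching positions; defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))       # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


print(hamming_distance("abc", "abd"))  # 1
print(edit_distance("abc", "abdd"))    # 2
```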

Search for approximate tandem repeat

Types:

- Group 1: neighboring - k-repetition, k-run, k-edit repeat
    - Each occurrence is similar to adjacent (or neighboring) occurrences
- Group 2: consensus - consensus model, k-MAR
    - Each occurrence is similar to a consensus string

Group 1: neighboring - k-repetition

Definition (KK, 2003): A string r[1 . . . n] is called a k-repetition of period p, p ≤ n/2, iff the Hamming distance between r[1 . . . n − p] and r[p + 1 . . . n] is ≤ k.

- ataa atta ctta ct is a two-repetition of period four (a total of two mismatches, when comparing each occurrence to the next occurrence)
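The definition translates directly into a check that shifts the string by p against itself and counts mismatches. The function names below are assumptions of this sketch, and indices are 0-based rather than the 1-based indices of the definition.

```python
def repetition_mismatches(r, p):
    """Hamming distance between r[1..n-p] and r[p+1..n] (1-based notation)."""
    return sum(1 for i in range(len(r) - p) if r[i] != r[i + p])


def is_k_repetition(r, p, k):
    """True if r is a k-repetition of period p under the definition above."""
    return p <= len(r) / 2 and repetition_mismatches(r, p) <= k


r = "ataaattacttact"                  # ataa atta ctta ct
print(repetition_mismatches(r, 4))    # 2
print(is_k_repetition(r, 4, 2))       # True
print(is_k_repetition(r, 4, 1))       # False
```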

Group 1: neighboring - k-run

Definition (KK, 2003): A string r[1 . . . n] is called a k-run of period p, p ≤ n/2, iff for every i ∈ [1 . . . n − 2p + 1], the Hamming distance between r[i . . . i + p − 1] and r[i + p . . . i + 2p − 1] is ≤ k.

One-run of period four (up to one mismatch, when comparing each occurrence to the next occurrence):

- ataa atta ctta ggta ag - four mismatches
- ataa atta atta atta at - one mismatch
- Total number of mismatches unbounded
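A direct check of the k-run definition is sketched below. It tests every window of length 2p, as the definition requires, which is stricter than comparing only the aligned occurrences. The function name `is_k_run` is an assumption, and the quadratic window scan is chosen for clarity rather than speed.

```python
def is_k_run(r, p, k):
    """True if every length-2p window of r has at most k mismatches
    between its first half and its second half."""
    n = len(r)
    if p > n / 2:
        return False
    for i in range(n - 2 * p + 1):
        mismatches = sum(1 for j in range(p) if r[i + j] != r[i + p + j])
        if mismatches > k:
            return False
    return True


r = "ataaattaattaattaat"       # ataa atta atta atta at
print(is_k_run(r, 4, 1))       # True
print(is_k_run(r, 4, 0))       # False
```

Because the bound applies per window, a longer string can accumulate more mismatches in total while still qualifying.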

Group 1: neighboring - k-edit repeat

Definition (Sokol 2007, Sokol 2013): A string r is called a k-edit repeat if it can be partitioned into consecutive substrings, r = u′′ u_1 u_2 ... u_l u′, l ≥ 2, such that the sum of the edit distance errors of each occurrence with its neighbor does not exceed k.

- k-edit repeat = k-repetition, except that it uses edit distance
- caagct ccagct ccgc is a two-edit repeat
    - ccgc is a partial period, so it is not counted as a deletion of t
    - with edit distance, the period size is not fixed and so it is not given
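The error count for a given partition can be sketched as follows. Several choices here are assumptions of the illustration rather than part of the cited definition: the partition is supplied by the caller instead of being searched for, a leading partial unit u′′ is not handled, and a trailing partial unit is compared against the best-matching prefix of its neighbor so that the missing tail is not charged as deletions.

```python
def edit_row(a, b):
    """Return a list whose entry j is the edit distance of a and b[:j]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev


def neighbor_errors(units, last_is_partial=True):
    """Sum of edit distances between each unit and the unit before it."""
    total = 0
    for idx in range(1, len(units)):
        row = edit_row(units[idx], units[idx - 1])
        is_last = idx == len(units) - 1
        if is_last and last_is_partial:
            total += min(row)      # best prefix of the previous unit
        else:
            total += row[-1]       # the whole previous unit
    return total


print(neighbor_errors(["caagct", "ccagct", "ccgc"]))  # 2
```

One error comes from caagct versus ccagct (a substitution), and one from ccgc versus the prefix ccagc (a deletion of a).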

Group 2: consensus - consensus model

Definition (Sagot and Myers, 1998): Each occurrence is called a wagon, and the sequence of wagons is a train. A prefix model is when the total number of edit distance errors, from each wagon to the consensus string, does not exceed k. Jumps of a certain size are allowed (a jump is from the left of one wagon to the left of the next wagon). A consensus model is a prefix model with an additional allowance: gaps of a certain size are allowed (a gap is from the right of one wagon to the left of the next wagon).

- Search for trains in satellite DNA (a type of DNA that is located near the region of the centromere and is involved in chromosome segregation during mitosis and meiosis)
- Algorithm outputs any number of trains, where all of the following hold true:
    - there are at least min_repeat wagons in a train, and each wagon's length is between min_range and max_range (where min_repeat, min_range, and max_range are inputs)
    - there are at most k = 15-20% edit distance errors from each wagon to the consensus string (where k is an input)
    - jumps and gaps are of allowed sizes

Stage one: filtering = which substrings likely contain repeats

- Sum the values in the upper right region of the local alignment dynamic programming matrix (Smith-Waterman, 1981)
- ETRs tend to have a high sum, e.g., gacgac: sum of 20
- non-ETRs tend not to, e.g., gtcagt: sum of 9

          g  a  c  g  a  c
       0  0  0  0  0  0  0
    g  0  1  0  0  1  0  0
    a  0  0  2  1  0  2  1
    c  0  0  1  3  2  1  3
    g  0  1  0  2  4  3  2
    a  0  0  2  1  3  5  4
    c  0  0  1  3  2  4  6

          g  t  c  a  g  t
       0  0  0  0  0  0  0
    g  0  1  0  0  0  1  0
    t  0  0  2  0  0  0  2
    c  0  0  0  3  1  0  0
    a  0  0  0  1  4  2  0
    g  0  1  0  0  2  5  3
    t  0  0  2  0  0  3  6
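The two sums can be reproduced by adding the entries strictly above the main diagonal of each matrix. The sketch below copies the matrix values from the slide instead of recomputing the alignment, since the scoring parameters are not stated; reading "upper right region" as the strict upper triangle is an assumption that happens to give the quoted totals. The function name `upper_right_sum` is hypothetical.

```python
def upper_right_sum(matrix):
    """Sum of the entries strictly above the main diagonal."""
    return sum(value
               for i, row in enumerate(matrix)
               for j, value in enumerate(row)
               if j > i)


# Matrices as shown above, including the leading row and column of zeros.
gacgac = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 2, 1, 0, 2, 1],
    [0, 0, 1, 3, 2, 1, 3],
    [0, 1, 0, 2, 4, 3, 2],
    [0, 0, 2, 1, 3, 5, 4],
    [0, 0, 1, 3, 2, 4, 6],
]
gtcagt = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 2, 0, 0, 0, 2],
    [0, 0, 0, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 2, 0],
    [0, 1, 0, 0, 2, 5, 3],
    [0, 0, 2, 0, 0, 3, 6],
]

print(upper_right_sum(gacgac))  # 20
print(upper_right_sum(gtcagt))  # 9
```

The repeat gacgac aligns with a shifted copy of itself, which produces a second high-scoring diagonal above the main one. That off-diagonal mass is what the filter measures.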

Stage two: brute force search of substrings that were not filtered

- Set L_m stores the left-end positions of wagons known to be valid for consensus string m
- Attempt to find another valid consensus string by taking this previously good consensus string m and inserting to the left, substituting on the left, or deleting on the left

Group 2: consensus - k-MAR

Definition (Amit, 2013): A k-maximal approximate run (or k-MAR) is a non-empty string r, such that the modification of at most k characters in r generates a maximal (exact) run. A modification is a substitution of a single character with a different single character.

- aba abc aba - one-MAR of period three and three occurrences
- ab ab ab ac ac ab - two-MAR of period two and six occurrences
- k-MAR is consensus, since the string will be modified in order to match the consensus string
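For a fixed period p, the fewest substitutions that make a string exactly periodic can be counted column by column: all positions with the same index modulo p must end up equal, so each column keeps its most frequent character. The sketch below computes only that count. It does not test maximality of the resulting run, and the function name `min_modifications` is an assumption.

```python
from collections import Counter


def min_modifications(r, p):
    """Fewest substitutions needed to give r exact period p."""
    total = 0
    for residue in range(p):
        column = r[residue::p]          # positions congruent to residue mod p
        if column:
            most_common = max(Counter(column).values())
            total += len(column) - most_common
    return total


print(min_modifications("abaabcaba", 3))      # 1  (aba abc aba)
print(min_modifications("abababacacab", 2))   # 2  (ab ab ab ac ac ab)
```

The majority characters of the columns spell out the consensus string, which is why the k-MAR is grouped with the consensus definitions.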

Figure: Alternative Grouping of ATRs

1. Hamming distance mismatches vs. edit distance errors

2. Whether, overall, all occurrences are similar to all others, or whether the string is evolutive = only adjacent occurrences must be similar (occurrences may evolve so much that far-apart occurrences bear little resemblance)

- Consensus model: occurrences must be somewhat similar, as all occurrences are similar to the consensus string
- k-MARs are unusual: even though grouped as consensus in the original grouping, they do not require similar occurrences (if all k mismatches fall in a few occurrences, these few occurrences may be dissimilar to the other occurrences)

3. Whether the counting of variations is a total count (count over all occurrences) or an individual count (counter is reset for each occurrence)

4. Σv = total number of variations: bounded (Σv does not exceed k = fixed number known in advance) or not (Σv depends on the number of occurrences, so Σv has no bound known in advance)

- The above two seem similar, but are different
    - k-repetition, k-edit repeat, and k-MAR: total count, Σv bounded by k
    - k-run: individual count, Σv unbounded (k mismatches allowed for each occurrence, so Σv unbounded)
    - Consensus model is unusual: individual counter (counter is reset), yet Σv bounded (algorithm works when ≤ 15-20% edit distance errors between input string and consensus string)

Related Research

Nested Tandem Repeat (NTR) = two distinct TRs, where one is nested within the other, e.g., an NTR of the form XxxxXxxxxXxxXxxXxXxxx, such as:

CTTGAGATT ACAT ACAT ACAT
CTTGAGATT ACAT ACAT ACAT ACAT
CTTGAGATT ACAT ACAT
CTTGAGATT ACAT ACAT
CTTGAGATT ACAT
CTTGAGATT ACAT ACAT ACAT

(Amir, 2010 and Eisenberg, 2012) Period Recovery Problem:

- String r is periodic, but unknown.
- INPUT: string r′ = r corrupted with up to one edit distance error per occurrence, on average.
- OUTPUT: What is the period of r? Identify the candidate set.
- Example: r′ = abcabdabeabcabcabe:
    - abcabdabe (abcabdabe abcabcabe)
    - abcabcabe (abcabdabe abcabcabe)
    - abc (abc abd abe abc abc abe)
- Example: r′ = abbabdacc:
    a. abc (abb abd acc)
    b. abd (abb abd acc)
    c. abb (abb abd acc)
    d. not acc (abb abd acc)

Palindromes

String r = u a u^R is a palindrome, where u^R is the reverse of u

- radius of palindrome = length of one arm
- diameter of palindrome = sum of lengths of arms

- The palindrome refers to the maximal one
    - r = abababa
    - the palindrome is abababa
    - not maximal: aba
- Search a string for substrings that are palindromes
- One common approach is to find the longest palindrome centered at each position

Linear-time Algorithm #1 for (Exact) Palindromes

- Manacher (1975): first linear-time algorithm to find (exact) palindromes

position:  1 2 3 4 5 6 7 8 9 10 11 12
character: b a b c b a b c b a c c
algorithm: 0 1 3 1 7 1 9 1 5 1 1 1 1 0

Table: Longest palindrome centered at each position

- Intuition is that if r = u a u^R, then the lengths of palindromes centered at positions of the left arm, u, will mostly be a mirror image of the lengths of palindromes centered at positions of the right arm, u^R. Therefore, calculations of values in right arms are based on previously-calculated values of left arms.
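The mirror-image idea can be sketched as follows. This version is a simplification chosen for the illustration: it handles only palindromes centered on a character (odd length) and omits the separator trick usually used to cover even-length palindromes as well. The function name `odd_palindrome_lengths` is an assumption.

```python
def odd_palindrome_lengths(s):
    """Length of the longest palindrome centered at each character of s."""
    n = len(s)
    radius = [0] * n          # characters matched on each side of position i
    center = right = 0        # rightmost palindrome found so far: center, right end
    for i in range(n):
        if i < right:
            # Reuse the value of the mirror position inside the known palindrome.
            radius[i] = min(radius[2 * center - i], right - i)
        # Extend by direct comparison beyond what the mirror guarantees.
        while (i - radius[i] - 1 >= 0 and i + radius[i] + 1 < n
               and s[i - radius[i] - 1] == s[i + radius[i] + 1]):
            radius[i] += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    return [2 * r + 1 for r in radius]


print(odd_palindrome_lengths("babcbabcbacc"))
# [1, 3, 1, 7, 1, 9, 1, 5, 1, 1, 1, 1]
```

The right end of the rightmost palindrome only ever moves forward, which bounds the total number of character comparisons by O(n).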

Linear-time Algorithm #2 for (Exact) Palindromes

• Create, in linear time, GST of r and of r^R. Preprocess in order to be able to answer LCA query (which, for strings, is LCP query) in constant time.

• For each position, perform type of LCP query called longest common extension (or LCE) query (see table below).

• LCE query for position x = how many positions on right of x match positions on left of x.

• If any LCP query results in non-zero value, then there is palindrome, with radius size of that value, centered there.

LCP queries:

• abccba: palindrome with radius three (diameter six)

• abbaba: palindrome with radius two (diameter four)

r     abccba   bccba   ccba   cba   ba     a
r^R   abccba   a       ba     cba   ccba   bccba
LCP            0       0      3     0      0

r     abbaba   bbaba   baba   aba   ba     a
r^R   ababba   a       ba     bba   abba   babba
LCP            0       2      0     0      0

• cabbad: palindrome with radius two (diameter four)

• abcdba: no palindrome, since every LCP query returns zero

r     cabbad   abbad   bbad   bad   ad     d
r^R   dabbac   c       ac     bac   bbac   abbac
LCP            0       0      2     0      0

r     abcdba   bcdba   cdba   dba   ba     a
r^R   abdcba   a       ba     cba   dcba   bdcba
LCP            0       0      0     0      0
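
The LCP rows in the tables above can be reproduced with a short Python sketch. It is only an illustration under assumptions: each LCE query is answered by a character-by-character scan rather than by a constant-time LCA query on a preprocessed GST, so it is quadratic in the worst case, and the names `lcp` and `lce_row` are hypothetical. Because each query compares a suffix with a reversed prefix, it reports palindromes centered between two positions, that is, of even length.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    length = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        length += 1
    return length


def lce_row(r):
    """One LCE value per split point x = 1 .. len(r) - 1.

    Each value compares the suffix r[x:] with the reversed prefix r[:x],
    i.e. the characters to the right of the split against those to its left.
    """
    return [lcp(r[x:], r[:x][::-1]) for x in range(1, len(r))]


for example in ("abccba", "abbaba", "cabbad", "abcdba"):
    print(example, lce_row(example))
```

A non-zero entry at split x gives the radius of the palindrome centered between positions x and x + 1.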

Palstar

Palstars = concatenation of palindromes

• KMP (1977): concatenation of even-length palindromes: a string r is a palstar iff r = a_1 ... a_n, where each a_i, 1 ≤ i ≤ n, is a palindrome of even length.

  • Example: string of length zero, abba, abbacc, not abaabba

• Galil (1978): a string r is a palstar iff r = a_1 ... a_n, where each a_i, 1 ≤ i ≤ n, is a palindrome, of any length (even or odd), and |r| ≥ 2.

  • Example: not string of length zero, not abba, abbacc, abaabba
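
The KMP (1977) even-length definition can be checked with a small dynamic program, sketched below in Python. This is a brute-force illustration under assumptions, not the linear-time method from the literature, and the name `is_even_palstar` is hypothetical. It covers only the even-length definition, not Galil's variant.

```python
def is_even_palstar(r):
    """True if r is a concatenation of zero or more even-length palindromes."""
    n = len(r)
    ok = [False] * (n + 1)   # ok[i]: the prefix r[:i] is an even palstar
    ok[0] = True             # the string of length zero qualifies
    for end in range(2, n + 1, 2):
        for start in range(0, end, 2):
            piece = r[start:end]          # candidate last palindrome, even length
            if ok[start] and piece == piece[::-1]:
                ok[end] = True
                break
    return ok[n]


for example in ("", "abba", "abbacc", "abaabba"):
    print(repr(example), is_even_palstar(example))
```

The four strings are the ones from the KMP example line: the first three are accepted and abaabba is rejected.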

Gapped Palindrome

(KK, 2008) Gapped palindrome = u a u^R, |a| ≥ 2

Types:

1. long-armed: |u| > |a| - e.g., abbbcddbbba (u = abbb, a = cdd)

2. length-constrained: |u| and |a| constrained by inputted sizes

Approximate Palindromes in RLE-compressed Data

Run-length encoding (RLE): encode string as runs of consecutive and identical symbols

• bb cccc e dd aaaaa =
  • b2c2c2e1d2a2a2a1 (not optimal RLE)
  • b2c4e1d2a5 (optimal RLE)
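
The optimal encoding, with one entry per maximal run, can be sketched in Python with `itertools.groupby`. The function names and the list-of-pairs representation are assumptions chosen for illustration.

```python
from itertools import groupby


def rle_encode(s):
    """Optimal RLE: one (symbol, count) pair per maximal run of equal symbols."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def rle_to_text(runs):
    """Render runs in the slide's notation, e.g. b2c4e1d2a5."""
    return "".join(f"{symbol}{count}" for symbol, count in runs)


def rle_decode(runs):
    """Inverse of rle_encode."""
    return "".join(symbol * count for symbol, count in runs)


runs = rle_encode("bbcccceddaaaaa")
print(rle_to_text(runs))
```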

Algorithm from Chen (2012):

• INPUT: string r of length n in optimal RLE form, specified center position c, allowed number of Hamming distance mismatches k

• OUTPUT: radius of approximate (≤ k Hamming distance mismatches) palindrome in r centered at c

• Without decompressing

Preprocessing: construct array A[1 ... m], where m = number of runs in compressed string. Array elements are start positions of runs, when decompressed.

position    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
character   a  a  a  b  a  a  c  c  a  a  a  b  c  c  a  a  b  b
run         X1       X2 X3    X4    X5       X6 X7    X8    X9
RLE(r)      a3       b1 a2    c2    a3       b1 c2    a2    b2

Table: A for r = aaabaaccaaabccaabb. There are m = 9 runs.

LCP(RLE(r), 5, 6) = 2, since X4 = X7 and X3 = X8 (same character and same number of that character), but X2 ≠ X9
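
The preprocessing and the run-level LCP query can be sketched in Python as below. This is a sketch under assumptions: the names `run_starts` and `run_lcp` are hypothetical, and the query is answered here by a plain outward scan over runs rather than by any constant-time structure.

```python
from itertools import accumulate, groupby


def rle_encode(s):
    """One (symbol, count) pair per maximal run."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def run_starts(runs):
    """Array A: 1-based start position of each run in the decompressed string."""
    lengths = [count for _, count in runs]
    return [1 + offset for offset in accumulate([0] + lengths[:-1])]


def run_lcp(runs, i, j):
    """Count matching run pairs X(i-1) = X(j+1), X(i-2) = X(j+2), ...

    Run indices are 1-based; two runs match when symbol and count are equal.
    """
    left, right = i - 1, j + 1
    matched = 0
    while left >= 1 and right <= len(runs) and runs[left - 1] == runs[right - 1]:
        matched += 1
        left -= 1
        right += 1
    return matched


runs = rle_encode("aaabaaccaaabccaabb")
print(run_starts(runs))      # start positions of X1 .. X9
print(run_lcp(runs, 5, 6))   # compares X4/X7, X3/X8, X2/X9
```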

• Stage one: binary search on RLE(r), in order to find in which run c is.

• Stage two: iterate, in order to extend palindrome outward, as much as possible.

  • Iteration ends when hits run boundary on left or on right.
  • If hit left and right simultaneously, then additional LCP query is performed, so that algorithm can "jump" over other runs, in order to extend even more.

• End iterating when mismatch is found after all k allowed mismatches are used, or when hit string boundary (position 1 or n).

INPUT:

• RLE(r) for aaabaaccaaabccaabb = a3 b1 a2 c2 a3 b1 c2 a2 b2

• c = 10.5

• k = 1

position    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
character   a  a  a  b  a  a  c  c  a  a  a  b  c  c  a  a  b  b
run         X1       X2 X3    X4    X5       X6 X7    X8    X9

Stage one - binary search: c is in X5
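
Stage one can be sketched in Python with the standard `bisect` module, searching the array of run start positions. The array values below are the start positions of X1 to X9 in the decompressed example string; the function names and the handling of a half-integer center (looking up the positions on either side of it) are assumptions for this sketch.

```python
import math
from bisect import bisect_right

# Start positions of runs X1 .. X9 for r = aaabaaccaaabccaabb.
A = [1, 4, 5, 7, 9, 12, 13, 15, 17]


def run_containing(starts, position):
    """1-based index of the run that contains the given 1-based position."""
    return bisect_right(starts, position)


def runs_at_center(starts, c):
    """Runs holding the positions just left and right of center c.

    For a center between two characters (e.g. 10.5) these are positions
    floor(c) and ceil(c); for an integer center both are c itself.
    """
    return run_containing(starts, math.floor(c)), run_containing(starts, math.ceil(c))


print(runs_at_center(A, 10.5))
```

For c = 10.5 both neighboring positions, 10 and 11, fall in run 5, in agreement with the slide.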

Stage two - iterations:

1. Extend to left and right of position 10.5, to positions 10 and 11, until right hits boundary.

   radius = 1, mismatches = 0

2. Extend to positions 9 and 12, which do not match, so 1 mismatch is found. Left boundary hits X5 at same time that right boundary hits X6, so there is query of LCP(RLE(r), 5, 6), which returns two (as explained above), meaning that two runs will be jumped over, and next iteration looks at X2 and X9.

   radius = 6, mismatches = 1

3. X2 and X9 contain b. Left hits boundary of X2 before right hits boundary.

   radius = 7, mismatches = 1

4. X1 and X9 contain different characters. No more mismatches allowed, so algorithm stops.

   radius = 7, mismatches = 1
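
The outcome of the walkthrough can be cross-checked with a brute-force reference in Python that works on the decompressed string. This is explicitly not Chen's algorithm, which avoids decompressing; it is only a character-by-character check under assumptions, and the function name and the radius convention for integer centers (center character not counted) are hypothetical.

```python
import math


def approx_palindrome_radius(s, c, k):
    """Radius and mismatch count of the palindrome centered at c with <= k mismatches.

    c is a 1-based center: an integer lies on a character, x.5 lies between
    positions x and x + 1. Each step outward compares one pair of characters.
    """
    left = math.ceil(c) - 1     # 1-based position just left of the center
    right = math.floor(c) + 1   # 1-based position just right of the center
    radius = 0
    mismatches = 0
    while left >= 1 and right <= len(s):
        if s[left - 1] != s[right - 1]:
            if mismatches == k:   # no mismatches left to spend
                break
            mismatches += 1
        radius += 1
        left -= 1
        right += 1
    return radius, mismatches


print(approx_palindrome_radius("aaabaaccaaabccaabb", 10.5, 1))
```

With c = 10.5 and k = 1 the scan stops at positions 3 and 18, giving radius 7 with 1 mismatch, the same result as the four iterations above.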

Research Directions

Warburton (2004):

• Palindrome = inverted repeat (IR)
• IR = substring followed by gap and then by substring's reverse = gapped palindrome
• (and when gap is length zero, then IR = palindrome)

• Repeats and palindromes really are similar problems

• We want to apply research from repeats towards palindromes

2D palindrome = square block of text rotated around center results in same square block of text

• Discussed how to find palindromes in RLE-compressed data, without decompressing

• Want to find palindromes (or 2D palindromes) in data that was compressed using other compression schemes, without decompressing
