8/13/2019 28 Schmidt
http://slidepdf.com/reader/full/28-schmidt 1/46
Quadratic Time Algorithms
for Finding Common Intervals
in Two and More Sequences
Thomas Schmidt
Jens Stoye
CPM 2004, Istanbul
Thomas Schmidt: Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences 3
Gene Order and Function in Bacteria:
Observations:
- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster
Are there more clusters?
Task: Establish a model and search for gene clusters
Formalization of Gene Clusters:
Genomes: permutations π1, π2, …, πk
Genes: numbers 1, …, n
π1: 1 2 3 4 5 6 7 8
π2: 8 7 6 4 5 2 1 3
π3: 3 1 2 5 8 7 6 4
π4: 6 7 4 2 1 3 8 5
Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)
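The definition can be checked directly on the four example permutations. The following is a minimal brute-force sketch, not one of the algorithms cited in this talk; the function name `common_intervals` is hypothetical, and single genes are ignored as trivial intervals.

```python
def common_intervals(perms):
    """Return all gene sets of size >= 2 that are contiguous in every permutation.

    Brute force over all intervals of the first permutation (illustration only).
    """
    first, others = perms[0], perms[1:]
    # Position of each gene in every other permutation.
    positions = [{gene: idx for idx, gene in enumerate(p)} for p in others]
    found = []
    n = len(first)
    for i in range(n):
        for j in range(i + 1, n):
            genes = first[i:j + 1]
            # In a permutation, a set is contiguous iff its positions span
            # exactly len(genes) - 1.
            if all(max(pos[g] for g in genes) - min(pos[g] for g in genes) == j - i
                   for pos in positions):
                found.append(frozenset(genes))
    return found


perms = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 4, 5, 2, 1, 3],
    [3, 1, 2, 5, 8, 7, 6, 4],
    [6, 7, 4, 2, 1, 3, 8, 5],
]
clusters = common_intervals(perms)
```

For these permutations the result contains {1,2,3} and {6,7}, but not {4,5}, because 4 and 5 are not adjacent in π3.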
Algorithms:
- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.
Modeling multiple copies of a gene (paralogs):
Problem:
- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair
Solution:
- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number
Consequence:
- Genomes are modeled as sequences instead of permutations
S1: 1 2 3 4 5 6 7 8
S2: 3 1 2 4 8 7 6 1 2
S3: 8 7 6 7 5 4 2 1 3
Overview:
• Introduction
- Comparative genomics
- Common Intervals and Gene Clusters
• Formal Model
• Algorithms
- Simple Data Structure: Quadratic Space
- Saving Space
• Results
Formal Model:
Given: String S over a finite alphabet Σ
Notation: S[i] = the i-th character of S
S[i,j] = substring of S starting at index i and ending at index j
Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].
Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6
CS(S[2,5]) = {1,2,3}
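As a small illustration of this definition, the character set of a substring can be computed as below. This is only a sketch; the helper name `character_set` is hypothetical, and the 1-based, inclusive indices follow the notation of the slide.

```python
def character_set(s, i, j):
    """CS(S[i,j]) for 1-based, inclusive indices i <= j."""
    return set(s[i - 1:j])


S = [3, 1, 2, 3, 1, 5, 2, 6]
example = character_set(S, 2, 5)  # substring 1 2 3 1
```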
Formal Model:
Given: Subset C ⊆ Σ
Definition: (i,j) is a CS-location of C in S if and only if CS(S[i,j]) = C
left-maximal: S[i-1] ∉ C
right-maximal: S[j+1] ∉ C
maximal: both left- and right-maximal
Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6
The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
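The three predicates can be sketched as follows. The function names are hypothetical, and one assumption is added that the slide does not state: a location touching the start or end of S is treated as maximal on that side.

```python
def is_cs_location(s, i, j, c):
    """True if CS(S[i,j]) equals C (1-based, inclusive indices)."""
    return set(s[i - 1:j]) == set(c)


def is_left_maximal(s, i, c):
    # Assumption: the string boundary counts as "not in C".
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, j, c):
    # Assumption: the string boundary counts as "not in C".
    return j == len(s) or s[j] not in c


S = [3, 1, 2, 3, 1, 5, 2, 6]
C = {1, 2, 3}
```

With these definitions, (3,5) is a CS-location of C that is right-maximal but not left-maximal, while (1,5) is a maximal CS-location of the same set.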
Formal Model:
Given: Collection of k strings S* = (S1,...,S_k) over alphabet Σ
Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each S_l, 1 ≤ l ≤ k.
Example:
index: 1 2 3 4 5 6 7 8 9
S1:    3 2 1 3 1 5 1 6
S2:    4 3 5 5 5 1 4 2 2
S3:    7 5 1 5 3 6 5
common CS-factor: {1,3,5} => S1: (3,7), S2: (2,6), S3: (2,5)
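The locations in this example can be reproduced with a simple scan. The sketch below (hypothetical name `maximal_cs_locations`) walks over each string, collects maximal runs of characters from C, and keeps a run only if it contains every character of C. Positions are 1-based.

```python
def maximal_cs_locations(s, c):
    """All maximal CS-locations (i, j) of the set c in s, 1-based and inclusive."""
    c = set(c)
    locations = []
    start = None
    # p runs one step past the end so that a run touching the end is closed too.
    for p in range(1, len(s) + 2):
        inside = p <= len(s) and s[p - 1] in c
        if inside and start is None:
            start = p
        elif not inside and start is not None:
            if set(s[start - 1:p - 1]) == c:
                locations.append((start, p - 1))
            start = None
    return locations


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
found = [maximal_cs_locations(s, {1, 3, 5}) for s in (S1, S2, S3)]
```

Since every string yields at least one location, {1,3,5} is a common CS-factor of the three strings.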
Problem Formulation:
A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.
Given a collection of k strings S*:
Problem 1: Find all common CS-factors in S*.
Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.
Algorithm "Connecting Intervals" (CI)
• Algorithm CI solves Problem 1 and Problem 2 for two sequences
• Input: Two sequences of length up to n with characters drawn
from Σ = {1,...,m}, m ≤ 2n
• Output: Pairs of CS-locations of all common CS-factors
• Time & Space complexity: O(n²)
Preprocessing
Compute two tables for S1 = (3,1,2,3,1,5,2,6):
POS[c] holds all positions where character c occurs in S1.
NUM(i,j) counts the number of different characters in S1[i,j].
POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8
NUM(i,j) for i ≤ j (rows i, columns j):
   j:  1 2 3 4 5 6 7 8
i = 1: 1 2 3 3 3 4 4 5
i = 2:   1 2 3 3 4 4 5
i = 3:     1 2 3 4 4 5
i = 4:       1 2 3 4 5
i = 5:         1 2 3 4
i = 6:           1 2 3
i = 7:             1 2
i = 8:               1
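Both tables can be built in quadratic time with a straightforward double loop. The following is a sketch under the slide's 1-based indexing; the names `build_pos` and `build_num` and the dictionary layout are illustrative choices, not taken from the talk.

```python
from collections import defaultdict


def build_pos(s):
    """POS[c]: sorted list of 1-based positions at which character c occurs in s."""
    pos = defaultdict(list)
    for p, c in enumerate(s, start=1):
        pos[c].append(p)
    return dict(pos)


def build_num(s):
    """NUM[(i, j)]: number of different characters in s[i..j], for all i <= j."""
    n = len(s)
    num = {}
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            num[(i, j)] = len(seen)
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1)
NUM = build_num(S1)
```

Characters that do not occur in S1, such as 4, simply have no entry in `POS`, which corresponds to "empty" in the table above.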
Algorithm CI
Algorithm: While reading S2, mark the observed character in S1 and track maximal intervals of marked characters.
index: 1 2 3 4 5 6 7 8 9
S2:    4 3 5 5 5 1 4 2 2
S1:    3 1 2 3 1 5 2 6
Output: ((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))
Each pair lists the CS-location in S2 followed by the CS-location in S1.
(i,j) is not left-maximal!
Time Complexity
for i = 1,...,|S2| do
    j = i
    while j ≤ |S2| and (i,j) is maximal do
        if c = S2[j] is seen for the first time then
            for each entry in POS[c] do
                mark the position in S1 and track the intervals of marked characters
            end for
        end if
        j = j + 1
    end while
end for
For a fixed i, each character is seen for the first time at most once, so the inner for loop visits every entry of POS at most once. This gives O(n) work per start position i.
Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.
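The mark-and-track idea can be illustrated with a simplified sketch. It is not the authors' implementation: instead of maintaining the intervals incrementally with POS and NUM, it rescans S1 whenever (i,j) is right-maximal, so it does not achieve the O(n²) bound. The function name `connecting_intervals` is hypothetical, all positions are 1-based, and string boundaries are assumed to count as maximal.

```python
def connecting_intervals(s1, s2):
    """Pairs ((i, j), (a, b)): maximal CS-locations in s2 and s1 of a common CS-factor."""
    results = []
    n1, n2 = len(s1), len(s2)
    alphabet1 = set(s1)
    for i in range(1, n2 + 1):
        left_neighbor = s2[i - 2] if i > 1 else None
        seen = set()  # character set C of s2[i..j]
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if c == left_neighbor:
                break  # s2[i-1] would join C: (i, j) is no longer left-maximal
            if c not in alphabet1:
                break  # C can no longer have a CS-location in s1
            seen.add(c)
            if j < n2 and s2[j] in seen:
                continue  # (i, j) is not right-maximal yet
            # Scan s1 for maximal runs of marked characters (characters in C).
            start = None
            for p in range(1, n1 + 2):
                marked = p <= n1 and s1[p - 1] in seen
                if marked and start is None:
                    start = p
                elif not marked and start is not None:
                    # The run is a CS-location of C iff it contains all of C.
                    if len(set(s1[start - 1:p - 1])) == len(seen):
                        results.append(((i, j), (start, p - 1)))
                    start = None
    return results


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
pairs = connecting_intervals(S1, S2)
```

On the example sequences, the first reported pairs are those for start position i = 2 in S2: the single character 3 at S1 positions 1 and 4, followed by the set {1,3,5} at S2 location (2,6) and S1 location (4,6).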
Multiple Genomes
Goal: Find all common CS-factors of a collection S* = (S1,S2,...,S_k)
Algorithm:
1. Apply Algorithm CI to all pairs (S1,S_l), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs
Time complexity: O(kn²)
Space complexity: O(kn²) with redundant output, O(n²) otherwise
Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*
Time complexity: O(k(1+k-k')n²)
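The two steps can be sketched as follows. As a stand-in for Algorithm CI, this example uses a brute-force enumeration of character sets, so it only illustrates the pairwise-then-intersect structure and not the stated complexity. The names `cs_factors` and `common_cs_factors` are hypothetical, and at least two strings are assumed.

```python
def cs_factors(s):
    """All character sets of substrings of s, as frozensets (brute force)."""
    found = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            found.add(frozenset(seen))
    return found


def common_cs_factors(strings):
    """CS-factors shared by the first string and every other string."""
    first_factors = cs_factors(strings[0])
    # Step 1: common CS-factors of each pair (S1, S_l).
    pairwise = [first_factors & cs_factors(other) for other in strings[1:]]
    # Step 2: keep only the factors detected in all pairs.
    return set.intersection(*pairwise)


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
common = common_cs_factors([S1, S2, S3])
```

For the three example strings from the formal model, the result contains {1,3,5} but not {2}, since the character 2 does not occur in S3.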
Saving Space
• Because it stores the table NUM, Algorithm CI requires quadratic space.
• An algorithm presented by Didier (WABI 2003) detects all common CS-factors of two sequences in O(n² log n) time and linear space.
• In a modified version that replaces a binary search with a constant-time Range Maximum Query, the time complexity can be reduced to O(n²) while the space remains linear.
Results on real data
• Data set:
- 43 bacterial genome sequences from NCBI
- All classified in the "Clusters of Orthologous Groups of Proteins"
database (COG)
- Genes are identified by their COG number
- Computation time: approx. 5-10 minutes on a standard PC
Results on real data (k' = 2)
[Figures: results for all 43 genomes and without closely related genomes (k = 32), each shown for cluster size ≥ 2 and cluster size ≥ 3]
Thank you!