Stringology 2004 CRI, Haifa
Composition Alignment
Gary Benson
Departments of Computer Science and Biology
Boston University

Outline of Talk
1. Position Specific Patterns
2. Composition Patterns
3. Composition Alignment
4. Composition match scoring functions
5. Limiting the length of a composition match
6. Growth of local composition alignment scores
7. Biological examples

Position Specific Patterns
A position specific pattern, P, has the form:
$P = p_1 p_2 p_3 \ldots p_k$
where each $p_i$ is one of:
• a single character
• a choice of characters
• a weighted choice of characters

Types of Position Specific Patterns
Pattern String: A C T T G C T
Prosite Type Pattern: R K L [ W | S ] → R K L W
Don't care Characters: K C . . W . T → K C R S W L T
Regular Expression: T T* { A | C }* G → T T T T T A C C C A A G
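The pattern types above map loosely onto regular-expression syntax. The following is a minimal sketch using Python's standard `re` module; the translations of the slide notation (`[ W | S ]` to `[WS]`, `{ A | C }*` to `[AC]*`) are an assumed reading of that notation, not something the talk specifies.

```python
import re

# Each entry: (regular expression, example string taken from the slide).
examples = [
    (r"ACTTGCT", "ACTTGCT"),          # plain pattern string
    (r"RKL[WS]", "RKLW"),             # Prosite-style choice of characters
    (r"KC..W.T", "KCRSWLT"),          # '.' acts as a don't-care character
    (r"TT*[AC]*G", "TTTTTACCCAAG"),   # regular expression with repetition
]

# fullmatch requires the whole string to match the pattern.
results = [bool(re.fullmatch(pattern, text)) for pattern, text in examples]
print(results)
```

Every instance string shown on the slide is accepted by its pattern under this reading.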

Weighted Pattern
Profile:
Position    1   2   3   4   5   6   7   8
A          12   4   0   1   0   1   6   0
C           3   7   0  11   3   5   0   1
G           0   2  16   4   0   4  10   4
T           1   3   0   1  13   5   0  11
Consensus:  A   C   G   C   T   C   G   T
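The consensus row can be derived from the profile by taking the most frequent letter in each column. A minimal sketch follows; the tie-breaking rule (first letter in A, C, G, T order wins) is an assumption, chosen because it reproduces the slide's choice of C at position 6, where C and T are tied.

```python
# Rows of the profile from the slide: counts of A, C, G, T at positions 1..8.
profile = {
    "A": [12, 4, 0, 1, 0, 1, 6, 0],
    "C": [3, 7, 0, 11, 3, 5, 0, 1],
    "G": [0, 2, 16, 4, 0, 4, 10, 4],
    "T": [1, 3, 0, 1, 13, 5, 0, 11],
}

def consensus(profile):
    """Return the highest-count letter at each position (first letter wins ties)."""
    length = len(next(iter(profile.values())))
    return "".join(
        max(profile, key=lambda letter: profile[letter][pos])
        for pos in range(length)
    )

print(consensus(profile))
```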

Patterns Based on Composition
Composition is a vector quantity describing the occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then,
$C(S) = (c_{\sigma_1}, c_{\sigma_2}, c_{\sigma_3}, \ldots, c_{\sigma_{|\Sigma|}})$
is the composition of S, where $c_{\sigma_i}$ is the count of characters in S that are $\sigma_i$, and $\sum_i c_{\sigma_i} = |S|$.
Note that the order of letters in a string is irrelevant when describing the composition.

Composition Example
S = ACTGTACCTGGCGCTATT
C(S) = ( 3, 5, 4, 6 ), with components ordered A, C, G, T.
An alternate description uses frequencies rather than counts:
C(S) = ( 0.17, 0.28, 0.22, 0.33 )
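Both descriptions of the composition can be computed directly from the string. The sketch below is illustrative only; the function names and the default alphabet order `"ACGT"` are assumptions made for this example.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Count vector of s, ordered by `alphabet`."""
    counts = Counter(s)
    return tuple(counts[a] for a in alphabet)

def composition_freq(s, alphabet="ACGT"):
    """Frequency vector of s, rounded to two decimals as on the slide."""
    return tuple(round(c / len(s), 2) for c in composition(s, alphabet))

S = "ACTGTACCTGGCGCTATT"
print(composition(S))       # (3, 5, 4, 6)
print(composition_freq(S))  # (0.17, 0.28, 0.22, 0.33)
```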

Features defined by Composition
Our goal is to identify features in sequences that are defined by character composition rather than position specific patterns.
DNA contains such features.

Composition Based Sequence Features in DNA
• Isochores: multi-megabase; specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.
• CpG Islands: several hundred nucleotides; rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides alters gene expression.
• Protein binding regions: tens of nucleotides; dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.

Composition Match
We hope to identify composition based features in sequences using an alignment algorithm which includes composition matching.
Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).

Composition Match
For example, S and T below have a composition match because they each have 3 A, 5 C, 4 G, and 6 T:
S = ACTGTACCTGGCGCTATT
T = AAACCCCCGGGGTTTTTT
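The definition translates into a direct check of length and letter counts. A minimal sketch, with a hypothetical helper name:

```python
from collections import Counter

def is_composition_match(s, t):
    """True if s and t have equal length and equal letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(is_composition_match(S, T))
```

Since equal counts already imply equal lengths, the length test is redundant but mirrors the definition as stated.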

Composition Alignment Problem
Given: Sequences S and T over an alphabet Σ and a composition match scoring function cm(s, t) for any pair of substrings s and t.
Find: The best scoring alignment of S with T (global or local) where alignments include:
1. composition match between substrings of S and T,
2. single character match,
3. single character mismatch,
4. insertion and deletion.

Example of composition alignment
S = AACGTCTTTGAGCTC
T = AGCCTGACTGCCTA
Alignment
AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA
The alignment contains each type of aligned unit:
• composition match (marked <-> and <--->): GTC with CTG, and AGCTC with GCCTA
• single character match (marked |)
• single character mismatch (unmarked columns): A/G, T/A, T/C
• insertion / deletion (marked - in the gapped sequence)

Related Work
• Alignment allowing adjacent letter swap.
$O(nm)$, Lowrance and Wagner (1975)
• All swapped matchings of a pattern in a text.
$O(n m^{1/3} \log m \log |\Sigma|)$, Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)
$O(n \log m \log |\Sigma|)$, Amir, Cole, Hariharan, Lewenstein, Porat (2001)
• Composition naming
$O(n \log m \log |\Sigma|)$, Amir, Apostolico, Landau, Satta (2003)

Composition Alignment using Dynamic Programming
Given two sequences, S and T, the best alignment of the prefix strings
$S[1, i] = s_1 \ldots s_i$
$T[1, j] = t_1 \ldots t_j$
ends in one of four ways:
New:
1. composition match (including single character match)
Standard:
2. mismatch
3. insertion
4. deletion

Ways an Alignment Can End
mismatch:
S: C G T
T: C G A
insertion or deletion:
S: C A T
T: C A -
S: C A -
T: C A A
composition match:
S: C G T A C
T: C G C T A
Note that the matching suffixes will have a length l where
$1 \le l \le \min(i, j, \text{limit})$

Recursion for Composition Match
The score for a composition match in cell [i, j] of the alignment matrix, W, is
$W[i - l, j - l] + cm(s_{i-l+1} \ldots s_i,\ t_{j-l+1} \ldots t_j)$
For example, the suffixes below form a composition match with l = 3:
S: C G T A C
T: C G C T A

Time Complexity
Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is
$O(nmZ)$
where Z is the time required per (i, j) pair to find the best length l for the composition match. In the worst case, every possible length must be tested, Z = min(n, m), resulting in $O(n^3)$ time complexity.
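The recursion with every candidate length tested can be sketched as follows. This is an illustrative global-alignment version, not the speaker's implementation: the function name, the default weights (mismatch = -3, indel = -5, taken from the experiments reported later in the talk), and the restriction of `cm` to a function of match length only are all assumptions.

```python
def composition_align(S, T, cm, mismatch=-3, indel=-5, limit=3):
    """Global composition alignment score, trying every match length up to `limit`."""
    n, m = len(S), len(T)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * indel
    for j in range(1, m + 1):
        W[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard options: insertion, deletion, single character mismatch.
            best = max(W[i - 1][j], W[i][j - 1]) + indel
            if S[i - 1] != T[j - 1]:
                best = max(best, W[i - 1][j - 1] + mismatch)
            # New option: composition match of the two length-l suffixes.
            # A single character match is the case l = 1.
            diff = {}  # composition difference of the suffixes seen so far
            for l in range(1, min(i, j, limit) + 1):
                a, b = S[i - l], T[j - l]
                diff[a] = diff.get(a, 0) + 1
                diff[b] = diff.get(b, 0) - 1
                if all(v == 0 for v in diff.values()):
                    best = max(best, W[i - l][j - l] + cm(l))
            W[i][j] = best
    return W[n][m]

score = composition_align("ACG", "GAC", cm=lambda k: 2 * k)
print(score)
```

With `limit=3` the strings ACG and GAC align as one composition match; lowering the limit to 2 forces gaps around the shared AC.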

Computing the shortest composition match
Our goal is to find the length, l, of the shortest suffixes of strings S[1, i] and T[1, j] that form a composition match, and to do this for every (i, j) pair in constant time per pair (assuming a fixed size alphabet).

For example, let
S = AACGTCTTTGAGCT
T = AGCCTGACTTGGTA
(i, j)   (4, 8)   (12, 14)   (14, 3)   (7, 5)
l          3         4          0         1
Here l = 0 means that no suffixes ending at (i, j) form a composition match.

Composition difference
Composition difference is a vector quantity for two strings S and T:
$CD(S, T) = (c_{\sigma_1}, \ldots, c_{\sigma_{|\Sigma|}})$
where $c_{\sigma_i}$ is the number of times $\sigma_i$ occurs in S minus the number of times it occurs in T.

Example
Let
Σ = { 0, 1 }
S = 010111010001000
T = 010001101110111
Then
CD(S, T) = (3, −3)
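The composition difference is a per-letter subtraction of counts. A minimal sketch with a hypothetical function name, applied to the binary example:

```python
from collections import Counter

def composition_difference(s, t, alphabet):
    """Per-letter count in s minus count in t, ordered by `alphabet`."""
    cs, ct = Counter(s), Counter(t)
    return tuple(cs[a] - ct[a] for a in alphabet)

S = "010111010001000"
T = "010001101110111"
print(composition_difference(S, T, "01"))  # (3, -3)
```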

Computing Composition Difference Prefix-by-Prefix
[Figure not recovered: composition difference computed prefix by prefix.]

Key Observation
Two identical composition differences at prefix lengths h and g indicate a composition match of length h − g.
S = 010111010001000
T = 010001101110111
For example, the composition differences at prefix lengths 4 and 9 are identical, so positions 5 through 9 of S and T form a composition match of length = 5.
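The observation can be checked on the example by accumulating the composition difference one prefix at a time. This is a sketch with a hypothetical function name; prefix length 0 is included so that a match starting at the first position is also detectable.

```python
def prefix_differences(s, t, alphabet):
    """CD(s[:h], t[:h]) for h = 0 .. min(len(s), len(t))."""
    index = {a: k for k, a in enumerate(alphabet)}
    cd = [0] * len(alphabet)
    out = [tuple(cd)]
    for a, b in zip(s, t):
        cd[index[a]] += 1   # letter contributed by s
        cd[index[b]] -= 1   # letter contributed by t
        out.append(tuple(cd))
    return out

S = "010111010001000"
T = "010001101110111"
cds = prefix_differences(S, T, "01")
g, h = 4, 9
print(cds[g], cds[h])          # the two differences are equal
print(S[g:h], T[g:h], h - g)   # matching substrings and their length
```

Equal differences at g and h mean the letters added between the two prefixes cancel exactly, which is the definition of a composition match on that stretch.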

Sort prefix-by-prefix composition differences
Sort on composition difference using a stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.

Time complexity for composition matches
$O(nm|\Sigma|)$ to record all composition differences and sort (counting sort) to find shortest composition match lengths for every (i, j) pair for two strings of length n and m.
In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.
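The all-pairs computation can be sketched by processing one diagonal of the (i, j) grid at a time. The version below is illustrative and deliberately simplified: it uses Python's built-in stable comparison sort rather than the counting sort named on the slide, so it carries an extra logarithmic factor, and the function name and table layout are assumptions.

```python
def shortest_match_lengths(S, T, alphabet):
    """L[i][j] = length of the shortest composition match ending at (i, j), or 0."""
    n, m = len(S), len(T)
    index = {a: k for k, a in enumerate(alphabet)}
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for offset in range(-(m - 1), n):            # diagonal with i - j = offset
        i0, j0 = max(offset, 0), max(-offset, 0)
        steps = min(n - i0, m - j0)
        cd = [0] * len(alphabet)
        tuples = [(tuple(cd), 0)]                # (composition difference, prefix length)
        for h in range(1, steps + 1):
            cd[index[S[i0 + h - 1]]] += 1
            cd[index[T[j0 + h - 1]]] -= 1
            tuples.append((tuple(cd), h))
        # Stable sort on the difference keeps prefix lengths increasing
        # within each group of equal differences.
        tuples.sort(key=lambda item: item[0])
        for (cd_g, g), (cd_h, h) in zip(tuples, tuples[1:]):
            if cd_g == cd_h:
                L[i0 + h][j0 + h] = h - g
    return L

S = "AACGTCTTTGAGCT"
T = "AGCCTGACTTGGTA"
L = shortest_match_lengths(S, T, "ACGT")
print(L[4][8], L[12][14], L[14][3], L[7][5])
```

The four printed entries correspond to the (i, j) pairs in the example table for these two strings.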

Computing an alignment is one thing. Scoring it is another.
How do we score a composition match? We have explored:
Functions based on match length, k:
• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k
where c is a constant.
Functions based on substring composition:
• Function 3: cm(C, B, k) = ck · H(C, B)
where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
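The three scoring functions can be written out as below. This is a sketch under stated assumptions: the slide does not give the exact form of H, so the usual relative entropy over frequency vectors with base-2 logarithms is assumed, and the default c = 2 is borrowed from the experiments reported later in the talk.

```python
import math

def cm_linear(k, c=2.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=2.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """Assumed form: H(C, B) = sum_i C_i * log2(C_i / B_i), skipping C_i = 0."""
    return sum(ci * math.log2(ci / bi) for ci, bi in zip(C, B) if ci > 0)

def cm_entropy(C, B, k, c=2.0):
    """Function 3: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)   # background composition, order A, C, G, T
gc_rich = (0.0, 0.5, 0.5, 0.0)       # hypothetical composition of a matched substring
print(cm_linear(4), cm_sqrt(4))
print(cm_entropy(uniform, uniform, 10))   # composition equal to background
print(cm_entropy(gc_rich, uniform, 10))   # skewed composition scores higher
```

Under this form, a match whose composition equals the background scores zero, so Function 3 rewards only matches with unusual composition.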

Additive and sub-additive scoring functions
The functions based on length are additive or sub-additive:
$cm(i + j) \le cm(i) + cm(j)$
Lemma: For additive or sub-additive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.
Theorem: Composition alignment with additive or sub-additive match scoring functions and finite alphabet has time complexity $O(nm)$.
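The lemma suggests a dynamic program that consults a single candidate length per cell. The sketch below is one possible realization, not the speaker's code: it finds shortest matches with a dictionary of most recent prefix lengths in place of the sort described earlier, treats `cm` as a function of length only, and computes a global score with assumed default weights.

```python
def shortest_matches(S, T):
    """L[i][j] = length of the shortest composition match ending at (i, j), or 0."""
    n, m = len(S), len(T)
    index = {a: k for k, a in enumerate(sorted(set(S) | set(T)))}
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for offset in range(-(m - 1), n):        # one diagonal i - j = offset
        i, j = max(offset, 0), max(-offset, 0)
        cd = [0] * len(index)
        last = {tuple(cd): 0}                # most recent prefix length per difference
        h = 0
        while i < n and j < m:
            cd[index[S[i]]] += 1
            cd[index[T[j]]] -= 1
            i, j, h = i + 1, j + 1, h + 1
            key = tuple(cd)
            if key in last:
                L[i][j] = h - last[key]
            last[key] = h
    return L

def composition_align_fast(S, T, cm, mismatch=-3, indel=-5, limit=3):
    """Global composition alignment score using only shortest composition matches."""
    n, m = len(S), len(T)
    L = shortest_matches(S, T)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * indel
    for j in range(1, m + 1):
        W[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(W[i - 1][j], W[i][j - 1]) + indel
            if S[i - 1] != T[j - 1]:
                best = max(best, W[i - 1][j - 1] + mismatch)
            l = L[i][j]                      # the only composition match length tried
            if 0 < l <= limit:
                best = max(best, W[i - l][j - l] + cm(l))
            W[i][j] = best
    return W[n][m]

print(composition_align_fast("ACG", "GAC", cm=lambda k: 2 * k))
```

The reason one length suffices: a composition match of length k splits into the shortest match ending at the same cell plus a remaining composition match, and for additive or sub-additive cm the split scores at least as much as the original.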

The limit parameter
Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.
The limit parameter is an upper bound on the length l of the longest single composition match. It can be used to prevent excessive matching.

Investigation of the limit parameter
Percent of characters counted as matching. Sequence length = limit. Ungapped aligned iid sequences.
Sequence length    1     2     3     4     5     6     7     8     9    10
Binary (%)       50.0  62.5  68.7  72.7  75.6  77.3  78.9  80.3  81.3  82.4
DNA (%)          25.0  30.0  32.3  35.3  37.5  39.7  40.7  42.4  43.3  44.2
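For very short sequences, the quantity in the table can be computed exactly by enumerating all pairs of strings. The sketch below assumes that "counted as matching" means the largest number of characters that can be covered by disjoint composition matches of length at most the limit, and that letters are uniform and iid. The slide does not say whether its figures are exact or simulated, so the output need not agree with every table entry.

```python
from itertools import product

def max_matched(s, t, limit):
    """Most characters coverable by composition matches of length <= limit (ungapped)."""
    n = len(s)
    best = [0] * (n + 1)
    for h in range(1, n + 1):
        best[h] = best[h - 1]                    # leave position h as a mismatch
        for l in range(1, min(h, limit) + 1):
            if sorted(s[h - l:h]) == sorted(t[h - l:h]):
                best[h] = max(best[h], best[h - l] + l)
    return best[n]

def percent_matching(alphabet, length, limit):
    """Exact average over all pairs of strings of the given length."""
    strings = list(product(alphabet, repeat=length))
    total = sum(max_matched(s, t, limit) for s in strings for t in strings)
    return 100.0 * total / (len(strings) ** 2 * length)

print(percent_matching("01", 1, 1), percent_matching("01", 2, 2))
```

The enumeration grows as the square of |Σ|^length, so it is practical only for the first few columns.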

Investigation of the limit parameter
Percent of characters matching. Sequence length = 100. Limit < sequence length. Ungapped aligned iid sequences.
limit                          1     2     5     9    10
DNA (all letters p = 0.25)    25   33.7  44.4   50    51
Note that at limit < 9, more than half the aligned letters are expected to form a mismatch.
C T G G C T A A T
C A G C C G G G G

Logarithmic and linear growth of alignment scores
For standard local alignment, the parameter space for match and mismatch weights is divided into logarithmic and linear regions.
In the logarithmic region, alignment scores grow in proportion to the log of the sequence lengths.
In the linear region, alignment scores grow in direct proportion to sequence length.
It is generally accepted that weight combinations that fall within the logarithmic region are useful for detecting biologically related sequences.

Predicting local alignment scores with global alignment scores
The rule for determining if parameter weights fall within the logarithmic region is to look at:
• the expected score per aligned letter pair (ungapped alignments)
• the expected global alignment score (gapped alignments)
For composition alignment, these rules do not apply.
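For standard ungapped alignment, the first check amounts to computing the expected score of one aligned letter pair; a negative value is the usual indicator of the logarithmic region. A small sketch, using the match and mismatch weights from the experiments in this talk as example inputs (the function name is hypothetical):

```python
def expected_pair_score(probs, match, mismatch):
    """Expected score of one aligned letter pair for iid letters with the given probabilities."""
    p_match = sum(p * p for p in probs)   # chance that two independent letters agree
    return p_match * match + (1.0 - p_match) * mismatch

e = expected_pair_score([0.25] * 4, match=2, mismatch=-3)
print(e)  # -1.75
```

This covers only single-letter columns, which is consistent with the slide's point that the rule does not carry over once composition matches are allowed.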

Global score as a predictor of local parameter suitability
[Figure: Average Global Composition Alignment Scores: DNA Sequences, Function 1. x-axis: Sequence Length (100 to 900); y-axis: Score (-400 to 100); one curve each for Limit = 2, 3, 4, 5.]
Scoring function: cm(k) = ck; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5

Growth of local alignment score
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 1. x-axis: Sequence Length (ticks at 100, 200, 400, 800, 1000); y-axis: Score (0 to 120); one curve each for Limit = 2, 3, 4.]
Scoring function: cm(k) = ck; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5

Global score as a predictor of local parameter suitability
[Figure: Global Composition Alignment Scores: DNA Sequences, Function 2. x-axis: Sequence Length (0 to 900); y-axis: Score (-200 to 0); curves labeled 10, 20, 30, 50.]
Scoring function: cm(k) = c√k; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5

Growth of local alignment score
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 2. x-axis: Sequence Length (ticks at 100, 200, 400, 800, 1000); y-axis: Score (0 to 100); curves labeled 50, 30, 20, 10, 6.]
Scoring function: cm(k) = c√k; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5

Limit values for DNA
• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 3: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Biological examples
Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.
Two local alignment scores were produced using function 1:
• W using composition alignment
• S using standard alignment.
The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.

Example 1
Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.
Composition Alignment:
GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC
Standard Alignment:
CGCCGCCGCCG
CGCCGCCGCCG
Two genes, Vdac and Bcl2. Vdac forms a channel through mitochondrial membranes for small molecules. Bcl2 regulates cell death by controlling mitochondrial membrane permeability.

Example 2
Composition alignment of two promoter sequences. Composition changes at vertical line.
Composition (A, C, G, T):
Left: (0.01, 0.61, 0.30, 0.08)
Right: (0.19, 0.16, 0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
Two genes, EP73298, EP11149. Function of these genes is not known.

Conclusion
We
• define a new alignment problem based on composition matching and test several scoring functions
• show how to find all-pairs shortest composition match lengths in constant time per pair for a fixed alphabet
• show that alignment using scoring functions based on sequence length requires only finding shortest composition matches
• give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments

Composition Patterns
Goal:
Identify features in sequences that are defined by character composition rather than position specific patterns.
Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then,
$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \ldots, f_{\sigma_{|\Sigma|}})$
is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.

The limit parameter
Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.
The limit parameter, an upper bound on the length l of the longest single composition match, can be used to prevent excessive matching.
Sequence length = 100, randomly generated
limit                          1     2     5    10
DNA (all letters p = 0.25)    25   33.7  44.4   51