Stringology 2004, CRI, Haifa

Composition Alignment

Gary Benson
Departments of Computer Science and Biology
Boston University


Outline of Talk

1. Position Specific Patterns
2. Composition Patterns
3. Composition Alignment
4. Composition match scoring functions
5. Limiting the length of a composition match
6. Growth of local composition alignment scores
7. Biological examples


Position Specific Patterns

A position specific pattern, P, has the form:

$P = p_1 p_2 p_3 \dots p_k$

where $p_i$ is either:

• a single character
• a choice of characters
• a weighted choice of characters


Types of Position Specific Patterns

Pattern String: A C T T G C T

Prosite Type Pattern: R K L [ W | S ] → R K L W

Don't care Characters: K C . . W . T → K C R S W L T

Regular Expression: T T* { A | C }* G → T T T T T A C C C A A A G


Weighted Pattern

Profile:

Position   1   2   3   4   5   6   7   8
A         12   4   0   1   0   1   6   0
C          3   7   0  11   3   5   0   1
G          0   2  16   4   0   4  10   4
T          1   3   0   1  13   5   0  11

Consensus: A C G C T C G T


Patterns Based on Composition

Composition is a vector quantity describing the occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then,

$C(S) = (c_{\sigma_1}, c_{\sigma_2}, c_{\sigma_3}, \dots, c_{\sigma_{|\Sigma|}})$

is the composition of S, where $c_{\sigma_i}$ is the count of characters in S that are $\sigma_i$ and $\sum_i c_{\sigma_i} = |S|$.

Note that the order of letters in a string is irrelevant when describing the composition.


Composition Example

S = ACTGTACCTGGCGCTATT

C(S) = ( 3, 5, 4, 6 ), with components ordered A, C, G, T

An alternate description uses frequencies rather than counts:

C(S) = ( 0.17, 0.28, 0.22, 0.33 )
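The counts and frequencies in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function names `composition` and `composition_frequencies` and the default alphabet order A, C, G, T are illustrative choices, not part of the talk.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the count of each alphabet letter in s, in alphabet order."""
    counts = Counter(s)
    return tuple(counts[a] for a in alphabet)

def composition_frequencies(s, alphabet="ACGT"):
    """Return the same vector as fractions of len(s), rounded to 2 places."""
    return tuple(round(c / len(s), 2) for c in composition(s, alphabet))

S = "ACTGTACCTGGCGCTATT"
print(composition(S))              # (3, 5, 4, 6)
print(composition_frequencies(S))  # (0.17, 0.28, 0.22, 0.33)
```

Because only counts are kept, any permutation of S gives the same vector.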


Features defined by Composition

Our goal is to identify features in sequences that are defined by character composition rather than position specific patterns.

In DNA there are such features.


Composition Based Sequence Features in DNA

• Isochores – Multi-megabase; specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.

• CpG Islands – Several hundred nucleotides; rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides alters gene expression.

• Protein binding regions – Tens of nucleotides; dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.


Composition Match

We hope to identify composition based features in sequences using an alignment algorithm which includes composition matching.

Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).



For example, S and T below have a composition match because they each have 3 A, 5 C, 4 G and 6 T:

S = ACTGTACCTGGCGCTATT

T = AAACCCCCGGGGTTTTTT
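The definition translates directly into a test on letter counts. A minimal sketch follows; the helper name `is_composition_match` is hypothetical and chosen only for this illustration.

```python
from collections import Counter

def is_composition_match(s, t):
    """True when s and t have equal length and identical letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(is_composition_match(S, T))  # True
```

Equal counts already imply equal length; the explicit length test simply mirrors the wording of the definition.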


Composition Alignment Problem

Given: Sequences S and T over an alphabet Σ and a composition match scoring function cm(s, t) for any pair of substrings s and t.

Find: The best scoring alignment of S with T (global or local) where alignments include:

1. composition match between substrings of S and T,
2. single character match,
3. single character mismatch,
4. insertion and deletion.


Example of composition alignment

S = AACGTCTTTGAGCTC

T = AGCCTGACTGCCTA

Alignment

AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA

The alignment contains:

• composition matches (<-> and <--->): GTC with CTG, and AGCTC with GCCTA
• single character matches (|): columns 1, 3 and 9
• single character mismatches (blank): columns 2, 7 and 8
• an insertion / deletion (-): column 10


Related Work

• Alignment allowing adjacent letter swap.
$O(nm)$, Lowrance and Wagner (1975)

• All swapped matchings of a pattern in a text.
$O(nm^{1/3} \log m \log|\Sigma|)$, Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)
$O(n \log m \log|\Sigma|)$, Amir, Cole, Hariharan, Lewenstein, Porat (2001)

• Composition naming
$O(n \log m \log|\Sigma|)$, Amir, Apostolico, Landau, Satta (2003)


Composition Alignment using Dynamic Programming

Given two sequences, S and T, the best alignment of the prefix strings

$S[1, i] = s_1 \dots s_i$

$T[1, j] = t_1 \dots t_j$

ends in one of four ways:

New:

1. composition match (including single character match)

Standard:

2. mismatch
3. insertion
4. deletion


Ways an Alignment Can End

mismatch

S: C G T
T: C G A

insertion or deletion

S: C A T
T: C A -

S: C A -
T: C A A

composition match

S: C G T A C
T: C G C T A

Note that the suffixes forming a composition match will have a length l where

$1 \le l \le \min(i, j, \text{limit})$


Recursion for Composition Match

The score for a composition match of length l ending in cell [i, j] of the alignment matrix, W, is

$W[i - l, j - l] + cm(s_{i-l+1} \dots s_i, \; t_{j-l+1} \dots t_j)$

(The slide illustrates the case l = 3.)


Time Complexity

Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is

$O(nmZ)$

where Z is the time required per (i, j) pair to find the best length l for the composition match. In the worst case, every possible length must be tested, Z = min(n, m), resulting in $O(n^3)$ time complexity.
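The recursion can be sketched as a straightforward global dynamic program that tries every composition match length up to the limit in each cell. This is an illustrative sketch rather than the author's implementation: the function name, the default weights (taken from the figure captions later in the talk: c = 2, mismatch = -3, indel = -5) and the use of `Counter` to test each candidate suffix pair are assumptions. Comparing counters costs O(l) per length, so this version is slower than the O(nmZ) bound.

```python
from collections import Counter

def composition_alignment_score(s, t, limit, cm=lambda k: 2 * k,
                                mismatch=-3, indel=-5):
    """Global composition alignment score, testing every match length <= limit."""
    n, m = len(s), len(t)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * indel
    for j in range(1, m + 1):
        W[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(W[i - 1][j] + indel, W[i][j - 1] + indel)
            if s[i - 1] != t[j - 1]:
                best = max(best, W[i - 1][j - 1] + mismatch)
            # Composition match of length l (l = 1 is a single character match).
            for l in range(1, min(i, j, limit) + 1):
                if Counter(s[i - l:i]) == Counter(t[j - l:j]):
                    best = max(best, W[i - l][j - l] + cm(l))
            W[i][j] = best
    return W[n][m]

print(composition_alignment_score("AC", "CA", limit=2))  # 4
print(composition_alignment_score("AC", "CA", limit=1))  # -6
```

With limit = 1 the procedure reduces to standard global alignment, which is why the swapped pair scores as two mismatches in the second call.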


Computing the shortest composition match

Our goal is to find the length, l, of the shortest suffixes of strings S[1, i] and T[1, j] such that they form a composition match, and to do this for every (i, j) pair in constant time per pair (assuming a fixed-size alphabet).

For example, let

S = AACGTCTTTGAGCT

T = AGCCTGACTTGGTA

(i, j)   (4, 8)   (12, 14)   (14, 3)   (7, 5)
l        3        4          0         1

The matching suffixes are ACG and GAC for (4, 8), TGAG and GGTA for (12, 14), and T and T for (7, 5). For (14, 3) no suffixes form a composition match, so l = 0.


Composition difference

Composition difference is a vector quantity for two strings S and T:

$CD(S, T) = (c_{\sigma_1}, \dots, c_{\sigma_{|\Sigma|}})$

where $c_{\sigma_i}$ is the number of times $\sigma_i$ occurs in S minus the number of times it occurs in T.


Example

Let

Σ = { 0, 1 }

S = 010111010001000

T = 010001101110111

Then

CD(S, T) = (3, -3)
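The example can be checked with a short function. This is a minimal sketch; the name `composition_difference` and the explicit `alphabet` argument are illustrative choices.

```python
def composition_difference(s, t, alphabet):
    """Count of each letter in s minus its count in t, in alphabet order."""
    return tuple(s.count(a) - t.count(a) for a in alphabet)

S = "010111010001000"
T = "010001101110111"
print(composition_difference(S, T, "01"))  # (3, -3)
```

When the two strings have equal length the components sum to zero, and the difference is the zero vector exactly when the strings form a composition match.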


Computing Composition Difference Prefix-by-Prefix

Key Observation

Two identical composition differences at prefix lengths h and g (with g < h) indicate a composition match of length h - g.

S = 010111010001000

T = 010001101110111

Here the composition differences at prefix lengths 4 and 9 are both (-1, 1), so positions 5 through 9 of S and T form a composition match of length 5.


Sort prefix-by-prefix composition differences

Sort on composition difference using a stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
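The sorting step can be sketched for a single pair of equal-position prefixes as follows. The function name is hypothetical, and Python's built-in sort (which is stable, but comparison based) stands in for whatever stable sort the talk has in mind.

```python
def shortest_matches_by_sorting(s, t, alphabet):
    """Map prefix length h -> length of the shortest composition match ending at h."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    tuples = [(tuple(diff), 0)]          # (composition difference, prefix length)
    for h, (a, b) in enumerate(zip(s, t), start=1):
        diff[index[a]] += 1
        diff[index[b]] -= 1
        tuples.append((tuple(diff), h))
    # Stable sort: tuples with equal differences stay in prefix-length order.
    tuples.sort(key=lambda pair: pair[0])
    shortest = {}
    for (d1, g), (d2, h) in zip(tuples, tuples[1:]):
        if d1 == d2:
            shortest[h] = h - g
    return shortest

S = "010111010001000"
T = "010001101110111"
print(shortest_matches_by_sorting(S, T, "01")[9])  # 2
```

For prefix length 9 the nearest earlier prefix with the same difference is length 7, so the shortest composition match ending there has length 2; the match of length 5 between prefix lengths 4 and 9 is a longer one ending at the same position.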


Time complexity for composition matches

$O(nm|\Sigma|)$ to record all composition differences and sort (counting sort) to find shortest composition match lengths for every (i, j) pair for two strings of length n and m.

In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.
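One way to obtain the shortest match length for every (i, j) pair is to walk each diagonal of the alignment matrix once, keeping a running composition difference. The sketch below swaps the counting sort for a dictionary that remembers the most recent step at which each difference was seen; that substitution, the function name and the default alphabet are assumptions made for illustration.

```python
def shortest_match_lengths(s, t, alphabet="ACGT"):
    """L[i][j] = length of the shortest composition match ending at (i, j), or 0."""
    n, m = len(s), len(t)
    index = {ch: k for k, ch in enumerate(alphabet)}
    L = [[0] * (m + 1) for _ in range(n + 1)]
    # Every diagonal starts on the top row or the left column of the matrix.
    starts = [(0, b) for b in range(m + 1)] + [(a, 0) for a in range(1, n + 1)]
    for i, j in starts:
        diff = [0] * len(alphabet)
        last_seen = {tuple(diff): 0}     # difference -> most recent step
        step = 0
        while i < n and j < m:
            diff[index[s[i]]] += 1
            diff[index[t[j]]] -= 1
            i, j, step = i + 1, j + 1, step + 1
            key = tuple(diff)
            if key in last_seen:
                L[i][j] = step - last_seen[key]
            last_seen[key] = step
    return L

L = shortest_match_lengths("AACGTCTTTGAGCT", "AGCCTGACTTGGTA")
print(L[4][8], L[12][14], L[14][3], L[7][5])  # 3 4 0 1
```

Each cell is visited once and costs O(|Σ|) to build the dictionary key, which is in line with the bound stated above in expectation.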


Computing an alignment is one thing. Scoring it is another.

How do we score a composition match? We have explored:

Functions based on match length, k:

• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k

where c is a constant.

Functions based on substring composition:

• Function 3: cm(C, B, k) = ck · H(C, B)

where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
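The three scoring functions can be written down directly. In this sketch the function names, the default c = 2 and the base-2 logarithm in the relative entropy are assumptions; the talk does not state the base, and C and B are taken here to be frequency vectors.

```python
import math

def cm_linear(k, c=2.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=2.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum_i C_i * log2(C_i / B_i), skipping zero entries of C."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=2.0):
    """Function 3: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_linear(3))                        # 6.0
print(cm_sqrt(4))                          # 4.0
print(cm_entropy((1, 0, 0, 0), uniform, 3))  # 12.0
```

A match whose composition equals the background scores zero under Function 3, while a strongly skewed composition scores highest.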


Additive and sub-additive scoring functions

The functions based on length are additive or sub-additive:

cm(i + j) ≤ cm(i) + cm(j)

Lemma: For additive or sub-additive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.

Theorem: Composition alignment with additive or sub-additive match scoring functions and finite alphabet has time complexity O(nm).
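One way to see why the lemma is plausible is to split a long composition match at its shortest matching suffix. The following is a sketch of that argument in the notation of the earlier slides, not a reproduction of the author's proof.

```latex
% Sketch: splitting a composition match never lowers the score
% when cm is additive or sub-additive.
\begin{align*}
&\text{Let } s_{i-l+1}\dots s_i \text{ and } t_{j-l+1}\dots t_j \text{ form a composition match, and let the}\\
&\text{shortest composition match ending at } (i, j) \text{ have length } l' < l. \text{ Then}\\
&C(s_{i-l+1}\dots s_{i-l'}) = C(s_{i-l+1}\dots s_i) - C(s_{i-l'+1}\dots s_i)\\
&\qquad = C(t_{j-l+1}\dots t_j) - C(t_{j-l'+1}\dots t_j) = C(t_{j-l+1}\dots t_{j-l'}),\\
&\text{so the leading } l - l' \text{ characters also form a composition match, and}\\
&cm(l) \le cm(l') + cm(l - l').
\end{align*}
```

Repeating the split on the leading part replaces any composition match by a chain of shortest ones without lowering the score, so the dynamic program needs to consider only one length per cell.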


The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter is an upper bound on the length l of the longest single composition match. It can be used to prevent excessive matching.


Investigation of the limit parameter

Sequence length   1     2     3     4     5     6     7     8     9     10
Binary (%)        50.0  62.5  68.7  72.7  75.6  77.3  78.9  80.3  81.3  82.4
DNA (%)           25.0  30.0  32.3  35.3  37.5  39.7  40.7  42.4  43.3  44.2

Percent of characters counted as matching. Sequence length = limit. Ungapped aligned iid sequences.
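For very short sequences the expected percentage can be computed exactly by enumerating every pair of sequences. The sketch below assumes that "characters counted as matching" means the maximum, over all ways of cutting the ungapped pair into segments, of the characters covered by composition matches and single character matches; under that reading it reproduces the first three binary entries. The function names are illustrative.

```python
from collections import Counter
from itertools import product

def max_matched(s, t, limit):
    """Most characters coverable by matches in an ungapped pair of equal length."""
    n = len(s)
    best = [0] * (n + 1)
    for p in range(1, n + 1):
        # Position p is either a single character match or a mismatch...
        best[p] = best[p - 1] + (s[p - 1] == t[p - 1])
        # ...or the end of a composition match of length l <= limit.
        for l in range(2, min(p, limit) + 1):
            if Counter(s[p - l:p]) == Counter(t[p - l:p]):
                best[p] = max(best[p], best[p - l] + l)
    return best[n]

def expected_percent(alphabet, length):
    """Exact average over all equally likely pairs, with limit = sequence length."""
    seqs = list(product(alphabet, repeat=length))
    total = sum(max_matched(s, t, length) for s in seqs for t in seqs)
    return 100.0 * total / (len(seqs) ** 2 * length)

print(expected_percent("01", 1))  # 50.0
print(expected_percent("01", 2))  # 62.5
print(expected_percent("01", 3))  # 68.75
```

The enumeration grows as |Σ|^(2 · length), so it is practical only for the first few columns; longer sequences would need sampling.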


Investigation of the limit parameter

Percent of characters matching. Sequence length = 100. Limit < sequence length. Ungapped aligned iid sequences.

limit                         1     2     5     9     10
DNA (all letters p = 0.25)    25    33.7  44.4  50    51

Note that at limit < 9, more than half the aligned letters are expected to form a mismatch.

Example of an ungapped aligned pair:

C T G G C T A A T
C A G C C G G G G


Logarithmic and linear growth of alignment scores

For standard local alignment, the parameter space for match and mismatch weights is divided into logarithmic and linear regions.

In the logarithmic region, alignment scores grow in proportion to the log of the sequence lengths.

In the linear region, alignment score growth is directly proportional to sequence length.

It is generally accepted that weight combinations that fall within the logarithmic region are useful for detecting biologically related sequences.


Predicting local alignment scores with global alignment scores

The rule for determining if parameter weights fall within the logarithmic region is to look at:

• the expected score per aligned letter pair (ungapped alignments)
• the expected global alignment score (gapped alignments)

For composition alignment, these rules do not apply.
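For comparison, the first quantity is easy to work out in the standard ungapped case. The calculation below is an illustrative sketch using the weights quoted in the figure captions of this talk (match = 2, mismatch = -3) and iid DNA with p = 0.25 per letter, so that two aligned letters match with probability 4 · 0.25² = 0.25.

```latex
\begin{align*}
E[\text{score per aligned letter pair}]
  &= p_{\text{match}} \cdot \text{match} + (1 - p_{\text{match}}) \cdot \text{mismatch} \\
  &= 0.25 \cdot 2 + 0.75 \cdot (-3) \\
  &= -1.75
\end{align*}
```

A negative expected score per pair is the usual sign that standard ungapped weights lie in the logarithmic region.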


Global score as a predictor of local parameter suitability

[Figure: Average Global Composition Alignment Scores: DNA Sequences, Function 1. Score (axis from -400 to 100) against Sequence Length (100 to 900), with one curve each for Limit = 2, 3, 4 and 5.]

Scoring function: cm(k) = ck; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5


Growth of local alignment score

[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 1. Score (axis from 0 to 120) against Sequence Length (axis marked 100, 200, 400, 800, 1000), with one curve each for Limit = 2, 3 and 4.]

Scoring function: cm(k) = ck; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5


Global score as a predictor of local parameter suitability

[Figure: Global Composition Alignment Scores: DNA Sequences, Function 2. Score (axis from -200 to 0) against Sequence Length (0 to 900), with curves labeled 10, 20, 30 and 50.]

Scoring function: cm(k) = c√k; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5


Growth of local alignment score

[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 2. Score (axis from 0 to 100) against Sequence Length (axis marked 100, 200, 400, 800, 1000), with curves labeled 50, 30, 20, 10 and 6.]

Scoring function: cm(k) = c√k; characters generated iid p = 0.25; Composition match (c) = 2; match = 2; mismatch = -3; Indel = -5


Limit values for DNA

• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 3: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.


Biological examples

Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.

Two local alignment scores were produced using function 1:

• W using composition alignment
• S using standard alignment.

The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.


Example 1

Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.

Composition Alignment:

GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC

Standard Alignment:

CGCCGCCGCCG
CGCCGCCGCCG

Two genes, Vdac and Bcl2. Vdac forms a channel through mitochondrial membranes for small molecules. Bcl2 regulates cell death by controlling mitochondrial membrane permeability.


Example 2

Composition alignment of two promoter sequences. Composition changes at the vertical line on the slide.

Composition (A, C, G, T):

Left: (0.01, 0.61, 0.30, 0.08)
Right: (0.19, 0.16, 0.56, 0.09)

GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG

Two genes, EP73298, EP11149. Function of these genes is not known.


Conclusion

We

• define a new alignment problem based on composition matching and test several scoring functions

• show how to find all-pairs shortest composition match lengths in constant time per pair for a fixed alphabet

• show that alignment with scoring functions based only on match length (additive or sub-additive) requires finding only the shortest composition matches

• give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments


Composition Patterns

Goal:

Identify features in sequences that are defined by character composition rather than position specific patterns.

Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then,

$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \dots, f_{\sigma_{|\Sigma|}})$

is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.


The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter, an upper bound on the length l of the longest single composition match, can be used to prevent excessive matching.

Sequence length = 100, randomly generated

limit                         1     2     5     10
DNA (all letters p = 0.25)    25    33.7  44.4  51
