the minisatellite transformation problem: the run-length-encoding

26
The Minisatellite Transformation Problem: The Run-Length-Encoding Approach and Further Enhancements Behshad Behzadi & Jean-Marc Steyaert, Ecole Polytechnique Mohamed Abouelhoda, Cairo University Robert Giegerich, Bielefeld University

Upload: others

Post on 12-Sep-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Minisatellite Transformation Problem: The Run-Length-Encoding

The Minisatellite Transformation Problem:

The Run-Length-Encoding Approach

and Further Enhancements

Behshad Behzadi & Jean-Marc Steyaert,

Ecole Polytechnique

Mohamed Abouelhoda, Cairo University

Robert Giegerich, Bielefeld University

Page 2: The Minisatellite Transformation Problem: The Run-Length-Encoding

Biology…

Minisatellites consist of tandem arrays of

short repeat units found in genome of

most higher eukaryotes.

High degree of polymorphism at

minisatellites has applications from

forensic studies to the investigation of the

origins of modern human groups.

Page 3: The Minisatellite Transformation Problem: The Run-Length-Encoding

…Biology…

These repeats are called variants.

MVR-PCR is designed to find the variants.

As an example, MSY1 is the minisatellite

on the human Y-chromosomes. There arefive different repeats (variants) in MSY1.

Page 4: The Minisatellite Transformation Problem: The Run-Length-Encoding

Different Repeat Types (Variants) of MSY1

Map Types:

Distance between types:

Page 5: The Minisatellite Transformation Problem: The Run-Length-Encoding

Minisatellite Maps: The MSY1 Dataset

• Example Maps from the MSY1 Dataset:

DNA Sequence: … CGGCGAT CGGCGAC CGGCGAC CGGCGAC CGGAGAT…

Unit types (Alphabet): X= CGGCGAT Y= CGGCGAC Z= CGGAGAT

Minisatallite Map: XYYYZ

Page 6: The Minisatellite Transformation Problem: The Run-Length-Encoding

Evolution Mechanism of Minisatellites

The unequal crossover is a possible mechanism for

tandem duplication:

s1 s2 s3 s4

s1 s2 s3 s4

s3 s4

s3 s4

s1 s2 s3 s4

s1 s2 s3 s4

s3 s4

s3 s4s2

Page 7: The Minisatellite Transformation Problem: The Run-Length-Encoding

Evolutionary Operations

Insertion

Deletion

Mutation

Amplification (p-plication)

Contraction (p-contraction)

Page 8: The Minisatellite Transformation Problem: The Run-Length-Encoding

Examples of operations

Insertion of d

abbc abbdc

Deletion of c

abbcb abbb

Mutation of c into d

caab daab

4-plication of c

abcb abccccb

2-contraction of b

abbc abc

Page 9: The Minisatellite Transformation Problem: The Run-Length-Encoding

Cost Functions

Page 10: The Minisatellite Transformation Problem: The Run-Length-Encoding

Hypotheses

All the costs are positive.

The cost of duplications (and

contractions) is less than all other

operations.

Triangle inequality holds:

M(x,y)+M(y,z) <= M(x,z) ; M(x,x) = 0

Page 11: The Minisatellite Transformation Problem: The Run-Length-Encoding

Transformation distance

between s and t

Applying a sequence of operations on s

transforming it into t.

The cost of a transformation is the sum of

costs of its operations.

TD = Minimum cost for a possible

transformation of s into t.

Any transformation which gives this

minimum is called an optimal

transformation.

Page 12: The Minisatellite Transformation Problem: The Run-Length-Encoding

Previous Works

Bérard & Rivals (RECOMB’02)

Behzadi & Steyaert (CPM’03, JDA’04)

Behzadi & Steyaert (WABI'04)

Page 13: The Minisatellite Transformation Problem: The Run-Length-Encoding

Generation vs. Reduction

• The symbols of s which generate a

non-empty substring of t are calledgenerating symbols.

Other symbols of s are vanishingsymbols. (These symbols are eliminatedduring the transformation by a deletion orcontraction.)

The transformation of symbol x into

non-empty string s is called generation.

The transformation of a non-empty string s

into a unique symbol x is called reduction.

Page 14: The Minisatellite Transformation Problem: The Run-Length-Encoding

The Generation x zbxxyb

The optimal generation of a non-empty string s

from a symbol x can be achieved by a non-

d i ti

Page 15: The Minisatellite Transformation Problem: The Run-Length-Encoding

The schema for an optimal transformation

There exists an optimal transformation of s into t in

which all the contractions are done before all

amplifications.

Page 16: The Minisatellite Transformation Problem: The Run-Length-Encoding

Run-Length Encoding and Run

Generation

The RLE encoding of

is .

The lengths of the encoded strings with

length n and m is denoted by m' and n'.

There exists an optimal generation of a

non-empty string t from a single symbol x

in which for every run of size k > 1 in t the

k-1 right symbols of the run are generated

by duplications of the leftmost symbol of

the run

Page 17: The Minisatellite Transformation Problem: The Run-Length-Encoding

Preprocessing --> Core algorithm

Compute the generation cost of all

substrings of the target string t from any

symbol x of the alphabet: G(t)[x,i,j]

Compute the optimal generation/reduction

costs over the substrings by recurrence

using dynamic programming.

The running time is given by:

O((m'3+n'3)|Alpha|+mn'2+nm'3+mn)

Page 18: The Minisatellite Transformation Problem: The Run-Length-Encoding

A different look at Duplication History

s1 s2 s3 s4 s5 s6 s7 s8

observed

s3 s4 s6s5 s7 s8s1 s2

s3

s1

s2

s6

s4

s5

s7

s8

s3

s3 s6

Right duplication

s4s3 s6

Left duplication

Right duplication

s4s3 s5s6

s4s3 s5s6s1

Left duplication

s4s3 s5s6s1 s2

Right duplication

s4s3 s5s6s1 s2 s7

Right duplication

Right duplications4s3 s5

s6s1 s2 s7s8

Page 19: The Minisatellite Transformation Problem: The Run-Length-Encoding

Alignment of Minisatellite Maps (1)

Example of an alignment:

s1 s2 s3 s4 s5 s6 s7 s8

r1 r2 r3 r4 r5 r6

s1 s2 s3 s4 s5 s6 s8

r1 r2 r3 r4 r5 r6

s7

matches

S

R

The two maps S and RAlignment of S and R

Page 20: The Minisatellite Transformation Problem: The Run-Length-Encoding

Alignment of Minisatellite Maps (2)

s1 s2 s3 s4 s5 s6 s8

r1 r2 r3 r4 r5 r6

s7

matches

Alignment of S and R

S

R

Page 21: The Minisatellite Transformation Problem: The Run-Length-Encoding

Improved Model of Comparison

Left and Right Simultaneous Dups

Example:

:

Bérard et al., Model

S:

R:

There is no rule to allow

simultaneous left/right

duplications in S and R

Our NEW Model

S:

R:

It has less score. Because there

is a rule to allow simultaneous

left/right duplications in S and R

Page 22: The Minisatellite Transformation Problem: The Run-Length-Encoding

Algorithm Layout

Observations:

s1 s2 s3 s4 s5 s6 s8

r1 r2 r3 r4 r5 r6

s7

matches

Alignment of S and R

Therefore:

S

R

Page 23: The Minisatellite Transformation Problem: The Run-Length-Encoding

Finding an Optimal Duplication History

s3 s4 s6s5 s7 s8s1 s2

[s4..s6]

s3

s1

s2

s6

s4

s5

s7

s8

Page 24: The Minisatellite Transformation Problem: The Run-Length-Encoding

Experimental Running Times

Bérard et al.

• MSATcompare is ours

Page 25: The Minisatellite Transformation Problem: The Run-Length-Encoding

Detection of Duplication Bias in MSY1 Dataset

E1: run algorithm allowing left- and right- duplications

EL: allow only left duplications

ER: allow only right duplications

Page 26: The Minisatellite Transformation Problem: The Run-Length-Encoding