multiple sequence composition alignment

Multiple Sequence Composition Alignment. Name: Yip Chi Kin. Date: 21-12-2006



TRANSCRIPT

Page 1: Multiple Sequence Composition Alignment

Multiple Sequence

Composition Alignment

Name: Yip Chi Kin

Date: 21-12-2006

Page 2: Multiple Sequence Composition Alignment

Studied Papers

[SMS03] DCA + Segment-based

[B03] Composition Alignment

[M99] DIALIGN Algorithm

[S98] Divide-and-conquer Alignment

Page 3: Multiple Sequence Composition Alignment

Main Aspects
․Dynamic Programming
․Composition Alignment
․Meta-code MSA
․Simultaneous MSA

Pairwise Library (Global & Local)
Consistency & Ungapped
Divide-and-conquer
Segment-based (Optimal scores)

Page 4: Multiple Sequence Composition Alignment

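Dynamic Programming

DP Matrix and Dot Matrix (figure: C T G A against C T G A)

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} - d \\ S_{i,j-1} - d \end{cases}$$

Edit Graph (figure: C T G against C T A)

matches: edge from $S_{i-1,j-1}$ with weight $s(a_i, b_j)$
deletions: edge from $S_{i-1,j}$ with weight $-d$
insertions: edge from $S_{i,j-1}$ with weight $-d$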

Page 5: Multiple Sequence Composition Alignment

Global Alignment

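Needleman-Wunsch Algorithm

DP matrix (origin at the bottom-right corner; GCATC runs bottom to top, CTTCT runs right to left)

        T    C    T    T    C    -
   C   -2    0   -3   -4   -7  -10
   T   -4   -3   -1   -2   -5   -8
   A   -5   -5   -3   -2   -3   -6
   C   -6   -4   -4   -2   -1   -4
   G   -9   -7   -5   -3   -1   -2
   -  -10   -8   -6   -4   -2    0

GA Results

G C A T C -
- C T T C T

Scoring

$s(a_i, -) = s(-, b_j) = -2$
if $a_i = b_j$ then $s(a_i, b_j) = 1$
if $a_i \neq b_j$ then $s(a_i, b_j) = -1$
$S_{i,0} = -2i, \quad S_{0,j} = -2j$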
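The global alignment on this slide can be reproduced with a short sketch of the Needleman-Wunsch recurrence. The function name `needleman_wunsch` and its parameter names are illustrative assumptions; the scores (match 1, mismatch -1, gap -2) are the ones read off the slide. The matrix is filled from the top-left corner here, which gives the same values as the slide's bottom-right layout.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: S[i][0] = gap * i
        S[i][0] = gap * i
    for j in range(1, m + 1):          # boundary: S[0][j] = gap * j
        S[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # match or mismatch
                          S[i - 1][j] + gap,     # gap in b
                          S[i][j - 1] + gap)     # gap in a
    # Trace back one optimal path from the final cell to the origin.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("GCATC", "CTTCT")
print(score)
print(top)
print(bottom)
```

The score for GCATC against CTTCT is -2, the value in the corner cell of the slide's matrix. Several alignments can share that score, so the traceback may return a different co-optimal alignment from the one drawn on the slide.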

Page 6: Multiple Sequence Composition Alignment

Local Alignment

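Smith-Waterman Algorithm

DP matrix (origin at the bottom-right corner; GAACGGT runs bottom to top, TTTACAGGCAG runs right to left)

       G  A  C  G  G  A  C  A  T  T  T  -
   T   1  3  5  4  3  0  0  0  2  2  2  0
   G   3  2  4  6  5  1  0  0  0  0  0  0
   G   3  1  2  3  4  3  2  0  0  0  0  0
   C   2  1  2  0  1  2  4  0  0  0  0  0
   A   1  3  0  0  1  2  1  2  0  0  0  0
   A   0  2  1  1  0  2  0  2  0  0  0  0
   G   2  0  0  2  2  0  0  0  0  0  0  0
   -   0  0  0  0  0  0  0  0  0  0  0  0

GA Results

- G A A C - G G T - -
T T T A C A G G C A G

Scoring

$s(a_i, -) = s(-, b_j) = -2$
if $a_i = b_j$ then $s(a_i, b_j) = 2$
if $a_i \neq b_j$ then $s(a_i, b_j) = -1$
$S_{i,0} = 0, \quad S_{0,j} = 0$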
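A minimal sketch of the Smith-Waterman recurrence with the scores read off this slide (match 2, mismatch -1, gap -2, zero boundaries) is given here. The function name `smith_waterman` and its parameters are illustrative assumptions; the only change from the global recurrence is that every cell is floored at 0 and the traceback starts at the best cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # boundaries stay at 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    # Trace back from the best cell until a zero cell is reached.
    i, j = bi, bj
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GAACGGT", "TTTACAGGCAG"))
```

For the slide's sequences the best local score is 6, the largest entry of the matrix, and the traceback returns the core `AC-GG` against `ACAGG`.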

Page 7: Multiple Sequence Composition Alignment

MSA Methods
․Consistency-based
․Exact method
․Progressive method
․Iterative method
․Stochastic method
․Hidden Markov method

Page 8: Multiple Sequence Composition Alignment

MSA Concepts

PSAs

C - G T C T
C T G T C C

C - - T G T C C
C G A T A T - T

C G - - - T C T
C G A T A T - T

Trace formulation (figure)

Latter formulation (figure)

Consistency-based method

Page 9: Multiple Sequence Composition Alignment

MSA Results

Aligned regions (figure: the sequences drawn as letter columns; labels: Unrealized, Consistent, Realized)

Results of MSA

G T C C T - - C
G T T C - - - C
A T T - T A G C
G T T C - - - C

Page 10: Multiple Sequence Composition Alignment

Divide-and-conquer

(figure: each sequence S1, S2, S3 is divided at a cut position C1, C2, C3 into a Prefix and a Suffix; the divide step is repeated on the pieces)

Divide

Align optimally

Concatenate

Page 11: Multiple Sequence Composition Alignment

DP Distance

Wopt (prefix)

        -    C    T    A    T    A    C
   -    0    2    4    6    8   10   12
   G    2    1    3    5    7    9   11
   T    4    3    1    3    5    7    9
   A    6    5    3    1    3    5    7
   T    8    7    5    3    1    3    5
   C   10    8    7    5    3    2    3

Wopt (suffix)

        C    T    A    T    A    C    -
   G    3    4    3    4    6    8   10
   T    4    2    3    2    4    6    8
   A    6    4    2    2    2    4    6
   T    8    6    4    2    1    2    4
   C   10    8    6    4    2    0    2
   -   12   10    8    6    4    2    0

$C_{S_1,S_2}[C_1, C_2] = W_{opt}(\text{prefix}) + W_{opt}(\text{suffix}) - W_{opt}(\text{total})$

Sequence: GTTCATGCCAGGTGTAAATC (figure: the sequence cut into a Prefix and a Suffix)

Page 12: Multiple Sequence Composition Alignment

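Additional-cost

$C_{S_1,S_2}[2,2] = W_{opt}[\mathrm{CT}, \mathrm{GT}] + W_{opt}[\mathrm{ATAC}, \mathrm{ATC}] - W_{opt}[\mathrm{CTATAC}, \mathrm{GTATC}] = 1 + 2 - 3 = 0$

$C_{S_1,S_2}[4,3] = W_{opt}[\mathrm{CTAT}, \mathrm{GTA}] + W_{opt}[\mathrm{AC}, \mathrm{TC}] - W_{opt}[\mathrm{CTATAC}, \mathrm{GTATC}] = 3 + 1 - 3 = 1$

Additional-cost matrix

        -    C    T    A    T    A    C
   -    0    3    4    7   11   15   19
   G    3    0    3    4    8   12   16
   T    7    4    0    2    4    8   12
   A   11    8    4    0    1    4    8
   T   15   12    8    4    0    0    4
   C   19   15   12    8    4    1    0

Cost of Diagonal

$C_{S_1,S_2}[1,1] = 0$
$C_{S_1,S_2}[2,2] = 0$
$C_{S_1,S_2}[3,3] = 0$
$C_{S_1,S_2}[4,4] = 0$
$C_{S_1,S_2}[5,4] = 0$
$C_{S_1,S_2}[6,5] = 0$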
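The additional-cost values can be checked with a small sketch. It assumes the distance scoring implied by the DP Distance matrices (match 0, mismatch 1, gap 2); the function names `prefix_costs` and `additional_cost` are hypothetical. Suffix costs are obtained by running the same prefix DP on the reversed sequences.

```python
def prefix_costs(s1, s2, mismatch=1, gap=2):
    """w[i][j] = minimal cost of aligning s1[:i] with s2[:j]."""
    n, m = len(s1), len(s2)
    w = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w[i][0] = i * gap
    for j in range(1, m + 1):
        w[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else mismatch
            w[i][j] = min(w[i - 1][j - 1] + sub,
                          w[i - 1][j] + gap,
                          w[i][j - 1] + gap)
    return w


def additional_cost(s1, s2):
    """C[c1][c2] = Wopt(prefix) + Wopt(suffix) - Wopt(total) for cut (c1, c2)."""
    n, m = len(s1), len(s2)
    fwd = prefix_costs(s1, s2)
    # Aligning the suffixes s1[c1:], s2[c2:] costs the same as aligning
    # the prefixes of the reversed sequences with lengths n - c1 and m - c2.
    rev = prefix_costs(s1[::-1], s2[::-1])
    total = fwd[n][m]
    return [[fwd[c1][c2] + rev[n - c1][m - c2] - total for c2 in range(m + 1)]
            for c1 in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])
```

A cut position with additional cost 0 lies on some optimal alignment path, which is why the cells along the diagonal listed on the slide are all 0.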

Page 13: Multiple Sequence Composition Alignment

Space & Time

'Chain' of boxes along the diagonal, in order to reduce searching time

Full sequence searching

Page 14: Multiple Sequence Composition Alignment

DIALIGN

GA Results

y I A - V L F - A E d
- L A c V I F - G s -
p w d d V T F d A E -

Sequences (shown in each of three figure panels: Consistent diagonals; Non-Consistent (Simultaneous); Non-Consistent (Cross over))

Y I A V L F A E D
L A C V I F G S
P W D D V T F D A E

Page 15: Multiple Sequence Composition Alignment

Weighting

Diagonal Weights

$w(D) = -\log P(l_D, S_D)$

where $l_D$ is the length of diagonal $D$ and $S_D$ is the sum of the similarity values along $D$.

Overlap weighting (figure: diagonals D1 to D5 drawn between the three sequences)

Y I A V L F A Y D D
L A C V I F G S
S W D D V M F Y A E

w(D1) = 1.9
w(D2) = 1.7
w(D3) = 1.5
w(D4) = 2.6
w(D5) = 0.2

Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7

Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
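The two scores on this slide can be recomputed with a small sketch. The weights are the slide's; the conflict relation (D4 cannot be combined with D2 or D3) is an assumption inferred from the two diagonal sets shown, and the function names are hypothetical. The sketch compares an exhaustive search over consistent sets with a greedy pass that takes diagonals in order of decreasing weight.

```python
from itertools import combinations

weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# Assumption: D4 is inconsistent with D2 and with D3 (inferred from the slide).
conflicts = {frozenset(("D4", "D2")), frozenset(("D4", "D3"))}


def is_consistent(diagonals):
    """True if no pair of the given diagonals is in conflict."""
    return not any(frozenset(pair) in conflicts
                   for pair in combinations(diagonals, 2))


def total(diagonals):
    """Sum of the weights of the given diagonals."""
    return sum(weights[d] for d in diagonals)


def best_consistent_set():
    """Exhaustive search for the consistent set with the largest total weight."""
    names = sorted(weights)
    best = ()
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            if is_consistent(subset) and total(subset) > total(best):
                best = subset
    return list(best)


def greedy_set():
    """Add diagonals by decreasing weight, skipping any that conflict."""
    chosen = []
    for d in sorted(weights, key=weights.get, reverse=True):
        if is_consistent(chosen + [d]):
            chosen.append(d)
    return sorted(chosen)


print(greedy_set(), round(total(greedy_set()), 1))
print(best_consistent_set(), round(total(best_consistent_set()), 1))
```

Under the assumed conflicts the greedy pass picks D1, D4 and D5 (4.7), while the exhaustive search finds D1, D2, D3 and D5 (5.3), which matches the two totals on the slide.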

Page 16: Multiple Sequence Composition Alignment

Consistency check (figure, drawn twice: S1 and S2 with positions 1 to 18 and S3 with positions 1 to 16; fragments f1, f2 and f3 connect the sequences)

Overlap weights

Fragments checking

Transitivity frontier [1,9]

Page 17: Multiple Sequence Composition Alignment

Greedy Strategy

Greedy Approach

Tandem duplications

Consistency conflicts

(figures: motif occurrences M1 (1), M1 (2), M2, M2 (1), M2 (2) and M3 on sequences S1, S2 and S3)

Page 18: Multiple Sequence Composition Alignment

Composition Alignment

Single character match

Composition matches

A A C G T C T T T G A G C T C
A G C C T G A C T - G C C T A

Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1

+ + - + - - - + - - -

CM of Prefix Length: 0 0 0 1 2 2 1 2 1 0 -1 0 -1 -2 -3

Matching

Prefix length
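The "CM of Prefix Length" row can be recomputed from the two binary sequences as a running difference: at each prefix length, the number of 1s seen in Sequence #1 minus the number seen in Sequence #2. The sketch below assumes that reading of the slide; the function name `cm_of_prefix_length` is hypothetical.

```python
from itertools import accumulate

seq1 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
seq2 = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def cm_of_prefix_length(x, y):
    """Running sum of (x[k] - y[k]): the composition difference of each prefix."""
    return list(accumulate(a - b for a, b in zip(x, y)))


cm = cm_of_prefix_length(seq1, seq2)
print(cm)

# Prefix lengths at which both prefixes have the same composition (CM = 0).
equal_prefixes = [k for k, value in enumerate(cm, start=1) if value == 0]
print(equal_prefixes)
```

Two prefix lengths with the same CM value enclose a stretch in which both sequences contain the same number of 1s, which is a composition match for a binary alphabet.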

Page 19: Multiple Sequence Composition Alignment

Match Length

(figure: Composition Matching plotted against Prefix length; matched segments are replaced by their lengths, for example "Replaced by 7" and "Replaced by 2")

111010001001101110

1110100 010011011 10

Page 20: Multiple Sequence Composition Alignment

Composition Matching

Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1

CM of Prefix Length (Total=9)

(figure: the sequence pair is shown three times, with regions marked CM = 2, CM = 1, CM = 0 and CM = -1)

Page 21: Multiple Sequence Composition Alignment

Meta-Code

Code about code

(figure: flow between the following boxes)
Meta-Code
Original Code
Mismatch code
Match code
Code for Testing
Input code
Control Rule
Code Reservoir

Page 22: Multiple Sequence Composition Alignment

Reservoir Codes

(flowchart)
Code from S1: store the code in Reservoir S1
Code from S2: store the code in Reservoir S2
If the same code is found in both Reservoir S1 and Reservoir S2, delete these two codes
Reservoir Code (e.g. AGRCT)

Example labels in the figure: Code 'CT'; Code 'G'; Code 'C'; Code 'AG'; Code 'A' in S1; Code 'T' in S2
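The reservoir rule can be sketched as follows, under stated assumptions: each column contributes one code from S1 and one from S2, a code present in both reservoirs is cancelled, and the reservoir code is written as the S1 reservoir, the letter R, then the S2 reservoir. The function name `reservoir_codes` and the alphabetical ordering inside each reservoir are assumptions; the slides do not specify an ordering.

```python
from collections import Counter


def reservoir_codes(s1, s2):
    """Return the reservoir code string after each column of s1 and s2."""
    res1, res2 = Counter(), Counter()   # unmatched codes seen so far
    out = []
    for a, b in zip(s1, s2):
        res1[a] += 1                    # store the code from S1
        res2[b] += 1                    # store the code from S2
        # A code found in both reservoirs is matched: delete one copy of each.
        for code in (a, b):
            if res1[code] > 0 and res2[code] > 0:
                res1[code] -= 1
                res2[code] -= 1
        left = "".join(sorted(res1.elements()))
        right = "".join(sorted(res2.elements()))
        out.append(f"{left}R{right}")
    return out


print(reservoir_codes("TAC", "TTC"))
```

For the columns T/T, A/T and C/C this yields `R`, `ART`, `ART`: the matching T and C codes cancel immediately, while A stays in Reservoir S1 and T stays in Reservoir S2.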

Page 23: Multiple Sequence Composition Alignment

Meta-Code Rule

Meta-code (e.g. AMT)

(flowchart: looping for creating meta-code)
Codes from S1 and S2
Values of r and p
Value of r
If CM length is valid, reservoir code = r, position = p
Copy the codes from S1 and S2, p = p - 1, output meta-code
If reservoir code = r, then stop the looping

Page 24: Multiple Sequence Composition Alignment

CM (Lengths & Codes)

Composition Matching of S1 and S2 in prefix length

S1:        T  A    C    T    C    G      G    A      C
S2:        T  T    C    G    C    C      A    T      C
Length:    0  1    1    1    1    2      1    2      2
MetaCode:  R  ART  ART  ART  ART  AGRCT  GRC  GARTC  GARTC

Reservoir codes in S1 (figure labels): A, A, A, A, AG, A, G, GA, GA, AG, AG

Reservoir codes in S2 (figure labels): T, T, T, T, TC, C, C, CT, CT, TT, CC

Further columns shown in the figure (codes, length, MetaCode): GT, 2, AGRTT; TC, 2, AGRCC; CG, 1, ARC

Page 25: Multiple Sequence Composition Alignment

CM of Metacode

(figure: Composition Matching plotted against Prefix length; segments are labelled with the codes ART, AGRCT, GARTC, ARC and AGRTT, and two segments are marked "Invalid length")

Page 26: Multiple Sequence Composition Alignment

Composition MSA

Composition matching

S1:            T A C G T C G T C G A C
S2:            T T C T G C C C G A T C
Meta-code MSA: T T C TMG GMT C C CMT GMC AMG TMA C
New:           T T C G T C C T C G A C

(figure: S2 shown above S1 with four positions linked by vertical bars)

Page 27: Multiple Sequence Composition Alignment

Fixed Segment

Code catalogue
A = Currency / Cards
B = Stock / Structured P.
C = Unit Trusts / Bonds
D = Insurance / Finance
E = Mortgages / Loans

Time Granularities: 1t1 1t2 1t3 1t4 1t5 (Week #1), 2t1 2t2 2t3 2t4 2t5 (Week #2), 3t1 3t2 …

Branch bank #1: A C B C E B A A E B C A …
Branch bank #2: B E B A A A C E D B E A …
Branch bank #3: A C A A B C E E D B E E …

Segment Length LS = 5

․Weekly behaviour
․Semi-global alignment
․Least overlap problem
․Simple segmentation
․Composition alignment
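The fixed-segment idea can be illustrated with a short sketch: cut each branch's code sequence into weekly segments of length LS = 5 and compare segments by composition, so that the order of codes within a week is ignored. The function names `full_segments` and `same_composition` are hypothetical, and dropping the incomplete trailing segment is an assumption, since the slide's sequences are truncated.

```python
from collections import Counter

branches = {
    "Branch bank #1": "ACBCEBAAEBCA",
    "Branch bank #2": "BEBAAACEDBEA",
    "Branch bank #3": "ACAABCEEDBEE",
}


def full_segments(seq, seg_len=5):
    """Split seq into consecutive segments of seg_len, dropping a short tail."""
    return [seq[k:k + seg_len]
            for k in range(0, len(seq) - seg_len + 1, seg_len)]


def same_composition(x, y):
    """True if x and y contain the same codes with the same counts."""
    return Counter(x) == Counter(y)


weeks = {name: full_segments(seq) for name, seq in branches.items()}
for name, segs in weeks.items():
    print(name, segs)

# Week #2 of branch #1 and week #1 of branch #2 are permutations of each other.
print(same_composition(weeks["Branch bank #1"][1], weeks["Branch bank #2"][0]))
```

Here `BAAEB` and `BEBAA` have the same composition (two A, two B, one E), so they count as a composition match even though no character-by-character alignment of the two segments matches in every position.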

Page 28: Multiple Sequence Composition Alignment

Family Classifications

Fixed-Segment Composition MSA

Branch bank #1: C C B C A D
Branch bank #2: A A B A D C
Branch bank #3: A B A A B B

Composition alignment: Family, Group

Meta-Code (Branch bank #1, Branch bank #2) and Meta-Code (Branch bank #3)

C C B A D C
A A B A D C
A A B A B B

PSA: Family, Group

Page 29: Multiple Sequence Composition Alignment

Further Problems

․Fixed-segment length
․Prior sequence choice
․Speed-up PSAs
․Nos. of Segments/Codes

Meta-Code Composition MSA

Page 30: Multiple Sequence Composition Alignment

Conclusions

․Fixed-segment Composition (Least Overlap Problems)

․Meta-code Approach (Easier Transform Applications)

․Widespread use of MSA (Simultaneous Multiple Sequences)