On the k-Closest Substring and k-Consensus Pattern Problems

Yishan Jiao, Jingyi Xu (Institute of Computing Technology, Chinese Academy of Sciences); Ming Li (University of Waterloo). July 5, 2004.


Page 1: On the  k -Closest Substring and  k -Consensus Pattern Problems

On the k-Closest Substring and k-Consensus Pattern Problems

Yishan Jiao, Jingyi Xu
Institute of Computing Technology, Chinese Academy of Sciences

Ming Li
University of Waterloo

July 5, 2004

Page 2: On the  k -Closest Substring and  k -Consensus Pattern Problems

Outline
• Motivation & background
• Our contributions
  • A PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • A PTAS for the k-Consensus Pattern Problem
• Conclusion

Page 3: On the  k -Closest Substring and  k -Consensus Pattern Problems

Motivation

Given n protein sequences, find a “conserved” region in each sequence:

[Figure: n sequences, each containing conserved regions of length L]

•Red/blue regions are different conserved regions, or motifs.

•They don’t have to be exactly the same.

•They match with higher scores than other regions.

Page 4: On the  k -Closest Substring and  k -Consensus Pattern Problems

Focused problem: the k-Closest Substring Problem (k-CSS)

Given a string set $S = \{s_1, s_2, \ldots, s_n\}$ with each string of length $m$, and an integer $L$, find $k$ center strings $c_1, c_2, \ldots, c_k$ of length $L$, minimizing $d$ such that for every string $s_j \in S$ there is a length-$L$ substring $t_j$ (closest substring) of $s_j$ with $\min_{1 \le i \le k} d(c_i, t_j) \le d$.

We call the solution $(\{c_1, c_2, \ldots, c_k\}, d)$ a k-clustering of $S$, and call the number $d$ the maximum cluster radius of the clustering.

A special case: k = 2.

Page 5: On the  k -Closest Substring and  k -Consensus Pattern Problems

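2-CSS

Input: $S = \{s_1, s_2, \ldots, s_n\}$, $|s_i| = m$.
Output: $\{c_1, c_2\}$, $d$, $\{t_1, t_2, \ldots, t_n\}$, minimizing $d$, where $\min_{i \in \{1, 2\}} d(c_i, t_j) \le d$ for each $t_j$.

[Figure: the strings of $S$ split into $S_A$ and $S_B$. $T_A$: red regions (substrings) $t_i$ of length $L$, clustered around $c_1$. $T_B$: blue regions (substrings) $t_j$ of length $L$, clustered around $c_2$. $d(c_i, t)$ is the distance from a substring to its center.]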
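The k-CSS objective stated on these slides can be made concrete with a small sketch. The helper names (`hamming`, `substrings`, `cluster_radius`) and the toy strings are illustrative assumptions, not taken from the slides. The code only evaluates a given set of centers by brute force; it does not search for optimal centers.

```python
from typing import List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Hamming distance between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def substrings(s: str, L: int) -> List[str]:
    """All length-L substrings of s, from left to right."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]


def cluster_radius(S: Sequence[str], centers: Sequence[str]) -> Tuple[int, List[str]]:
    """Return (d, [t_1, ..., t_n]): the maximum cluster radius of the given
    centers and one closest substring per input string."""
    L = len(centers[0])
    radius = 0
    closest: List[str] = []
    for s in S:
        # Distance from a candidate substring to its nearest center.
        def to_nearest(t: str) -> int:
            return min(hamming(c, t) for c in centers)

        t_best = min(substrings(s, L), key=to_nearest)
        closest.append(t_best)
        radius = max(radius, to_nearest(t_best))
    return radius, closest


# Hypothetical toy instance with k = 2 and L = 3.
S = ["AACGT", "TTCGA", "GGGTT"]
radius, closest = cluster_radius(S, ["ACG", "GGT"])
print(radius, closest)
```

Evaluating a fixed pair of centers is the easy part; the difficulty addressed by the PTAS is choosing the centers and the substrings at the same time.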

Page 6: On the  k -Closest Substring and  k -Consensus Pattern Problems

Related work

• Closest Substring problem: a PTAS; M. Li et al., JACM 49(2):157-171, 2002.
• Hamming Radius O(1)-clustering problem (O(1)-HRC): a RPTAS; doctoral dissertation, J. Jansson, 2003.

Relationships between the problems:
• k-Closest Substring problem with k = 1: Closest Substring problem.
• k-Closest Substring problem with L = m: Hamming Radius k-clustering problem (HRC).
• Closest Substring problem with L = m: Closest String problem.
• Geometric k-center problem: the geometric counterpart of HRC.

Page 7: On the  k -Closest Substring and  k -Consensus Pattern Problems

Outline
• Motivation & background
• Our contributions
  • A PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • A PTAS for the k-Consensus Pattern Problem
• Conclusion

Page 8: On the  k -Closest Substring and  k -Consensus Pattern Problems

The PTAS for k-CSS

Difficulties:
• How to choose n closest substrings?
• How to partition strings into k sets accordingly?

Method:
• Extend the random sampling strategy in [M. Li et al., JACM 49(2):157-171, 2002].
• Construct h to approximate the Hamming distance.

Result:
• A PTAS for O(1)-CSS.

Page 9: On the  k -Closest Substring and  k -Consensus Pattern Problems

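P-Q decomposition

[Figure: the strings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ and the candidate center $c'_1$ aligned over $L$ positions, which are split into $Q$ and $P$, with $R$ sampled inside $P$.]

• $Q$: the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
• $P = \{1, 2, \ldots, L\} \setminus Q$.
• $R \subseteq P$: $O(\log(mn))$ random positions in $P$.
• $c'|_P = c|_P$.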
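A minimal sketch of the P-Q decomposition may help here. The function names and the toy strings are hypothetical, positions are 0-based (the slides use 1-based positions), and sampling with replacement is an assumption made because the deck treats position sets as multisets.

```python
import random
from typing import List, Sequence, Tuple


def pq_decomposition(ts: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Split positions 0..L-1 into Q (all strings agree) and P (the rest)."""
    L = len(ts[0])
    Q = [p for p in range(L) if all(t[p] == ts[0][p] for t in ts)]
    agree = set(Q)
    P = [p for p in range(L) if p not in agree]
    return Q, P


def sample_positions(P: Sequence[int], size: int, rng: random.Random) -> List[int]:
    """Draw `size` positions from P uniformly at random, with replacement."""
    if not P:
        return []
    return sorted(rng.choice(P) for _ in range(size))


# Hypothetical r = 3 strings of length L = 4.
ts = ["ACGT", "ACCT", "AGGT"]
Q, P = pq_decomposition(ts)
R = sample_positions(P, 5, random.Random(0))
print(Q, P, R)
```

In the scheme, the sample size would be chosen as $O(\log(mn))$; the constant 5 above is only for illustration.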

Page 10: On the  k -Closest Substring and  k -Consensus Pattern Problems

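P-Q decomposition

Lemma 1: $|P_{i_1, i_2, \ldots, i_r}| \le r \, d_{opt}$.

Lemma 2: Given a set $T = \{t_1, t_2, \ldots, t_n\}$ of strings, each of length $L$. For any constant $r$, $2 \le r < n$, there are $r$ strings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ (allowing repeats) in $T$ and a center string $c'$ such that for any $1 \le l \le n$,

$$d(c', t_l) \le \left(1 + \frac{1}{2r - 1}\right) d_{opt},$$

where $c'|_{Q_{i_1, i_2, \ldots, i_r}} = t_{i_1}|_{Q_{i_1, i_2, \ldots, i_r}}$ and $c'|_{P_{i_1, i_2, \ldots, i_r}} = c|_{P_{i_1, i_2, \ldots, i_r}}$.

Thus, by trying all possible length-$L$ substrings in $S$, we can get the $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ ($t_{j_1}, t_{j_2}, \ldots, t_{j_r}$) satisfying Lemma 2.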
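The "try all possible length-L substrings" step can be sketched as an exhaustive enumeration of r-tuples with repeats. The function name `candidate_tuples` and the de-duplication of the substring pool are assumptions for illustration; for constant r the number of tuples is at most $(n(m - L + 1))^r$, which is polynomial in the input size.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple


def substring_pool(S: Sequence[str], L: int) -> List[str]:
    """All distinct length-L substrings occurring in any string of S."""
    return sorted({s[i:i + L] for s in S for i in range(len(s) - L + 1)})


def candidate_tuples(S: Sequence[str], L: int, r: int) -> Iterator[Tuple[str, ...]]:
    """Every r-tuple (repeats allowed) of length-L substrings in S."""
    return product(substring_pool(S, L), repeat=r)


# Hypothetical toy instance: n = 2, m = 3, L = 2, r = 2.
tuples = list(candidate_tuples(["ACG", "CGT"], 2, 2))
print(len(tuples))
```

One of the enumerated tuples is guaranteed by Lemma 2 to be a good one; the scheme does not know which, so it keeps the best result over all of them.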

Page 11: On the  k -Closest Substring and  k -Consensus Pattern Problems

Random sampling strategy

The random sampling strategy $R_1$ ($R_2$): randomly pick $O(\log(mn))$ positions from $P_1$ ($P_2$).

By the definition in Lemma 2, let $c'$ be $c'_1$ ($c'_2$):
1. $c'|_{Q_{i_1, i_2, \ldots, i_r}}$: the string where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
2. $c'|_{P_{i_1, i_2, \ldots, i_r}}$: unknown. We know nothing about the optimal $c$!

Page 12: On the  k -Closest Substring and  k -Consensus Pattern Problems

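Random sampling strategy

Lemma 3: Let $U = \{u \mid u \text{ is a length-}L \text{ substring of } s_j,\ s_j \in S\}$. With high probability (at least $1 - 4(mn)^{-1/3}$), for each $u \in U$,

$$|d(u, c'_1) - h(u, c'_1, Q_1, R_1)| \le \varepsilon |P_1|,$$
$$|d(u, c'_2) - h(u, c'_2, Q_2, R_2)| \le \varepsilon |P_2|.$$

That is, $h$ approximates the Hamming distance well, where

$$h(o, c, Q, R) = d(o|_Q, c|_Q) + \frac{|P|}{|R|} \, d(o|_R, c|_R).$$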
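As a hedged illustration of the estimator $h$: it counts mismatches exactly on $Q$ and scales the mismatches seen on the sample $R$ by $|P|/|R|$. The function names, the 0-based positions, and the toy strings are assumptions; `p_size` stands for $|P|$ and is passed explicitly because $R$ alone does not determine it.

```python
from typing import Sequence


def mismatches(a: str, b: str, positions: Sequence[int]) -> int:
    """Number of listed positions (repeats counted) where a and b differ."""
    return sum(a[p] != b[p] for p in positions)


def h(o: str, c: str, Q: Sequence[int], R: Sequence[int], p_size: int) -> float:
    """Estimate d(o, c): exact on Q, extrapolated from the sample R on P."""
    exact = mismatches(o, c, Q)
    if not R:
        # Nothing sampled, so only the exact part is available.
        return float(exact)
    return exact + p_size / len(R) * mismatches(o, c, R)


# Hypothetical strings that differ at positions 2 and 4.
o, c = "ACGTAC", "ACCTGC"
Q, P = [0, 1, 5], [2, 3, 4]
print(h(o, c, Q, P, len(P)))    # sampling all of P recovers the true distance
print(h(o, c, Q, [3], len(P)))  # an unlucky sample can underestimate
```

The second call shows why the guarantee in Lemma 3 is probabilistic and needs $O(\log(mn))$ sampled positions rather than a single one.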

Page 13: On the  k -Closest Substring and  k -Consensus Pattern Problems

Scheme of PTAS

1. By trying all possible length-$L$ substrings in $S$, we can get the $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ ($t_{j_1}, t_{j_2}, \ldots, t_{j_r}$) satisfying Lemma 2. Thus get $c'_1|_{Q_1}$ and $c'_2|_{Q_2}$.
2. Trying all possibilities on $O(\log(mn))$ random positions in $P$. Thus get $c'_1|_{R_1}$ and $c'_2|_{R_2}$.
3. Partition each $s_j \in S$ into $C'_1$ if

$$\min_{\{\text{any length-}L \text{ substring } t \text{ of } s_j\}} h(t, c'_1, Q_1, R_1) \le \min_{\{\text{any length-}L \text{ substring } t \text{ of } s_j\}} h(t, c'_2, Q_2, R_2),$$

   and into $C'_2$ otherwise.
4. Get closest substrings $t'_j$ for each $s_j \in S$, and $T'_1$, $T'_2$ accordingly.
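Steps 3 and 4 of the scheme can be sketched as follows, assuming the candidate centers are already known on $Q$ and $R$. All names are hypothetical, positions are 0-based, and each center is bundled as a tuple `(c, Q, R, p_size)` purely for convenience; ties go to the first cluster, matching the "≤" in step 3.

```python
from typing import List, Sequence, Tuple

Center = Tuple[str, Sequence[int], Sequence[int], int]  # (c, Q, R, |P|)


def mismatches(a: str, b: str, positions: Sequence[int]) -> int:
    return sum(a[p] != b[p] for p in positions)


def h(o: str, c: str, Q: Sequence[int], R: Sequence[int], p_size: int) -> float:
    """Exact mismatches on Q plus scaled mismatches on the sample R."""
    exact = mismatches(o, c, Q)
    return float(exact) if not R else exact + p_size / len(R) * mismatches(o, c, R)


def best_substring(s: str, center: Center) -> Tuple[float, str]:
    """Minimum of h over all length-L substrings of s, with a minimizer."""
    c, Q, R, p_size = center
    L = len(c)
    return min((h(s[i:i + L], c, Q, R, p_size), s[i:i + L])
               for i in range(len(s) - L + 1))


def partition(S: Sequence[str], center1: Center, center2: Center):
    """Assign each string to the cluster whose center looks closer under h."""
    C1: List[Tuple[str, str]] = []
    C2: List[Tuple[str, str]] = []
    for s in S:
        f1, t1 = best_substring(s, center1)
        f2, t2 = best_substring(s, center2)
        if f1 <= f2:
            C1.append((s, t1))
        else:
            C2.append((s, t2))
    return C1, C2


# Hypothetical centers of length 4 with Q = {0, 1}, R = P = {2, 3}.
c1: Center = ("AAAA", [0, 1], [2, 3], 2)
c2: Center = ("TTTT", [0, 1], [2, 3], 2)
C1, C2 = partition(["CAAAAC", "GTTTTG", "AAATTTT"], c1, c2)
print(C1, C2)
```

The second components of the returned pairs play the role of the closest substrings $t'_j$, from which $T'_1$ and $T'_2$ are read off.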

Page 14: On the  k -Closest Substring and  k -Consensus Pattern Problems

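Scheme of PTAS

5. Get the final approximating center strings. We can solve the optimization problem

$$\min d;$$
$$d(t'_j|_{P_1}, x_1) \le d - d(t'_j|_{Q_1}, t_{i_1}|_{Q_1}), \quad t'_j \in T'_1; \quad |x_1| = |P_1|.$$
$$d(t'_j|_{P_2}, x_2) \le d - d(t'_j|_{Q_2}, t_{j_1}|_{Q_2}), \quad t'_j \in T'_2; \quad |x_2| = |P_2|.$$

within error $\varepsilon r d_{opt}$ by applying the method for the Closest String problem ([LMW02]) to the sets $T'_1$ and $T'_2$ respectively.

Get a solution $(c_1|_{P_1}, c_2|_{P_2}) = (x_1, x_2)$ such that

$$d \le \left(1 + \frac{1}{2r - 1} + 2 \varepsilon r\right) d_{opt}.$$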

Page 15: On the  k -Closest Substring and  k -Consensus Pattern Problems

Sum up:

1. Partition $S$ based on $(c'_1, c'_2)$ and $h$, with $|h - d| \le \varepsilon r d_{opt}$ [Lemma 3 & 1].
2. $c'_1, c'_2$ approximate the optimal $c_1, c_2$ [Lemma 1].
3. Combination of a combinatorial argument (P-Q decomposition) with the random sampling strategy.

Page 16: On the  k -Closest Substring and  k -Consensus Pattern Problems

Outline
• Motivation & background
• Our contributions
  • A PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • A PTAS for the k-Consensus Pattern Problem
• Conclusion

Page 17: On the  k -Closest Substring and  k -Consensus Pattern Problems

The NP-hardness of (2-ε)-approximation of the HRC problem

Main ideas:
• Given any instance $G = (V, E)$ of the Vertex Cover problem, $|V| = n$, $|E| = m'$.
• Construct an instance $\langle S, k \rangle$ of the Hamming radius k-clustering problem, which has a k-clustering with the maximum cluster radius not exceeding 2 if and only if $G$ has a vertex cover with $k - m'$ vertices.

Reduction: vertex cover problem → (2-ε)-approximation of the HRC problem.

Page 18: On the  k -Closest Substring and  k -Consensus Pattern Problems

$S = \{ s_{ij}^{1}, s_{ij}^{3}, s_{ij}^{5} \mid v_i v_j \in E \}$; for any distinct $x, y \in S$, $H(x, y) = 4$ or $8$.

Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.

Page 19: On the  k -Closest Substring and  k -Consensus Pattern Problems

We can prove: given $k \le 2m'$, $k - m'$ vertices in $V$ can cover $E$ if and only if there is a k-clustering of $S$ with the maximum cluster radius equal to 2.

If there were a polynomial algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, the exact vertex cover number of any instance $G$ could be computed in polynomial time.

This is a contradiction unless P = NP.

Page 20: On the  k -Closest Substring and  k -Consensus Pattern Problems

Outline
• Motivation & background
• Our contributions
  • A PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • A PTAS for the k-Consensus Pattern Problem
• Conclusion

Page 21: On the  k -Closest Substring and  k -Consensus Pattern Problems

Conclusion
• A nice combination of a combinatorial argument (P-Q decomposition) with the random sampling strategy in solving the k-CSS problem.
• An alternative and direct proof of the NP-hardness of (2-ε)-approximation of the HRC problem.

Page 22: On the  k -Closest Substring and  k -Consensus Pattern Problems

Contact Us

Authors:
• Yishan Jiao, Jingyi Xu: {jys,xjy}@ict.ac.cn. Bioinformatics lab, Institute of Computing Technology, Chinese Academy of Sciences.
• Ming Li: [email protected]. University of Waterloo.

Page 23: On the  k -Closest Substring and  k -Consensus Pattern Problems


Thank You!

Page 24: On the  k -Closest Substring and  k -Consensus Pattern Problems

Outline
• Motivation & background
• Our contributions
  • The PTAS for the k-Closest Substring Problem
  • The NP-hardness of (2-ε)-approximation of the HRC problem
  • The PTAS for the k-Consensus Pattern Problem
• Conclusion

Page 25: On the  k -Closest Substring and  k -Consensus Pattern Problems

Deterministic PTAS for O(1)-Consensus Pattern problem (1)

k-Consensus Pattern problem

Most related works:
• The Hamming O(1)-median clustering problem: the O(1)-Consensus Pattern problem when L = m. A RPTAS; R. Ostrovsky et al., JACM 49(2):139-156, 2002.
• The Consensus Pattern problem: the k-Consensus Pattern problem when k = 1. A PTAS; M. Li et al., STOC'99.

We give a deterministic PTAS for the O(1)-Consensus Pattern problem, together with a proof.

Page 26: On the  k -Closest Substring and  k -Consensus Pattern Problems

DPTAS for O(1)-CP (1)

Outline:
1. Suppose in the optimal solution: $(\{c_1, c_2\}, \{t_1, t_2, \ldots, t_n\}, \{C_1, C_2\})$, where $C_1$, $C_2$ are instances of the Consensus Pattern problem.
2. Trying all possibilities, get … and … satisfying Lemma 3 in M. Li et al., STOC'99.

Page 27: On the  k -Closest Substring and  k -Consensus Pattern Problems

DPTAS for O(1)-CP (2)

Outline:
3. Get $c'_1$, $c'_2$:
   • $c'_1$: the column-wise majority string of …
   • $c'_2$: the column-wise majority string of …
4. Partition each … into $C'_1$, $C'_2$ as follows: … otherwise.
5. Get closest substrings ($t'_l$) in $T'_1$, $T'_2$ satisfying …

Page 28: On the  k -Closest Substring and  k -Consensus Pattern Problems

DPTAS for O(1)-CP (3)

Outline:
6. Get a good approximation solution …, where $c''_1$, $c''_2$ are the column-wise majority strings of all strings in $T'_1$, $T'_2$ respectively.
7. Conclusion: output a solution in polynomial time with total cost at most …
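The column-wise majority string used in these steps can be sketched in a few lines. The function names and the toy strings are illustrative assumptions; ties in a column are broken by first occurrence, which is a choice of this sketch and not something the slides specify. The cost shown is the sum of Hamming distances, which is what the majority string minimizes column by column.

```python
from collections import Counter
from typing import Sequence


def column_majority(strings: Sequence[str]) -> str:
    """Most frequent character in each column of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


def total_cost(center: str, strings: Sequence[str]) -> int:
    """Sum of Hamming distances from center to every string."""
    return sum(sum(x != y for x, y in zip(center, t)) for t in strings)


# Hypothetical cluster of three substrings.
T = ["ACGT", "ACGA", "TCGA"]
center = column_majority(T)
print(center, total_cost(center, T))
```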

Page 29: On the  k -Closest Substring and  k -Consensus Pattern Problems

PTAS for 2-Consensus Pattern problem

Page 30: On the  k -Closest Substring and  k -Consensus Pattern Problems

Definition of PTAS

A family of approximation algorithms for problem P, $\{A_\varepsilon\}_\varepsilon$, is called a polynomial (time) approximation scheme or PTAS, if algorithm $A_\varepsilon$ is a $(1 + \varepsilon)$-approximation algorithm and its running time is polynomial in the size of the input for a fixed $\varepsilon$.

Page 31: On the  k -Closest Substring and  k -Consensus Pattern Problems

Vertex-cover problem

• Vertex cover: given an undirected graph $G = (V, E)$, a subset $V' \subseteq V$ such that if $(u, v) \in E$, then $u \in V'$ or $v \in V'$ (or both).
• Size of a vertex cover: the number of vertices in it.
• Vertex-cover problem: find a vertex cover of minimal size.
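The definition can be checked directly in code. This is a brute-force sketch with hypothetical function names and a toy path graph; it is exponential in the number of vertices and only meant to illustrate the definition, not to be an efficient algorithm.

```python
from itertools import combinations
from typing import Iterable, Sequence, Set, Tuple

Edge = Tuple[int, int]


def is_vertex_cover(edges: Iterable[Edge], cover: Iterable[int]) -> bool:
    """True if every edge has at least one endpoint in cover."""
    chosen = set(cover)
    return all(u in chosen or v in chosen for u, v in edges)


def minimum_vertex_cover(vertices: Sequence[int], edges: Sequence[Edge]) -> Set[int]:
    """Smallest vertex cover, found by trying subsets in order of size."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, candidate):
                return set(candidate)
    return set(vertices)


# Hypothetical path graph 1 - 2 - 3 - 4.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4)]
best = minimum_vertex_cover(V, E)
print(best)
```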

Page 32: On the  k -Closest Substring and  k -Consensus Pattern Problems

Vertex-cover problem

• The vertex-cover problem is NP-complete. (See section 34.5.2.)
  • Vertex-cover belongs to NP.
  • Vertex-cover is NP-hard (CLIQUE $\le_P$ vertex-cover).
• Reduce $\langle G, k \rangle$, where $G = \langle V, E \rangle$, of a CLIQUE instance to $\langle G', |V| - k \rangle$, where $G' = \langle V, E' \rangle$ and $E' = \{(u, v) : u, v \in V,\ u \ne v \text{ and } \langle u, v \rangle \notin E\}$, of a vertex-cover instance.
• So find an approximate algorithm.
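The CLIQUE to vertex-cover reduction amounts to building the complement graph and changing the target size. The sketch below uses a hypothetical function name and toy graph; it treats edges as unordered pairs and assumes a simple graph without self-loops.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def clique_to_vertex_cover(vertices: Sequence[int], edges: Sequence[Edge],
                           k: int) -> Tuple[List[Edge], int]:
    """Map a CLIQUE instance <G, k> to a vertex-cover instance <G', |V| - k>,
    where G' is the complement graph of G."""
    present = {frozenset(e) for e in edges}
    complement = [(u, v) for u, v in combinations(sorted(vertices), 2)
                  if frozenset((u, v)) not in present]
    return complement, len(vertices) - k


# Hypothetical graph: a triangle {1, 2, 3} plus an isolated vertex 4.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3)]
E_prime, target = clique_to_vertex_cover(V, E, 3)
print(E_prime, target)
```

Here the clique {1, 2, 3} of size 3 corresponds to the vertex cover {4} of size 1 in the complement graph.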


Page 34: On the  k -Closest Substring and  k -Consensus Pattern Problems

Conclusion for the approximation solution

Outline:
• Get a good approximation solution …, where …
• 10. Conclusion: outputs $(c''_1, c''_2)$ in polynomial time, satisfying with high probability: …
• Can be derandomized by the standard method [MR95].
• Extension to the k = O(1) case: trivial.

Page 35: On the  k -Closest Substring and  k -Consensus Pattern Problems

PTAS for 2-CSS

Page 36: On the  k -Closest Substring and  k -Consensus Pattern Problems

Notation

1. $s, t, c, o$: strings, $|s| = |t| = m$, $|c| = |o| = L$.
2. Position set: a multiset $P = \{j_1, j_2, \ldots, j_k\}$ ($1 \le j_1 \le j_2 \le \ldots \le j_k \le m$).
3. $s|_P$: the string $s[j_1] s[j_2] \ldots s[j_k]$.
4. $d^P(s, t) = d(s|_P, t|_P)$: the Hamming distance between $s|_P$ and $t|_P$.
5. Suppose in the optimal solution for 2-CSS: $(\{c_1, c_2\}, \{t_1, t_2, \ldots, t_n\}, \{T_1, T_2\}, \{S_1, S_2\}, d_{opt})$, where $T_1$, $T_2$ are instances of the Closest String problem.
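The projection notation can be illustrated with a short sketch. The function names are hypothetical; positions are 1-based here to match the notation on this slide, and a position may repeat because $P$ is a multiset.

```python
from typing import Sequence


def project(s: str, P: Sequence[int]) -> str:
    """s|_P for a multiset P of 1-based positions, in the given order."""
    return "".join(s[j - 1] for j in P)


def d_restricted(s: str, t: str, P: Sequence[int]) -> int:
    """Hamming distance between s|_P and t|_P."""
    return sum(a != b for a, b in zip(project(s, P), project(t, P)))


# Hypothetical strings and position multisets.
print(project("ACGTA", [1, 1, 3, 5]))
print(d_restricted("ACGTA", "ACCTT", [2, 3, 3, 5]))
```

A repeated position contributes its mismatch once per occurrence, which is what makes sampling with replacement compatible with this notation.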

Page 37: On the  k -Closest Substring and  k -Consensus Pattern Problems

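P-Q decomposition

[Figure: the strings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ and the candidate center $c'_1$ aligned over $L$ positions, which are split into $Q$ and $P$, with $R$ sampled inside $P$.]

• $Q$: the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree.
• $P = \{1, 2, \ldots, L\} \setminus Q$.
• $R \subseteq P$: $O(\log(mn))$ random positions in $P$.

Distance measures:

$$h(o, c, Q, R) = d(o|_Q, c|_Q) + \frac{|P|}{|R|} \, d(o|_R, c|_R),$$
$$f(s, c, Q, R) = \min_{\{\text{any length-}L \text{ substring } t \text{ of } s\}} h(t, c, Q, R),$$
$$g(s, c) = \min_{\{\text{any length-}L \text{ substring } t \text{ of } s\}} d(t, c).$$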