dna code design for dna computing - icoict...

Optimization approaches for the design of DNA codes, by Roberto Montemanni (IDSIA, Dalle Molle Institute for Artificial Intelligence; University of Lugano; University of Applied Sciences of Southern Switzerland)


Post on 27-May-2020


Page 1: DNA Code Design for DNA Computing - ICoICT 20162016.icoict.org/wp-content/uploads/presentation/Tutorial-Roberto-Mo… · 10 DNA Codes Design Problem description Input data: •The

Optimization approaches for the design of DNA codes

Roberto Montemanni

IDSIA, Dalle Molle Institute for Artificial Intelligence

University of Lugano

University of Applied Sciences of Southern Switzerland


Outline

• Introduction

• The DNA Codes Design problem

• Construction heuristics

• Simple local searches

• A Variable Neighbourhood Search Metaheuristic

• Bibliography



Background: DNA

DNA – The Blueprint of Life

[Figure: pictures of a chimp, a cow, a dinosaur, a bird, a fish, a worm, bacteria, a human and DNA. 9 pictures taken from ClipArt.]


What is DNA?

• All organisms on this planet are made of the same type of genetic blueprint.


Real applications

• DNA computing => using DNA for massively parallel computations

• DNA chemical libraries => for the development and test of new drugs

• DNA microarrays => for profiling genes and tracing genes within long DNA strands

• DNA nanotechnologies => for the development of new materials/devices



DNA, Wikimedia Commons

What is DNA?

• genetic material

• four letter alphabet (nucleotides, bases):

– A (adenine),

– C (cytosine),

– G (guanine),

– T (thymine)

• complementary base pairs CG, AT

• hybridization via base pairing

[Figure: two pairs of strands, each drawn with its 5' and 3' ends. Perfect hybridization: AACGT paired with TTGCA. Imperfect hybridization: ATGGT paired with TTGCA.]

Background: DNA


Modeling

Design Goals

Desired properties

[Figure: two pairs of strands, each drawn with its 5' and 3' ends. Uniform stability: AACGT and TTGCA. Non-interaction: AACGT and CACCC.]

• Desired properties coming from real applications

• Notice that properties are not the same for all applications


DNA Codes Design Problem description

Input data:

• The alphabet {A, C, G, T}

• A fixed length n for the codewords

• A required distance d among codewords (used by the constraints in Z)

• A set Z of constraints (explained in the next slides)

Optimization objective:

• Find the largest possible set of codewords (= code) of length n on the alphabet {A, C, G, T} that is feasible with respect to the constraints in Z (which are based on d)

Why maximize the size of the code? To have more flexibility in the applications seen before!


DNA Codes Design Problem description

Example

Code (solution), with word length n = 8; each entry is a codeword:

AATTCCGG ACCTGATT ATTCCCAG ACCTTTTT

TATATATA CATTCACC GCTTATTC GATTCAAT

TCACCATG CCGTTACA GCGCGCGC CTATTCAC

TTGGCCAA GGCTTTTA CTACTACG

The solution respects a given constraint set Z (we do not know Z at this stage!)


DNA Codes Design Problem description

Requirements of a DNA Code

• Successful, specific hybridization between a DNA codeword and its complement.

• No hybridization between DNA codewords from the same DNA code, or between a DNA codeword and the complement of another codeword.

How do these requirements translate into our constraints set Z?


Constraints considered (set Z):

• Requirement: the distance between two codewords must be large (no hybridization).

• Answer: HD (Hamming Distance)

- Given two codewords w1 and w2

- H(w1, w2) = number of positions i in which the i-th letter of w1 differs from the i-th letter of w2

- example: w1 = GCTA, w2 = ATTA, H(w1, w2) = 2

- Constraint: H(w1, w2) ≥ d
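The HD constraint can be sketched in Python as follows. The function names `hamming` and `satisfies_hd` are illustrative choices, not names taken from the slides.

```python
from itertools import combinations


def hamming(w1, w2):
    """Number of positions in which two equal-length words differ."""
    if len(w1) != len(w2):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(w1, w2))


def satisfies_hd(code, d):
    """True if every pair of distinct codewords is at Hamming distance >= d."""
    return all(hamming(u, v) >= d for u, v in combinations(code, 2))


# The example pair from the slide differs in its first two letters.
print(hamming("GCTA", "ATTA"))
```

With the slide's pair, `satisfies_hd(["GCTA", "ATTA"], 2)` holds, while the same pair would be rejected for d = 3.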

DNA Codes Design Problem description


Constraints considered (set Z):

• Requirement: the number of G or C letters must be the same in each codeword (uniform stability) [=> self-hybridization is likely]

• Answer: GC (GC-content constraint)

- A fixed number of the letters of each word has to be either G or C: floor(n/2) in our case

- example: ATA is not feasible, AGA is feasible
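A minimal sketch of the GC-content check, assuming the required count is floor(n/2) as stated above. The names `gc_content` and `satisfies_gc` are hypothetical.

```python
def gc_content(word):
    """Number of letters of the word that are G or C."""
    return sum(letter in "GC" for letter in word)


def satisfies_gc(word):
    """True if exactly floor(n/2) letters are G or C, where n = len(word)."""
    return gc_content(word) == len(word) // 2


# For n = 3 exactly one letter must be G or C.
for word in ("ATA", "AGA"):
    print(word, gc_content(word), satisfies_gc(word))
```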

DNA Codes Design Problem description


• Requirement: the distance between a codeword and the complement of another codeword must be large.

Watson-Crick complement of a DNA codeword

wcc(w) = Watson-Crick complement of a DNA codeword w, obtained by reversing w and then replacing each A by T (and vice versa) and each C by G (and vice versa)

- example: wcc(ATGC) = GCAT
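The definition translates almost literally into Python. This is a sketch; `COMPLEMENT` is an illustrative helper name, and `wcc` follows the notation of the slide.

```python
# Translation table for the base pairing A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def wcc(word):
    """Watson-Crick complement: reverse the word, then complement each letter."""
    return word[::-1].translate(COMPLEMENT)


print(wcc("ATGC"))  # the slide's example
```

Applying `wcc` twice returns the original word, since both the reversal and the letter swap undo themselves.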

DNA Codes Design Problem description


Constraints considered (set Z):

• Requirement: the distance between a codeword and the complement of another codeword must be large.

• Answer: RC (Reverse Complement Hamming distance)

- Given two codewords w1 and w2

- example: w1 = GCTA, w2 = ATGC: H(GCTA, wcc(ATGC)) = H(GCTA, GCAT) = 2

- Constraint: H(w1, wcc(w2)) ≥ d
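A possible check of the RC constraint is sketched below. The slide does not say whether a codeword must also be far from its own complement (w1 = w2); the sketch includes that case, which is an assumption. The name `satisfies_rc` is hypothetical.

```python
from itertools import combinations_with_replacement

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    """Watson-Crick complement: reverse, then complement each letter."""
    return word[::-1].translate(COMPLEMENT)


def satisfies_rc(code, d):
    """True if H(w1, wcc(w2)) >= d for all pairs, including w1 == w2 (assumed)."""
    return all(hamming(u, wcc(v)) >= d
               for u, v in combinations_with_replacement(code, 2))


print(hamming("GCTA", wcc("ATGC")))  # the slide's example
```

Since H(w1, wcc(w2)) = H(w2, wcc(w1)), each unordered pair only needs to be checked once.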

DNA Codes Design Problem description


Example of a problem and its solution

• Input data: n = 4, d = 3.

• Constraints considered: HD, GC, RC

• Solution: the largest possible code with the characteristics above contains 6 codewords.

Optimal code with respect to the constraints considered (not unique!):

CTTC GGTT GTCA

AGGA ACTG TTGG
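The feasibility of this code can be checked mechanically. The sketch below combines the three constraints (HD, GC, RC) in one hypothetical function `is_feasible`. It assumes that the RC constraint also applies to a codeword and its own complement. It does not prove that 6 is the maximum size, only that the listed code is feasible.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def is_feasible(code, d):
    """Check GC, HD and RC for a whole code (self-complement check assumed)."""
    n = len(code[0])
    for w in code:
        if sum(c in "GC" for c in w) != n // 2:   # GC constraint
            return False
        if hamming(w, wcc(w)) < d:                # RC with w1 == w2 (assumed)
            return False
    for u, v in combinations(code, 2):
        if hamming(u, v) < d:                     # HD constraint
            return False
        if hamming(u, wcc(v)) < d:                # RC constraint
            return False
    return True


code = ["CTTC", "GGTT", "GTCA", "AGGA", "ACTG", "TTGG"]
print(is_feasible(code, 3))
```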


Problem description

Important observation:

• Other kinds of constraints are possible.

• They depend on the real-world application considered.

• In this mini-course we limit ourselves to the constraints on the previous slides.



Construction Heuristics

Construction Heuristic (CH)

All possible codewords with the required GC-content are examined in a given order.

Codewords are incrementally accepted if feasible with respect to the already accepted ones.

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009).

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm. Journal of Math. Modelling and Algorithms 7, 311-326 (2008).

Smith, D.H., Hughes L.A., Perkins S. A new table of constant weight binary codes of length greater than 28. Electron. J. of Combinatorics, 13(1), #A2 (2006).


Construction Heuristics

Example: n = 4, d = 3.

Constraints: HD, GC, RC

Lexicographic order:

AACC AACG AAGC AAGG ACAC ACAG ACCA ACCT ACGA ACGT ACTC ACTG AGAC AGAG

AGCA AGCT AGGA AGGT AGTC AGTG ATCC ATCG ATGC ATGG CAAC CAAG CACA CACT

CAGA CAGT CATC CATG CCAA CCAT CCTA CCTT CGAA CGAT CGTA CGTT CTAC CTAG

CTCA CTCT CTGA CTGT CTTC CTTG GAAC GAAG GACA GACT GAGA GAGT GATC GATG

GCAA GCAT GCTA GCTT GGAA GGAT GGTA GGTT GTAC GTAG GTCA GTCT GTGA GTGT

GTTC GTTG TACC TACG TAGC TAGG TCAC TCAG TCCA TCCT TCGA TCGT TCTC TCTG

TGAC TGAG TGCA TGCT TGGA TGGT TGTC TGTG TTCC TTCG TTGC TTGG

Solution: AACC ACAG AGGA CCTA GTCA
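The example can be reproduced with a short sketch of the construction heuristic. The function names are hypothetical, and the sketch assumes that a codeword is also checked against its own Watson-Crick complement; without that check the self-complementary word CATG would be accepted before CCTA and the result would differ from the slide.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def candidates(n):
    """All words of length n with GC-content floor(n/2), in lexicographic order."""
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return [w for w in words if sum(c in "GC" for c in w) == n // 2]


def compatible(u, v, d):
    """HD and RC constraints between two words."""
    return hamming(u, v) >= d and hamming(u, wcc(v)) >= d


def construction_heuristic(order, d):
    """Scan the words in the given order, keeping each one that stays feasible."""
    code = []
    for w in order:
        if hamming(w, wcc(w)) >= d and all(compatible(w, c, d) for c in code):
            code.append(w)
    return code


print(construction_heuristic(candidates(4), 3))
```

Passing a shuffled or reversed copy of `candidates(4)` gives the random and reverse lexicographic variants of the same method.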


Construction Heuristics

• The method works with any possible order of the candidate codewords (lexicographic, reverse lexicographic, random) => each order in fact gives a different algorithm…

• Computational experiments suggest that random orders tend to give better results on DNA code design problems

• Slow for large problems (all possible codewords have to be examined!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. J. of Math. Modelling and

Algorithms 7, 311-326 (2008).
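To see why examining all possible codewords becomes slow, one can count them. With GC-content w = floor(n/2), there are C(n, w) ways to place the G/C letters, 2^w ways to choose them and 2^(n-w) ways to fill the remaining positions with A or T, that is C(n, w) · 2^n words in total. The sketch below (function name hypothetical) evaluates this count.

```python
from math import comb


def num_candidates(n):
    """Number of words of length n over {A, C, G, T} with exactly floor(n/2) G/C letters."""
    w = n // 2
    return comb(n, w) * 2 ** n


for n in (4, 8, 12, 16, 20):
    print(n, num_candidates(n))
```

For n = 4 this gives the 96 words listed in the lexicographic example.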



Seed Building local search

Seed Building (SB)

Iterative approach

A set of seed codewords is considered

The set of seed codewords is dynamically adapted through iterations

During each iteration:

• All possible codewords with the required GC-content are examined in a given order.

• Codewords are incrementally accepted if feasible with those already accepted in the current iteration and with the seed codewords.

Statistics are used to expand or contract the set of seed codewords every ItrSeed iterations, based on the quality of the solutions built.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. J. of Math. Modelling and

Algorithms 7, 311-326 (2008).

Brouwer A.E., Shearer J.B., Sloane N.J.A., Smith W.D. A new table of constant

weight codes. IEEE Trans. Inf. Theory 36, 1334-1380 (1990).
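A single Seed Building iteration can be sketched as a construction scan that starts from the seed codewords. The statistics-driven management of the seed set (every ItrSeed iterations) is not detailed on the slides and is left out here. All function names are hypothetical, and the check of a word against its own complement is an assumption.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def candidates(n):
    """All words of length n with GC-content floor(n/2), in lexicographic order."""
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return [w for w in words if sum(c in "GC" for c in w) == n // 2]


def compatible(u, v, d):
    return hamming(u, v) >= d and hamming(u, wcc(v)) >= d


def seed_building_iteration(seeds, order, d):
    """One iteration: keep the seeds, then scan `order` and accept feasible words."""
    code = list(seeds)
    for w in order:
        if hamming(w, wcc(w)) >= d and all(compatible(w, c, d) for c in code):
            code.append(w)
    return code


print(seed_building_iteration(["AACC", "ACAG"], candidates(4), 3))
```

In a full implementation the scan order would be reshuffled at every iteration, and the seed set would grow or shrink depending on the sizes of the codes obtained.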


Seed Building local search

[Figure: diagram of the method; the annotated part is the seed codewords management]


Seed Building local search

Example: n = 4, d = 3.

Constraints: HD, GC, RC

Seed codewords: AACC ACAG

Random order:

CTTC CTTG CTCA CTCT CTGA CTGT CTAC CTAG CATC CATG CACA CACT CAGA

CAGT CAAC CAAG CCTA CCTT CCAA CCAT CGTA CGTT CGAA CGAT GTTC GTTG

GTCA GTCT GTGA GTGT GTAC GTAG GATC GATG GACA GACT GAGA GAGT GAAC

GAAG GCTA GCTT GCAA GCAT GGTA GGTT GGAA GGAT TTCC TTCG TTGC TTGG

TACC TACG TAGC TAGG TCTC TCTG TCCA TCCT TCGA TCGT TCAC TCAG TGTC

TGTG TGCA TGCT TGGA TGGT TGAC TGAG ATCC ATCG ATGC ATGG AACC AACG

AAGC AAGG ACTC ACTG ACCA ACCT ACGA ACGT ACAC ACAG AGTC AGTG AGCA

AGCT AGGA AGGT AGAC AGAG

Solution: AACC ACAG CCTA GTCA TCCT


Seed Building local search

• The method works with any possible order of the candidate codewords (lexicographic, reverse lexicographic, random).

• Experiments clearly show that a random order is to be preferred for DNA codes design problems.

• The process of identifying a good set of seed codewords is intrinsically difficult => the codes produced are sometimes very good and sometimes very poor => not a very robust method

• Slow for large problems (all possible codewords are examined at each iteration!)


Clique Search local search

• Clique

Given an undirected graph G, a clique is a set of vertices in which every vertex is connected to every other vertex of the set

• Maximum clique problem

Given an undirected graph G, identify the largest clique of G (in number of vertices)

• Complexity

Classic NP-hard problem

• {0, 3, 4} is a clique

• {2, 3, 4, 5} is a

maximal clique
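The two definitions can be illustrated on a small graph. The graph drawn on the slide is not available in this transcript, so the edge list below is a hypothetical one, chosen so that {0, 3, 4} is a clique and {2, 3, 4, 5} is the largest clique. The brute-force search is only meant for tiny graphs.

```python
from itertools import combinations

# Hypothetical edge list (the slide's graph is not reproduced in the transcript).
EDGES = [(0, 1), (0, 3), (0, 4), (1, 2), (2, 3),
         (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]


def adjacency(edges):
    """Build an adjacency dictionary {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_clique(adj, nodes):
    """True if every pair of the given vertices is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def maximum_clique(adj):
    """Brute force: try subsets from the largest size down (exponential time)."""
    vertices = sorted(adj)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if is_clique(adj, subset):
                return set(subset)
    return set()


adj = adjacency(EDGES)
print(is_clique(adj, [0, 3, 4]), maximum_clique(adj))
```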


Clique Search local search

Clique Search (CS)

Iterative approach

A partial code can be completed by solving a subproblem (which is a maximum clique problem) to optimality

During each iteration:

• All possible codewords with the required GC-content are examined in a random order.

• Codewords are accepted for the second phase if feasible with those of the partial code.

• A maximum clique problem is solved on the set of accepted codewords to complete the partial code

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Math. Modelling and

Algorithms 7, 311-326 (2008).

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary

constant weight codes. IEEE Transactions on Information Theory 55(10), 4651-

4656 (2009)
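The completion step can be sketched as follows: the accepted codewords become the vertices of a graph, two of them are joined when they are mutually compatible, and a maximum clique of that graph is appended to the partial code. The sketch uses the partial code and the four candidate extensions of the n = 4, d = 3 example from these slides, and a brute-force clique search that is only practical for tiny instances. Function names are hypothetical.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def compatible(u, v, d):
    """HD and RC constraints between two words."""
    return hamming(u, v) >= d and hamming(u, wcc(v)) >= d


def maximum_compatible_subset(words, d):
    """Brute-force maximum clique of the compatibility graph on `words`."""
    for size in range(len(words), 0, -1):
        for subset in combinations(words, size):
            if all(compatible(u, v, d) for u, v in combinations(subset, 2)):
                return list(subset)
    return []


partial_code = ["CTTC", "CGAA", "TGGT", "GTGA"]
extensions = ["CACT", "AGTG", "AAGC", "GCTT"]   # feasible extensions from the example

clique = maximum_compatible_subset(extensions, 3)
print(partial_code + clique)
```

In this instance the compatibility graph is a 4-cycle, so several cliques of size 2 are tied; the pair returned by the sketch need not be the one chosen in the slides' solution, but it has the same size.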


Clique Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

Partial code: CTTC CGAA TGGT GTGA

Maximum clique problem on feasible extensions of the partial solution:

[Figure: graph on the codewords CACT, AGTG, AAGC, GCTT]


Solution: CTTC CGAA TGGT GTGA CACT GCTT


Clique Search local search

• Solving a maximum clique problem (sub-procedure) is an NP-hard problem itself!

• Heuristics have to be used for the maximum clique problem => no optimality is guaranteed for the sub-problem solutions

• The choice of the number of codewords to eliminate is crucial:

- too many codewords eliminated => very large maximum clique problem => high probability of suboptimality

- not enough codewords eliminated => very likely to find a code with the same number of codewords as the original

This aspect deserves a deeper study to tackle large problems!


Hybrid Search local search

Hybrid Search (HS)

Iterative approach

Merges the concepts of the two methods analyzed before.

A set of seed codewords is managed exactly as in Seed Building.

Seed codewords represent the partial code in the context of the Clique Search.

A relaxed distance d' < d is introduced.

A candidate codeword has to be at least at distance d from the seeds, and d' from the other candidate codewords (this keeps the maximum clique problem to a reasonable size!)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Mathematical

Modelling and Algorithms 7, 311-326 (2008).
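The candidate selection with a relaxed distance might look like the sketch below. It assumes that d' is applied to both the Hamming and the reverse-complement checks between candidates, which the slides do not spell out. The function name and the short scan order are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def compatible(u, v, dist):
    return hamming(u, v) >= dist and hamming(u, wcc(v)) >= dist


def select_candidates(order, seeds, d, d_relaxed):
    """Keep words at distance >= d from the seeds and >= d_relaxed from kept words."""
    kept = []
    for w in order:
        if (all(compatible(w, s, d) for s in seeds)
                and all(compatible(w, k, d_relaxed) for k in kept)):
            kept.append(w)
    return kept


seeds = ["CAAC", "AGAG"]
order = ["TGGT", "TCTC", "CAAG", "TGTC"]   # illustrative scan order

print(select_candidates(order, seeds, 3, 1))
print(select_candidates(order, seeds, 3, 3))
```

A larger d' keeps fewer candidates and therefore yields a smaller maximum clique problem; the kept candidates still have to pass the full distance d inside the clique search itself.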


Hybrid Search local search

[Figure: diagram of the method, with one part labelled Seed Building and one part labelled Clique Search]


Hybrid Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

Partial code (seed codewords): CAAC AGAG

Maximum clique problem on feasible extensions of the partial solution (heuristic distance d'=1 to reduce the codewords considered):

[Figure: graph on the codewords TGGT, TCTC, TGTC, TTGC, TAGG, TACG, ATGC, ACTC]


Solution: CAAC AGAG TCTC TGGT TACG ATGC


Hybrid Search local search

• Combines the advantages of Seed Building with those of Clique Search, but…

• There is the risk of combining their drawbacks instead!

• The method deserves further detailed study for larger problems


Experimental comparison of some of the heuristic algorithms

Experimental settings

Methods coded in ANSI C

Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

Maximum computation times: 10'000 seconds (2.8 hours)

Statistics over 5 runs for each combination problem/method

A_4^{Cstrs}(5,3,2) identifies the problem with constraints Cstrs (HD is always present, and therefore not listed), and with n = 5, d = 3, and GC content = floor(n/2) = 2. [this funny notation comes from coding theory…]

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Mathematical

Modelling and Algorithms 7, 311-326 (2008).


Experimental comparison of some of the heuristic algorithms

[Table of results. Legend: SB = Seed Building, CS = Clique Search, HS = Hybrid Search]



Experimental comparison of some of the heuristic algorithms

Comments

• No clear ranking is possible among the methods considered: Seed Building, Clique Search, and Hybrid Search

• Methods are therefore likely to represent different neighbourhoods


Idea

• All the methods seen until now work on the search space of feasible solutions (we never have constraints violated…)

• What if we move into the search space of infeasible solutions? => we will have to minimize (i.e. bring down to zero!) a measure of infeasibility!

• This makes it possible to develop a completely different kind of local search!

• It is likely that the search space is visited in a different way by such a family of algorithms…


Iterated Greedy Search local search

Iterated Greedy Search (IGS)

Iterative approach

Works on an infeasible code W, trying to make it feasible.

Measure of the infeasibility of W:

where w = floor(n/2)

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Mathematical

Modelling and Algorithms 7, 311-326 (2008).


Iterated Greedy Search local search

Iterated Greedy Search (IGS)

An infeasible solution is obtained by adding a random codeword to a perturbed feasible solution

During each iteration:

• A codeword σ is selected at random and the optimal (according to Inf(W)) change of one letter ("bit") of σ is carried out.

• If Inf(W) = 0, the code is feasible: we are done with the current code, and we can add a new random codeword


Iterated Greedy Search local search

[Figure: diagram of the method, with two annotated parts: perturbation of the solution and optimization of the solution]


Iterated Greedy Search local search

Example: n = 4, d = 3. Constraints: HD, GC, RC

W Inf(W)

...

TGGT GACC CGAA TCAC CCTT 1

TGGT GACT CGAA TCAC CCTT 0

TGGT GGCA CGAA TCAC CCTT TTTG 8

TGGT GGCA CGTA TCAC CCTT TTTG 8

TGGT GGCA CGTA TCAC GCTT TTTG 7

TGGT GGCA CGTC TCAC GCTT TTTG 7

TGGT AGTG CGTC TCAC GCTT TTTG 4

TGGT AGTG CGTC TCAC GCTT TTCG 3

TGGT AGTG CTTC TCAC GCTT TTCG 0

TGGT AGTG GTAG TCAC GGTT TTCG AACT 9

TGGT AGTG GTAG TCTC GGTT TTCG AACT 9

...
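The formula for Inf(W) is not legible in this transcript. The sketch below uses an assumed form: the sum of the distance shortfalls max(0, d − H) over all pairs for the HD constraint, the same shortfalls for the RC constraint (including each codeword with its own complement), plus the deviation of each codeword's GC-content from w = floor(n/2). This assumed measure reproduces the values 8, 3 and 0 for three rows of the example above, but it should not be read as the exact formula of the slides. The function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(w1, w2):
    return sum(a != b for a, b in zip(w1, w2))


def wcc(word):
    return word[::-1].translate(COMPLEMENT)


def infeasibility(code, d):
    """Assumed infeasibility measure: HD shortfalls + RC shortfalls + GC deviations."""
    w = len(code[0]) // 2
    hd = sum(max(0, d - hamming(u, v)) for u, v in combinations(code, 2))
    rc = sum(max(0, d - hamming(u, wcc(v)))
             for u, v in combinations_with_replacement(code, 2))
    gc = sum(abs(sum(c in "GC" for c in u) - w) for u in code)
    return hd + rc + gc


def best_single_change(code, index, d):
    """Best one-letter change of code[index]; applied only if it does not worsen Inf."""
    current = infeasibility(code, d)
    best_code, best_value = None, None
    word = code[index]
    for pos in range(len(word)):
        for letter in "ACGT":
            if letter == word[pos]:
                continue
            trial = list(code)
            trial[index] = word[:pos] + letter + word[pos + 1:]
            value = infeasibility(trial, d)
            if best_value is None or value < best_value:
                best_code, best_value = trial, value
    if best_value is not None and best_value <= current:
        return best_code, best_value
    return list(code), current


row = "TGGT AGTG CGTC TCAC GCTT TTCG".split()
print(infeasibility(row, 3), best_single_change(row, 2, 3))
```

In the full method the index of the codeword to change is drawn at random at every iteration; here it is passed explicitly to keep the sketch deterministic.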


Iterated Greedy Search local search

• We change exactly one letter ("bit") of a random codeword at each iteration: more complex neighbourhoods could be considered…

• We never accept changes that make the solution worse: accepting them might be a way to escape from local minima

• Further investigation is warranted…



Experimental comparison of some of the heuristic algorithms

[Table of results. Legend: SB = Seed Building, CS = Clique Search, HS = Hybrid Search, IGS = Iterated Greedy Search]



Experimental comparison of some of the heuristic algorithms

Comments

• No clear ranking is possible among the methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search

• Methods are likely to represent different neighbourhoods



A VNS algorithm for DNA codes design

A primitive Variable Neighbourhood Search (VNS) algorithm is introduced.

It iteratively runs, in turn, the local search algorithms seen before (its basic ingredients).

The reference solution for the local searches is always the best solution retrieved so far.

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes via a

variable neighbourhood search algorithm. Journal of Mathematical Modelling and

Algorithms 7, 311-326 (2008).

Montemanni, R., Smith, D.H. Heuristic algorithms for constructing binary constant

weight codes. IEEE Transactions on Information Theory 55(10), 4651-4656 (2009)

Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of

constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic

International Conference. S. Voss and M. Caserta eds., Springer (to appear)
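The control loop described above can be sketched as follows. The local searches are passed in as functions that take a code and return a code; the two toy functions at the bottom are stand-ins for illustration only and are not Seed Building, Clique Search, Hybrid Search or Iterated Greedy Search. The round limit replaces the time limit used in the experiments; all names are hypothetical.

```python
def vns(initial_code, local_searches, max_rounds):
    """Run the local searches in turn, always restarting from the best code found."""
    best = list(initial_code)
    for _ in range(max_rounds):
        for search in local_searches:
            candidate = search(best)       # reference solution = best so far
            if len(candidate) > len(best): # larger code = better solution
                best = list(candidate)
    return best


# Toy stand-ins for local searches (illustration only).
def grow_if_short(code):
    return code + ["x"] if len(code) < 2 else code


def grow_if_even(code):
    return code + ["y"] if len(code) % 2 == 0 else code


print(vns([], [grow_if_short, grow_if_even], max_rounds=3))
```

Each toy search gets stuck on its own after one or two steps, while alternating them reaches a larger result, which is the effect the VNS scheme hopes to obtain from the real local searches.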


A VNS algorithm for DNA codes design

[Figure: diagram of the algorithm, annotated with the methods involved in our implementation]


A VNS algorithm for DNA codes design

• We hope to take advantage of the different philosophies behind the local search methods listed before

• From previous experiments we know that the basic local searches visit the search space in different ways

• We hope the basic local searches will help each other to escape from local minima within a VNS framework



Experimental comparison of some of the heuristic algorithms

[Table of results. Legend: SB = Seed Building, CS = Clique Search, HS = Hybrid Search, IGS = Iterated Greedy Search, VNS = Variable Neighbourhood Search]


Page 60: DNA Code Design for DNA Computing - ICoICT 20162016.icoict.org/wp-content/uploads/presentation/Tutorial-Roberto-Mo… · 10 DNA Codes Design Problem description Input data: •The

60

Experimental comparison of some of the heuristic algorithms

Comments

• No clear ranking is possible among the basic methods considered: Seed Building, Clique Search, Hybrid Search and Iterated Greedy Search (as seen before…)

   - The methods are likely to represent different neighbourhoods

• Variable Neighbourhood Search clearly dominates the other methods

   - VNS takes advantage of the different neighbourhoods

   - VNS is likely to be competitive against all the other methods!

Experimental results of VNS

The VNS algorithm (reference algorithm) discussed in:

• Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326.

is compared with the methods discussed in the following 6 papers [which provide all the best known codes]:

• Li, M., Lee, H. J., Condon, A. E., and Corn, R. M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

• Tulpan, D. C., Hoos, H. H., and Condon, A. E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, 2568, 229-241.

• Tulpan, D. C. and Hoos, H. H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, 2671, 418-433.

• King, O. D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33.

• Gaborit, P. and King, O. D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113.

• Chee, Y. M. and Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394.

[Slide annotations grouping the papers above: Theoretical constructions; Heuristic algorithms]


Experimental results of VNS

Experimental settings

• Methods coded in ANSI C

• Experiments on Dual AMD Opteron 250 2.4GHz / 4GB RAM machines

• Maximum computation time: 100'000 seconds (27.8 hours)

   => Comparable with that of other heuristic algorithms

• Best over 5 runs for each combination problem/method

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Mathematical

Modelling and Algorithms 7, 311-326 (2008).


Experimental results of VNS

• We will consider 254 problems with

   - 4 ≤ n ≤ 20

   - 3 ≤ d ≤ n ≤ 20

   - Case 1: HD and GC constraints

   - Case 2: HD, RC and GC constraints

• These settings match those of the state-of-the-art tables maintained at http://llama.med.harvard.edu/~king/dnacodes.html by O.D. King (last checked November 2009)

• We left out problems corresponding to very large codes (the current VNS algorithm cannot tackle them)
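As a rough illustration of the size of this problem set, the Python sketch below enumerates every (case, n, d, w) combination allowed by the stated ranges. It assumes that the GC content is fixed to floor(n/2) for every n, as in the (5,3,2) example earlier in the talk. The function name and case labels are hypothetical.

```python
def problem_instances(n_min=4, n_max=20, d_min=3):
    """List (case, n, d, w) for all n, d in range, with w = floor(n/2) (assumed)."""
    cases = ("HD+GC", "HD+RC+GC")
    return [(case, n, d, n // 2)
            for case in cases
            for n in range(n_min, n_max + 1)
            for d in range(d_min, n + 1)]


if __name__ == "__main__":
    instances = problem_instances()
    print(len(instances))   # 340
    print(instances[:3])
```

Under this reading the ranges allow 340 combinations, so the 254 problems actually considered are the ones remaining after the very large codes are left out.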


Experimental results of VNS

• Over 254 problems considered:

   - in 128 cases the best known result is matched

   - in 52 cases a new best result is found

Montemanni R., Smith D.H. Construction of constant GC-content DNA codes

via a variable neighbourhood search algorithm. Journal of Mathematical

Modelling and Algorithms 7, 311-326 (2008).

Detailed results of VNS

[Tables of detailed results not recoverable from the transcript]

Experimental results of VNS

• After the publication of the paper we have been improving the VNS algorithms in many ways (work still in progress!)

• Over 254 problems considered:

   - in 132 cases (previously 128) the best known result is matched

   - in 87 cases (previously 52) a new best result is found

• We miss the best known solution in only 13.8% of the cases!

• We feel there is room for further improvements…
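The 13.8% figure follows directly from the counts on the slides. A short Python check, using only the numbers reported in the talk (254 problems; 128 matched and 52 improved in the published paper; 132 matched and 87 improved afterwards), with a hypothetical helper name:

```python
def missed_share(total, matched, improved):
    """Return (number of problems where the best known code is missed, percentage)."""
    missed = total - matched - improved
    return missed, 100.0 * missed / total


if __name__ == "__main__":
    for label, matched, improved in (("published paper", 128, 52),
                                     ("improved version", 132, 87)):
        missed, pct = missed_share(254, matched, improved)
        print(f"{label}: {missed} missed ({pct:.1f}%)")
    # published paper: 74 missed (29.1%)
    # improved version: 35 missed (13.8%)
```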

Montemanni, R., Smith D.H. Metaheuristics for the construction of constant GC-

content DNA codes. Proceedings of the MIC 2009 Conference (2009)

Montemanni, R., Smith, D.H., Koul, N. Three metaheuristics for the construction of

constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic

International Conference. S. Voss and M. Caserta eds., Springer (to appear)


Detailed results of VNS

Comments

• VNS works (slightly) better on problems with RC constraints

• This result is also confirmed by our latest improved implementations

• Is this because the other methods are more competitive without RC constraints?

   YES => we may have little chance to improve on problems without RC constraints

   NO => we probably have a chance to improve on problems without RC constraints

   => Worth investigating!


Essential bibliography (1/4)

[HEUR] => Heuristics-related publication.

Brenner, S., Lerner, R.A. (1992). Encoded combinatorial chemistry. Proceedings of the National Academy of Sciences USA, 89, 5381-5383.

Adleman, L. (1994). Molecular computation of solutions to combinatorial problems. Science, 266, 1021-1024.

Frutos, A.G., Liu, Q., Thiel, A.J., Sanner, A.M.W., Condon, A.E., Smith, L.M., Corn, R.M. (1997). Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 25, 4748-4757.

Hansen, P., Mladenovic, N. (2001). Variable neighbourhood search: principles and applications. European Journal of Operational Research, 130, 449-467. [HEUR]

Marathe, A., Condon, A.E., Corn, R.M. (2001). On combinatorial DNA word design. Journal of Computational Biology, 8, 201-219.

Arita, M., Kobayashi, S. (2002). DNA sequence design using templates. New Generation Computing, 20, 263-277.


Essential bibliography (2/4)

Li, M., Lee, H.J., Condon, A.E., Corn, R.M. (2002). DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA microarrays. Langmuir, 18, 805-812.

Tulpan, D.C., Hoos, H.H., Condon, A.E. (2002). Stochastic local search algorithms for DNA word design. Lecture Notes in Computer Science, Springer, Berlin, 2568, 229-241. [HEUR]

Tulpan, D.C., Hoos, H.H. (2003). Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science, Springer, Berlin, 2671, 418-433. [HEUR]

King, O.D. (2003). Bounds for DNA codes with constant GC-content. Electronic Journal of Combinatorics, 10, #R33. [HEUR]

Kobayashi, S., Konto, T., Arita, M. (2003). On template methods for DNA sequence design. Lecture Notes in Computer Science, 2568, 205-214.

Hoos, H.H., Stuetzle, T. (2004). Stochastic Local Search: foundations and applications. Morgan Kaufmann/Elsevier. [HEUR]


Essential bibliography (3/4)

Gaborit, P., King, O.D. (2005). Linear construction for DNA codes. Theoretical Computer Science, 334, 99-113. [HEUR]

Tulpan, D.C. (2006). Effective heuristic methods for DNA strand design. PhD thesis, University of British Columbia. [HEUR]

King, O.D. (2006). Tables of lower bounds for DNA codes with constant GC-content. http://llama.med.harvard.edu/~king/dnacodes.html, last checked: November 2009. [HEUR]

Chee, Y.M., Ling, S. (2008). Improved lower bounds for constant GC-content DNA codes. IEEE Transactions on Information Theory, 54(1), 391-394. [HEUR]

Montemanni, R., Smith, D.H. (2008). Construction of constant GC-content DNA codes via a Variable Neighbourhood Search Algorithm. Journal of Mathematical Modelling and Algorithms, 7, 311-326. [HEUR]

Montemanni, R., Smith, D.H. (2009). Heuristic algorithms for constructing binary constant weight codes. IEEE Transactions on Information Theory, 55(10), 4651-4656. [HEUR]

Montemanni, R., Smith, D.H. (2009). Metaheuristics for the construction of constant GC-content DNA codes. Proceedings of the MIC 2009 Conference. [HEUR]


Essential bibliography (4/4)

Montemanni, R., Smith, D.H., Koul, N. (2010). Three metaheuristics for the construction of constant GC-content DNA codes. Post-proceedings of the VIII Metaheuristic International Conference. S. Voss and M. Caserta eds., Springer. [HEUR]

Tulpan, D., Montemanni, R., Ghiggi, A. (2010). Computational Sequence Design Techniques for DNA Microarray Technologies. Submitted for publication. [HEUR]

Ghiggi, A. (2010). DNA strand design with thermodynamic constraints. Master thesis, USI. [HEUR]

Koul, N. (2010). Heuristic Algorithms for Construction of Constant GC content DNA codes. Master thesis, USI. [HEUR]

Neelakandan, I. (2010). New Approaches for Constructing Constant Weight Binary Codes. Master thesis, USI. [HEUR]