CANADA: Designing Nucleic Acid Sequences for Nanobiotechnology Applications



Software News and Update

CANADA: Designing Nucleic Acid Sequences for Nanobiotechnology Applications

UDO FELDKAMP

Faculty of Chemistry, Technical University Dortmund, Otto-Hahn-Str. 6, 44221 Dortmund, Germany

Received 12 February 2009; Revised 29 April 2009; Accepted 12 May 2009

DOI 10.1002/jcc.21353

Published online 15 June 2009 in Wiley InterScience (www.interscience.wiley.com).

Abstract: The design of nucleic acid sequences for a highly specific and efficient hybridization is a crucial step in DNA computing and DNA-based nanotechnology applications. The CANADA package contains software tools for designing DNA sequences that meet these and other requirements, as well as for analyzing and handling sequences. CANADA is freely available, including a detailed manual and example input files, at http://ls11-www.cs.uni-dortmund.de/molcomp/downloads.

© 2009 Wiley Periodicals, Inc. J Comput Chem 31: 660–663, 2010

Key words: CANADA; DNA computing; nanobiotechnology; DNA sequence design; specific hybridization

Introduction

Many applications in nanobiotechnology and most in vitro implementations of DNA computing algorithms rely on specific and efficient hybridization of Watson-Crick-complementary oligonucleotides [1–3]. Designing suitable sequences by hand is impractical for three reasons. First, the search space is huge: there are $4^n$ oligonucleotides of length $n$, and $4^{mn}$ sets containing $m$ sequences of length $n$. Second, unspecific hybridizations are difficult to predict or avoid. Third, several additional, possibly conflicting, requirements can be posed, e.g., duplex stability restrictions, fixed subsequences, or subsequences that must not appear in any oligonucleotide making up the superstructure of interest. The use of computer software to find sequence sets meeting all desired requirements is therefore mandatory. CANADA, the computer aided nucleic acid design package, comprises software tools for designing, handling, and analyzing nucleotide sequences.

Software Description

The sequence design program in the CANADA package, the DNA sequence compiler (dsc), produces sets of $n_b$-unique base sequences, in which no subsequence of length $n_b$ occurs twice, and no subsequence that is complementary to a subsequence of length $n_b$ appears anywhere in the set. Thus, for small enough $n_b$, the sequences in such a set are dissimilar with respect to each other and with respect to their Watson-Crick complements, and the danger of cross hybridization is minimized. Since different parts of the same sequence also must not share such subsequences, formation of single-stranded secondary structures is also avoided. In addition, homogeneous hybridization efficiency can be enforced by restricting GC-ratio, melting temperature ($T_m$), or Gibbs free energy difference ($\Delta G$) of the hybridization reaction. $T_m$ and $\Delta G$ are estimated with the nearest-neighbor model, for which the user may choose among several different published parameter sets, including the state-of-the-art set from the SantaLucia group [4]. Furthermore, subsequences can be preset, e.g., when inserting restriction sites or other primary structure motifs. The occurrence of such subsequences can also be prohibited. In addition, dsc can handle concatenations of sequences, e.g., when the strands are parts of larger constructs required for an in vitro application. Subsequences of length $n_b$ overlapping two (or more, in the case of sequences shorter than $n_b$) concatenated sequences are also regarded with respect to $n_b$-uniqueness by dsc. It has been shown in vitro that $n_b$-uniqueness is a reasonable concept for enforcing specific hybridization [5, 6].
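The definition of $n_b$-uniqueness given above can be made concrete with a small checker. The following Python sketch is not part of CANADA and does not reproduce its nb_unique tool; the function names are hypothetical, and it assumes that a self-complementary tuple counts as a violation because it is complementary to itself. Tuples spanning concatenated sequences are not considered.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Watson-Crick complement of seq, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def nb_violations(sequences, nb):
    """Return the sorted nb-tuples that break nb-uniqueness in a set.

    A tuple is reported if it occurs more than once, or if its
    Watson-Crick complement also occurs anywhere in the set.
    """
    positions = defaultdict(list)
    for index, seq in enumerate(sequences):
        seq = seq.lower()
        for start in range(len(seq) - nb + 1):
            positions[seq[start:start + nb]].append((index, start))

    bad = set()
    for tup, where in positions.items():
        if len(where) > 1:                         # repeated subsequence
            bad.add(tup)
        if reverse_complement(tup) in positions:   # complementary subsequence
            bad.add(tup)
    return sorted(bad)


# "acg" and "cgt" are complementary to each other, so both are reported.
print(nb_violations(["acgtt"], 3))
```

Applied to the six section sequences listed in Table 2 with $n_b = 4$, this check reports no violations among the isolated sections.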

The design tool enforces $n_b$-uniqueness by arranging all possible $n_b$-tuples in a graph, with edges between $n_b$-tuples that may appear as consecutive overlapping subsequences. Thus, the search for $n_b$-unique sequences becomes a search for vertex-disjoint paths through this graph [7]. Since this problem is NP-hard even to approximate [8], and existing approximation algorithms do not take into account that for each visited vertex the vertex with the complementary subsequence must not be used either, dsc takes a simple random-driven approach. In each step, dsc randomly chooses one of the successors of the current vertex that are not yet visited in any path. If a vertex has no more unused successors, backtracking is applied to find an alternative path. The program is based on an older version [7], but while the former version of dsc was limited to the generation of sequences for linear assemblies, one can now design oligomers for the assembly of arbitrary structural motifs, e.g., branched junctions [9], double crossover tiles [10], or 4 × 4 tiles [11]. Motifs of the latter type consisting of sequences designed with dsc were successfully assembled in vitro [12]. Particular strands can even play different roles in such a motif, i.e., they can appear multiple times at different positions in the motif, allowing for sequence symmetry where intended [13].

Correspondence to: U. Feldkamp; e-mail: [email protected]

Contract/grant sponsors: European Union (Project NUCAN Nucleic Base Nanostructures, STREP 013775), Zentrum für Angewandte Chemische Genomik (a joint research initiative founded by the European Union and the Ministry of Innovation and Research of the state Northrhine Westfalia) and Deutsche Forschungsgemeinschaft (FE 943/1-1)

When strict enforcement of $n_b$-uniqueness seems too restrictive for a successful sequence generation, a parameter-controlled number of errors, i.e., multiple occurrences of subsequences, can be allowed. This can be necessary, e.g., when a DNA strand can be concatenated to several other strands, in which case reuse of $n_b$-long subsequences may be unavoidable.
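The random path search with backtracking described above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the dsc implementation: it generates independent sequences only, ignores concatenations, GC-ratio, melting temperature, and allowed errors, treats self-complementary tuples as forbidden, and adds a step budget (`max_steps`, a hypothetical parameter) so that an unlucky search terminates. All function names are illustrative.

```python
import random

BASES = "acgt"
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq):
    """Watson-Crick complement of seq, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def design_sequence(length, nb, used, rng, max_steps=100_000):
    """Grow one sequence base by base; return None if no path is found."""
    budget = [max_steps]   # shared step counter that bounds the search
    local = set()          # nb-tuples visited by the path under construction

    def allowed(tup):
        rc = revcomp(tup)
        # Assumption: a self-complementary tuple counts as a violation.
        return tup != rc and not ({tup, rc} & used) and not ({tup, rc} & local)

    def extend(seq):
        if len(seq) == length:
            return seq
        if budget[0] <= 0:
            return None
        budget[0] -= 1
        for base in rng.sample(BASES, len(BASES)):   # random successor order
            cand = seq + base
            tup = cand[-nb:] if len(cand) >= nb else None
            if tup is not None:
                if not allowed(tup):
                    continue            # vertex or its complement is taken
                local.add(tup)
            result = extend(cand)
            if result is not None:
                return result
            if tup is not None:
                local.discard(tup)      # dead end: backtrack, try next base
        return None

    seq = extend("")
    if seq is not None:
        used.update(local)              # block these vertices for later paths
    return seq


def design_set(count, length, nb, seed=None):
    """Design `count` mutually nb-unique sequences, or return None on failure."""
    rng = random.Random(seed)
    used, result = set(), []
    for _ in range(count):
        seq = design_sequence(length, nb, used, rng)
        if seq is None:
            return None
        result.append(seq)
    return result


print(design_set(4, 10, 4, seed=1))
```

The recursion depth equals the sequence length, which is unproblematic for short oligonucleotides; an iterative formulation would be preferable for long strands.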

While the size of the graph, and thus the worst-case runtime, grows exponentially with $n_b$, the best-case runtime is linear in the number and length of sequences. This contrasts with optimization methods such as stochastic search, where a quadratic number of quadratic-time sequence comparisons has to be executed in each optimization step. In practice, the exponential worst-case runtime does not lead to problems in most applications. Technically, the implementation allows unique subsequence lengths of $n_b \le 16$. However, it is usually not sensible to choose values $\ge 8$, because this would allow overly long stretches of identical or complementary bases. Even for $n_b = 6$, dsc may run for hours in the worst case, but everyday practice with the software suggests that such worst cases occur very rarely. For example, when designing the structural motif described below, runs where dsc failed to find all sequences as well as successful runs always needed less than one second on a Pentium 4, 2.4 GHz, with 1 GB RAM. As demonstrated in a technical report currently in preparation (Feldkamp), the time needed depends not only on the target structure but also on parameter settings and even on the order in which dsc generates the sequences. For the generation of structural motifs for real-world applications examined there, most runs of dsc were finished within a few minutes or even within seconds; only some rare outliers needed more time.
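To give a feeling for the numbers involved, the following Python sketch computes the size of the tuple graph and a simple upper bound on how many $n_b$-tuples an $n_b$-unique set can contain. The bound is derived here from the definition alone and is not taken from the CANADA documentation: each tuple used blocks its complement, and self-complementary tuples (which exist only for even $n_b$) are assumed to be unusable. The function names are hypothetical.

```python
def tuple_graph_size(nb):
    """Number of vertices in the graph of all nb-tuples."""
    return 4 ** nb


def self_complementary_tuples(nb):
    """Tuples equal to their own Watson-Crick complement (even nb only)."""
    return 4 ** (nb // 2) if nb % 2 == 0 else 0


def max_usable_tuples(nb):
    """Upper bound on distinct nb-tuples in an nb-unique set."""
    return (4 ** nb - self_complementary_tuples(nb)) // 2


def max_sequences(length, nb):
    """Upper bound on independent sequences of a given length.

    Each sequence of this length contains length - nb + 1 tuples.
    The bound need not be attainable in practice.
    """
    return max_usable_tuples(nb) // (length - nb + 1)


for nb in (4, 6, 8, 16):
    print(nb, tuple_graph_size(nb), max_usable_tuples(nb))
```

For $n_b = 4$ the graph has 256 vertices, of which at most 120 can appear in one sequence set, whereas for $n_b = 16$ the graph already has more than four billion vertices.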

For specification of the nucleotide sequences and their desired properties and restrictions, a description language for nucleic acids (DeLaNA) has been developed as the input file format of the design tool. In DeLaNA, the user can define sequence objects and specify their properties like length, GC-ratio, or fixed subsequences (Fig. 1b). For a subset of sequences sharing some or all property restrictions, sequence prototypes can be defined, which are later instantiated to proper sequence objects. In the definition of the latter, prototypical property specifications can also be overwritten, allowing for deviations from the prototype. A detailed description of DeLaNA and several example files specifying pools of independent sequences and strand sets for different structural motifs are provided with CANADA.

Besides the design program, CANADA also contains software tools for batch conversion of sequences to their reverse or complementary sequences, calculating thermodynamic properties such as $T_m$ or $\Delta G$ for pools of sequences, estimating dissimilarity of sequence pairs using several different distance measures or similarity of sequence pairs using global alignment, computing complementarity of sequence pairs with a simple dynamic programming duplex stability prediction method, or generating sets of random sequences. A complete list of all tools is given in Table 1.
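The kind of operations these handling tools perform can be sketched in Python. The functions below are illustrative stand-ins, not the CANADA command-line tools or their interfaces, and the Hamming distance is only one common distance measure; the text does not state which measures the package implements.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse(seq):
    """Sequence read in the opposite direction."""
    return seq[::-1]


def complement(seq):
    """Base-wise Watson-Crick complement, same reading direction."""
    return seq.lower().translate(COMPLEMENT)


def reverse_complement(seq):
    """Complementary strand written 5'->3'."""
    return reverse(complement(seq))


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.lower()
    return (seq.count("g") + seq.count("c")) / len(seq)


def hamming_distance(a, b):
    """Number of mismatching positions between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.lower(), b.lower()))


pool = ["aacggcggttt", "tggaacagag"]
for seq in pool:
    print(seq, reverse_complement(seq), round(gc_content(seq), 3))
```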

Figure 1. The DAE-DX tile [10]. (a) Sketch of the structural motif. Arrows indicate the backbone of DNA strands, pointing towards the 3′-end; gray lines symbolize basepairs (their number does not necessarily match the real number of basepairs in the structure). The motif consists of five strands that can be deconstructed into six sections (A–F) and their complements. (b) DeLaNA file describing the motif. First, the GC-ratio is restricted to values close to 50% for all sequences of type oligo. Then, the sequence objects for the six sections in (a) are specified with their correct length. Because they are all of type oligo, the GC-ratio restriction applies to all six sections. The 4WJ statements define how the sections are concatenated to form four-way junctions that serve as crossover points in this motif. Finally, in the pool object some undesired subsequences, the length of unique base tuples, and some conditions and parameters for melting temperature calculation are specified.

All tools in CANADA were implemented in C++. Executables are available for Windows. They run in the command-line shell, which supports use of the tools in shell scripts or batch files. Graphical user interfaces will be constructed soon. Also in progress are further improvements of the design and analysis tools.

Applications

The sequence design software can be employed for a broad range of applications in DNA computing and nanobiotechnology. So far, it has been used by the author for the design of separate and concatenated oligonucleotides for the DNA-directed immobilization of proteins on a microarray [5, 6], 4 × 4 tiles as a rather complex example of a structural motif [12], and several unpublished structures. Designs of the first type are usually needed for DNA computing applications, where the oligonucleotides are called DNA words and may encode data and processing information. Structural motifs are used as building blocks for two-dimensional scaffolds for the nanometer-precise positioning of proteins or other molecules, for three-dimensional cages that may enclose drugs for improved cellular incorporation, and as structural states between which transformations may be triggered by an external stimulus [3].

Table 1. Tools in CANADA.

Tool              Short description

Sequence design tool
  dsc             DNA sequence compiler

DNA handling tools
  Align           Global alignment and duplex stability
  Clean_out       Removes nonbase characters
  Complement      Converts to Watson-Crick complements
  Duplex2hairpin  Links two sequences of a duplex to a hairpin loop
  gc              Calculates GC content
  nb_unique       Searches for nonunique subsequences
  rand_seqs       Generates random sequences
  Reverse         Converts to reverse sequences
  seq_dist        Calculates distances between sequence pairs
  seq_dist2       As seq_dist, but for all possible pairs in a pool
  Thermo          Calculates thermodynamic properties such as melting temperature or free energy

Other tools
  corr_coeff      Calculates linear correlation coefficients
  rank_corr       Calculates Spearman rank correlation coefficients

As a small demonstration, we will here use dsc to design oligomers for a well-known structural motif, the DAE-DX tile [10]. Figure 1a shows a sketch of this motif. It consists of two double helices connected by two crossover points. It is formed in vitro by self-assembly of five DNA strands (colored arrows). For design purposes, it is sensible to deconstruct these strands into six different sections (A–F) and their Watson-Crick complements. For example, the red strand in Figure 1a consists of sections A, B, and C, and the blue strand is the concatenation of the Watson-Crick complements of sections F and A. These sections are generated by dsc, which takes into account how they are connected in the motif.

The length and other properties of the sections are specified in the DeLaNA input file (Figure 1b). The first statement defines a type of sequence objects called “oligo” that has a GC-content close to 50%. Because some sections are 11 bases long, it is not possible to restrict the GC-content to exactly 50%. The lines starting with // are comments that make the file easier to read for human users, but which are ignored by dsc. The six lines starting with the word “oligo” define the sequence objects that represent the six sections A through F, specifying the proper number of nucleotides for each section. Because they are all instances of the “oligo” sequence type, the GC-ratio restriction applies to all sections. The 4WJ statements define which sections are connected to four-way junctions that here act as crossover points. Each of the four arms of a junction is identified in such a statement by its strand that has its 5′-end lying towards the branching point. If a sequence object (e.g., A in the first 4WJ statement) has its 5′-end oriented away from the junction, its complement must be specified in the 4WJ statement. This statement tells dsc not only that, e.g., sections A and B are concatenated in the in vitro application, and therefore uniqueness of base tuples overlapping both sections must be regarded, but also that sequence symmetry around the branching point must be avoided in order to prevent the branching point from migrating [9]. The pool object defines some general conditions that apply to all sequences. Long repetitions of guanine are forbidden because they tend to form quadruplexes. The length $n_b$ of unique base tuples is set to 4. The last three lines define buffer conditions, method, and parameter set used to calculate melting temperatures. Here, the nearest-neighbor method with the unified parameter set from the SantaLucia group [4] is chosen.

Table 2. Sequences for the DAE-DX Motif Found by dsc.

Section  Sequence
A        aacggcggttt
B        tggaacagag
C        atacgcaatgg
D        agaagttgagc
E        gactaagacc
F        actgaaagcca

Figure 2. Result of sequence design. (a) Part of the DeLaNA output file. Ellipses indicate parts of the file removed for the sake of brevity. The descriptions of two sequence sections with their actual properties such as GC-ratio, melting temperature, free energy difference of hybridization, and the base sequence are shown. (b) The designed sequences arranged in shape of the desired DX motif. The sequences’ colors correspond to those in Figure 1.

The design tool dsc was run 100 times. Since the combination of all these restrictions is quite difficult to fulfill, dsc found matching sequences in only three of these runs. Table 2 shows the sequences found in one of the successful runs. These sequences and their complements can now be concatenated to full oligomer strands according to Figure 1a, which can then be synthesized or ordered for in vitro self-assembly. Figure 2a shows part of the output file in DeLaNA format, describing two of the six sequence objects. The description contains the base sequence found by dsc, the actual GC-content that meets the restrictions given in the input file, melting temperature ($T_m$), and free energy difference of hybridization ($\Delta G$). The nucleic acid type is specified as DNA because DeLaNA can also be used to describe RNA molecules. How the sequences form the desired DAE-DX tile is shown in Figure 2b.
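How a nearest-neighbor estimate of the free energy difference is assembled can be shown with a short Python sketch. It is an illustration only and not the Thermo tool or the dsc implementation: it assumes the unified $\Delta G^\circ_{37}$ values (kcal/mol, 1 M NaCl) of reference [4], transcribed here by hand and to be checked against the publication before use, and it applies no salt or temperature correction. The function names are hypothetical.

```python
# Assumed unified nearest-neighbor free energies at 37 °C in kcal/mol,
# keyed by the dinucleotide read 5'->3' on one strand. Each stack appears
# under both strand readings (e.g. "ca" and "tg" are the same stack).
NN_DG37 = {
    "aa": -1.00, "tt": -1.00,
    "at": -0.88,
    "ta": -0.58,
    "ca": -1.45, "tg": -1.45,
    "gt": -1.44, "ac": -1.44,
    "ct": -1.28, "ag": -1.28,
    "ga": -1.30, "tc": -1.30,
    "cg": -2.17,
    "gc": -2.24,
    "gg": -1.84, "cc": -1.84,
}
INIT_TERMINAL_GC = 0.98   # initiation term for a terminal G-C pair
INIT_TERMINAL_AT = 1.03   # initiation term for a terminal A-T pair
SYMMETRY = 0.43           # correction for self-complementary duplexes

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def duplex_dg37(seq):
    """Estimate the free energy of a sequence paired with its perfect complement."""
    seq = seq.lower()
    # Sum the stacking contributions of all overlapping dinucleotides.
    total = sum(NN_DG37[seq[i:i + 2]] for i in range(len(seq) - 1))
    # Add one initiation term per duplex end.
    for end in (seq[0], seq[-1]):
        total += INIT_TERMINAL_GC if end in "gc" else INIT_TERMINAL_AT
    if seq == reverse_complement(seq):
        total += SYMMETRY
    return total


print(round(duplex_dg37("cgttga"), 2))  # -5.35
```

The printed value follows directly from the table: five stacking terms summing to -7.36 plus the two initiation terms 0.98 and 1.03.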

Acknowledgments

I gratefully acknowledge Prof. Christof M. Niemeyer for his support.

References

1. Seeman, N. C. Nature 2003, 421, 427.
2. Condon, A. Nat Rev Genet 2006, 7, 565.
3. Feldkamp, U.; Niemeyer, C. M. Angew Chem Int Ed 2006, 45, 1856.
4. SantaLucia, J., Jr. Proc Natl Acad Sci USA 1998, 95, 1460.
5. Feldkamp, U.; Wacker, R.; Schroeder, H.; Banzhaf, W.; Niemeyer, C. M. ChemPhysChem 2004, 5, 367.
6. Feldkamp, U.; Schroeder, H.; Niemeyer, C. M. J Biomol Struct Dyn 2006, 23, 657.
7. Feldkamp, U.; Rauhe, H.; Banzhaf, W. Gene Program Evolvable Mach 2003, 4, 153.
8. Guruswami, V.; Khanna, S.; Rajaraman, R.; Shepherd, B.; Yannakakis, M. J Comput Syst Sci 2003, 67, 473.
9. Seeman, N. C.; Kallenbach, N. R. Biophys J 1983, 44, 201.
10. Fu, T.; Seeman, N. C. Biochemistry 1993, 32, 3211.
11. Yan, H.; Park, S. H.; Finkelstein, G.; Reif, J. H.; LaBean, T. H. Science 2003, 301, 1882.
12. Saccà, B.; Meyer, R.; Feldkamp, U.; Schroeder, H.; Niemeyer, C. M. Angew Chem Int Ed 2008, 47, 2135.
13. He, Y.; Tian, Y.; Chen, Y.; Deng, Z.; Ribbe, A. E.; Mao, C. Angew Chem Int Ed 2005, 44, 6694.
