allert2010-multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

14
Multifactorial Determinants of Protein Expression in Prokaryotic Open Reading Frames Malin Allert, J. Colin Cox and Homme W. HellingaDepartment of Biochemistry, Duke University Medical Center, Box 3711, Durham, NC 27710, USA Received 11 May 2010; received in revised form 27 July 2010; accepted 5 August 2010 Available online 18 August 2010 Edited by J. Weissman Keywords: protein expression; nucleotide composition; mRNA secondary structure; codon usage; synthetic genes A quantitative description of the relationship between protein expression levels and open reading frame (ORF) nucleotide sequences is important for understanding natural systems, designing synthetic systems, and optimiz- ing heterologous expression. Codon identity, mRNA secondary structure, and nucleotide composition within ORFs markedly influence expression levels. Bioinformatic analysis of ORF sequences in 816 bacterial genomes revealed that these features show distinct regional trends. To investigate their effects on protein expression, we designed 285 synthetic genes and determined corresponding expression levels in vitro using Escherichia coli extracts. We developed a mathematical function, parameterized using this synthetic gene data set, which enables computation of protein expression levels from ORF nucleotide sequences. In addition to its practical application in the design of heterologous expression systems, this equation provides mechanistic insight into the factors that control translation efficiency. We found that expression is strongly dependent on the presence of high AU content and low secondary structure in the ORF 5region. Choice of high-frequency codons contributes to a lesser extent. The 3terminal AU content makes modest, but detectable contributions. We present a model for the effect of these factors on the three phases of ribosomal function: initiation, elongation, and termination. © 2010 Elsevier Ltd. All rights reserved. Introduction Quantitative description of the factors that deter- mine protein expression levels is central to under- standing natural systems, 1 designing synthetic systems, 2,3 and optimizing heterologous expression. 4 Protein expression is a complex, multi-step process involving transcription, mRNA turnover, translation, post-translational processing, and protein stability. Although much of the information controlling expression levels is encoded in untranslated regions of bacterial genes, 5,6 sequence variation in open reading frames (ORFs) also can have profound effects. 79 The latter is mediated through the presence or absence of recognition sequences for stimulatory or inhibitory factors such as RNA- binding proteins 10,11 or non-coding RNAs 12 and, more generally, through variation of three features: codon identity, levels of mRNA secondary structure, and nucleotide composition. A quantitative description of the relationship between protein expression levels and ORF se- quence features has remained elusive. An average ORF in Escherichia coli potentially can adopt 10 159 iso-coding sequences (this study). Experimental exploration of such enormous sequence spaces to define the relationship between sequence and protein expression levels is very challenging. 9 Powerful computational algorithms have been developed to solve the class of huge discrete *Corresponding author. E-mail address: [email protected]. Abbreviations used: ORF, open reading frame; TnT, transcription and translation; CAI, codon adaptation index; LC, liquid chromatography; MS, mass spectrometry. doi:10.1016/j.jmb.2010.08.010 J. Mol. Biol. (2010) 402, 905918 Contents lists available at www.sciencedirect.com Journal of Molecular Biology journal homepage: http://ees.elsevier.com.jmb 0022-2836/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.

Upload: j-colin-cox

Post on 14-Apr-2017

84 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

doi:10.1016/j.jmb.2010.08.010 J. Mol. Biol. (2010) 402, 905–918

Contents lists available at www.sciencedirect.com

Journal of Molecular Biologyj ourna l homepage: ht tp : / /ees.e lsev ie r.com. jmb

Multifactorial Determinants of Protein Expression inProkaryotic Open Reading Frames

Malin Allert, J. Colin Cox and Homme W. Hellinga⁎Department of Biochemistry, Duke University Medical Center, Box 3711, Durham, NC 27710, USA

Received 11 May 2010;received in revised form27 July 2010;accepted 5 August 2010Available online18 August 2010

Edited by J. Weissman

Keywords:protein expression;nucleotide composition;mRNA secondary structure;codon usage;synthetic genes

*Corresponding author. E-mail [email protected] used: ORF, open re

transcription and translation; CAI, cindex; LC, liquid chromatography; Mspectrometry.

0022-2836/$ - see front matter © 2010 E

A quantitative description of the relationship between protein expressionlevels and open reading frame (ORF) nucleotide sequences is important forunderstanding natural systems, designing synthetic systems, and optimiz-ing heterologous expression. Codon identity, mRNA secondary structure,and nucleotide composition within ORFs markedly influence expressionlevels. Bioinformatic analysis of ORF sequences in 816 bacterial genomesrevealed that these features show distinct regional trends. To investigatetheir effects on protein expression, we designed 285 synthetic genes anddetermined corresponding expression levels in vitro using Escherichia coliextracts. We developed a mathematical function, parameterized using thissynthetic gene data set, which enables computation of protein expressionlevels from ORF nucleotide sequences. In addition to its practicalapplication in the design of heterologous expression systems, this equationprovides mechanistic insight into the factors that control translationefficiency. We found that expression is strongly dependent on the presenceof high AU content and low secondary structure in the ORF 5′ region.Choice of high-frequency codons contributes to a lesser extent. The 3′terminal AU content makes modest, but detectable contributions. Wepresent a model for the effect of these factors on the three phases ofribosomal function: initiation, elongation, and termination.

© 2010 Elsevier Ltd. All rights reserved.

Introduction

Quantitative description of the factors that deter-mine protein expression levels is central to under-standing natural systems,1 designing syntheticsystems,2,3 and optimizing heterologous expression.4

Protein expression is a complex, multi-step processinvolving transcription, mRNA turnover, translation,post-translational processing, and protein stability.Although much of the information controllingexpression levels is encoded in untranslated regions

ess:

ading frame; TnT,odon adaptationS, mass

lsevier Ltd. All rights reserve

of bacterial genes,5,6 sequence variation in openreading frames (ORFs) also can have profoundeffects.7–9 The latter is mediated through thepresence or absence of recognition sequences forstimulatory or inhibitory factors such as RNA-binding proteins10,11 or non-coding RNAs12 and,more generally, through variation of three features:codon identity, levels of mRNA secondary structure,and nucleotide composition.A quantitative description of the relationship

between protein expression levels and ORF se-quence features has remained elusive. An averageORF in Escherichia coli potentially can adopt ∼10159iso-coding sequences (this study). Experimentalexploration of such enormous sequence spaces todefine the relationship between sequence andprotein expression levels is very challenging.9

Powerful computational algorithms have beendeveloped to solve the class of huge discrete

d.

Page 2: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

906 Determinants of Protein Expression in ORFs

combinatorial searches that arise in optimizingcodon choice and can be applied to design syntheticsequences for testing critical sequence features thatcontribute to protein expression.13–17 However,before the advent of affordable large-scale DNAsynthesis together with gene assembly automationmethodologies,18,19 it has been impractical to exper-imentally test large numbers of synthetic genes. Inan effort to address these issues, we integrated fourapproaches: we undertook a bioinformatic analysisof available prokaryotic genomic sequences toidentify ORF sequence features that potentiallyinfluence protein expression; developed a genesequence design program (ORFOPT) that tunesregional nucleotide composition, codon choice, andmRNA secondary structure to design syntheticsequences that test critical values and combinationsof such features; used gene assembly automation toconstruct synthetic genes; and used coupled in vitrotranscription and translation (TnT) in E. coli extractsto measure their protein expression levels.The genomic frequency distribution of codons that

encode the same amino acid is often very uneven(codon bias) and can differ dramatically betweenorganisms.20 The presence of infrequently usedcodons in an ORF can significantly depress proteinexpression levels9 and may even affect the fidelity oftranslation.21 There has been considerable specula-tion on the origin and biochemical consequences ofcodon bias, including correlation with tRNApopulations22 and metabolic load.20 Codon biasescan affect translational elongation step rates andribosomal movement23 and may provide pausingpoints during protein folding.24 Optimization ofcodon bias is a common strategy to improveheterologous protein expression using syntheticgenes.4 Nevertheless, the importance of codon biasrelative to other factors in optimizing proteinexpression remains unresolved: one recent studysuggests that codon bias does not correlate well withhigh levels of heterologous protein expression in E.coli,7 whereas another identifies it as a criticaldeterminant by controlling the choice of metaboli-cally available aminoacylated tRNAs.8

RNA secondary structure at the 5′ end of an ORFalso has been recognized to be important for proteinexpression, acting through a variety of mechanisms.One effect involves ‘masking’ of ribosome bindingsites by inverted repeats in the mRNA that cover partof the ORF itself.25 A second effect involves mRNAsecondary structures encoded entirely within the 5′end of the ORF, which are likely to hamper loading ofthe mRNA onto the ribosome at initiation oftranslation.26 Bioinformatic analysis indicates thatmRNAsecondary-structure content decreases towardthe 5′ end of genes in several Eubacteria.27 Further-more, it has been suggested that absence of secondarystructure in the 5′ end ofORFs ismore important thancodon choice in determining expression levels.7,28

There is considerable anecdotal evidence that theAU composition within the 5′ end of an ORF mayalso play a role in expression levels.16,23,29–35Expression patterns of computationally designedgenes with elevated 5′ AU composition suggest thatthe impact of this parameter can be profound.16

Nevertheless, the influence of regional nucleotidecomposition in ORFs on protein expression levelsremains poorly characterized.Our bioinformatic analysis of 816 fully sequenced

bacterial genomes revealed that there are tworeadily discernible canonical ORF sequence featureswithin the 5′ and 3′ ends relative to the middleregions of ORFs: AU content is elevated, andsecondary structure is depressed. To test thecontributions of these features together with high-frequency codon usage, we developed a computa-tional algorithm that enables their systematicvariation in synthetic gene sequences. We con-structed 285 synthetic genes distributed over threeproteins differing in species origin, gene length, andprotein fold. We find that the resulting experimentalexpression levels can be calculated from the ORFmRNA sequence by a nonlinear scoring function.This function comprises a set of sigmoidal thresh-olds in which each feature transitions through acritical region below which it fully or partiallyinhibits expression, above which its contributionplateaus, and within which expression levels aresensitive to its quantitative value. This mathematicalmodel provides important insights into the relativecontributions of the three ORF sequence features:protein expression levels are strongly dependent onthe presence of high AU content and low secondarystructure in the N-terminal segment and that codonchoice contributes when favorable.The bioinformatic analysis together with the

experimental characterization of synthetic genesand resulting mathematical model provides aquantitative mapping between ORF mRNA se-quence and protein expression levels. This quanti-tative model has also enabled us to develop apredictive gene design method that has yieldedsynthetic ORF sequences with high levels of proteinexpression for a variety of proteins (M.A. et al.,unpublished results). Most importantly, it showsthat the properties of the 5′ ORF region play acritical role in determining bacterial protein expres-sion levels.

Results

Statistical analysis of bacterial ORFs

We analyzed regional nucleotide composition,mRNA secondary structure, and codon choice inthe 2.5×106 ORFs of 816 fully sequenced bacterialgenomes that span genomic AT contents ranging

Page 3: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

Fig. 1. Genomic averages and variances of regional ORF nucleotide composition, RNA secondary structure, and CAI.All parameters are shown as mean values and variances calculated over all ORFs within a genome. Blue, 5′ ORF region(first 35 bases); red, middle region; green, 3′ ORF region (last 35 bases). Circles indicate the values of these parameterscalculated for E. coli strain K-12 DH10B. (a) Mean ORF regional nucleotide composition is reported as the ratio of thecomposition of that region to that of the genome average. (b) Variances of the mean genomic regional nucleotidecompositions. (c) Mean ORF regional secondary-structure content is reported as the ratio of a region relative to thegenome average. (d) Variances of the mean ORF regional secondary-structure content. (e) Mean regional codonadaptation indices. (f) Variances of the regional genomic CAI values.

907Determinants of Protein Expression in ORFs

from 25% to 83% (Supplementary Table 1). Compu-tational algorithms and definitions of the statisticalmeasures are described in Materials and Methods.The mean AU content of the first and last 35 baseregions within ORFs is significantly higher than thatof the middle section (Fig. 1a). Of the two terminalregions, the 5′ end tends to have a higher AU biasthan the 3′ end. mRNA secondary-structure contentalso shows significant regional differences, with thetwo ends having lower mean structural content andhigher variance than the middle (Fig. 1c and d).Again, the trend is stronger in the 5′ than in the 3′terminus. The trends in these two sequence featuresare present regardless of genomic nucleotide com-position, but the signal is more pronounced in GC-than in AT-rich genomes. The variances of thenucleotide composition and secondary-structurecontent also are much higher at the two terminithan in the middle region (Fig. 1b and d). Suchincreased variance suggests that some aspect of

control might be encoded in these regions: anincreased level of variance in a parameter indicatesthat genes differ from one another in this respect, aswould be expected for features with regulatoryfunctions.Codon choices can be quantified as the codon

adaptation index (CAI), which varies from 0 to 1,reflecting choice of low- or high-frequency codons,respectively.36 Codon frequencies were calculatedover all the ORFs for each genome individually. Themeans and variances of CAI values averaged over agenome exhibit a complex, but well-defined pattern(Fig. 1e), precluding identification of clear canonicalrules governing codon bias by this approach. TheCAI pattern shows some regional variation. Atgenomic AT contents below ∼50%, the 5′ regionalCAI tends to be significantly lower than thatobserved in either the middle or 3′ regions, whichare indistinguishable from one another. The vari-ance of the 5′ regional CAI values is always higher

Page 4: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

Fig. 2. Experimental expression levels of synthetic genes determined using coupled in vitro TnT reactions in E. coliextracts. (a) Synthetic genes were designed by optimizing CAI, mRNA secondary structure, and 5′ ORF regionalnucleotide composition singly or in combination giving a total of seven conditions. For each condition, the expressionpattern of two alleles differing by at least 10 mutations is shown. Three proteins differing in size, structure, origin, andexpression of wild-type ORF sequences were used: asparate aminotransferase (ttAST), fatty acid binding protein(ggFABP), and triose phosphate isomerase (lmTIM). Proteins were purified from coupled in vitro TnT reactions usingimmobilized metal affinity chromatography and run on 4–12% SDS-PAGE gradient gels. Green fluorescent proteintemplate was included as a positive control for protein expression levels and an extract without added DNA as a negativecontrol. Observed expression levels were classified into one of four categories (blue numbers: 0, no band; 1, weak band; 2,medium band; 3, strong band). Full gel images are shown in Fig. S1. The identity of the observed protein band wasverified byMS for each of the three proteins in the first allele of the optimization condition 7 (Fig. S2). (b–d) Time course ofradiolabeled RNA in TnT reactions containing a high-expression (black) and low-expression (gray) level allele(background of a reaction without added DNA was subtracted): ttAST (b), ggFABP (c), and lmTIM (d). (e–g) Totalradiolabeled RNA at 1 h using one allele for each condition presented in (a) and the wild-type sequences: ttAST (e),ggFABP (f), and lmTIM (g).

908 Determinants of Protein Expression in ORFs

Page 5: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

909Determinants of Protein Expression in ORFs

than that of the other two regions (Fig. 1f), againsuggesting the presence of regulatory function inthis region.

Construction and expression patterns ofsynthetic genes in E. coli

Our bioinformatic analysis indicates that regionalAU content and secondary structure at the beginningand end of ORFs are important features that areconserved across bacterial species. However, thisanalysis provides no further information on theirfunctional importance; neither does it provide guid-ance on codon usage. To test experimentally whichfeatures effect protein expression levels and to assesstheir relative contributions, we developed a computeralgorithm (ORFOPT; see Materials and Methods) thatenables us to specify quantitatively all six parameters(AU content in the 5′ and 3′ ORF termini; secondarystructure in 5′, 3′, and middle segments; and the CAIcalculated over the entire ORF) in synthetic genes.Wedesigned and constructed 285 synthetic genes distrib-uted over three test proteins that vary in size,structure, species origin, and heterologous expressionlevels of their wild-type ORF sequences: asparateaminotransferase [∼43 kDa, (αβα)-sandwich, Ther-mus thermophilus, no expression], fatty acid bindingprotein (∼15 kDa, β-clam, Gallus gallus, goodexpression), and triose phosphate isomerase[∼28 kDa, (αβ)8-barrel, Leishmania mexicana, poor-to-no expression]. Genes were assembled from syntheticoligonucleotides by an automated, robust, PCR-mediated gene construction scheme.19 Full-lengthlinear PCR fragments containing synthetic ORFsflanked by invariant, untranslated control regionsencoding a T7 promoter, ribosome binding site, andT7 terminator were tested for protein expressionusing a TnT extract prepared from theE. coliBL21 Starstrain. The in vitro expression approach provides astandard, well-defined set of conditions for proteinexpression, which can correlate with in vivo expres-sion levels.37 Furthermore, the BL21 Star strain lacksthe C-terminal portion of RNase E,38 thereby disen-tangling complexities associated with endonucleolyticmRNA cleavage39 within the variable portion of thedesigned ORF sequences from effects on translation.The contributions and interplay of the six para-

meters are illustrated qualitatively by 42 syntheticalleles (Fig. 2 and Fig. S1). These alleles weredesigned using seven different calculation condi-tions in which values for the six parameters weretargeted individually and in combination; the valuesof parameters that were not explicitly targeted wereleft unconstrained. The resulting seven conditionswere tested in all three proteins. For each condition,in vitro expression levels of two alleles differing by atleast 10 iso-codon changes were evaluated usingCoomassie-stained gels. To distinguish betweeneffects on transcription or translation, we measured

radiolabeled RNA levels in the reactions after 1 h (atwhich point the RNA levels are highest; Fig. 2b–d)and found that these varied by less than 20–30% intwo cases (Fig. 2f and g) and less than threefold inthe third case (Fig. 2e), whereas protein expressionlevels in the Coomassie-stained gels showed muchgreater variation in all three cases (Fig. 2a), suggest-ing that the differences are due to effects ontranslation. The pattern of observations indicatesthat expression levels are most strongly influencedby high AU content in the 5′ region, followed inimportance by low secondary-structure content.Optimization of the CAI by itself is less effectual.Typically, the highest expression levels are obtainedif all three factors are optimized simultaneously.To build a data set that might enable a quantitative

mapping to be established between ORF mRNAsequence and protein expressions, we constructed atotal of 285 synthetic genes (Fig. S1, SupplementaryTable 2). With exception of the CAI, which was notsampled below 0.45, ORF features were sampledover a wide range (Fig. 3, right column). Theexperimental expression levels of all syntheticgeneswere classified into four categories determinedby inspection of band intensities in 4–12% gradientSDS-PAGE gels: zero (no band), one (weak band),two (medium band), and three (strong band). Theintensities of the strongest bands are comparableacross all three proteins, indicating that the catego-ries can be compared directly between differentproteins. The data set samples all four experimentalcategories (Fig. 4b). We found that the experimentalobservations could not be mapped well to theircorresponding ORF mRNA sequences using linearcombinations of the ORF parameters. We thereforechose to represent each of the six parameters as asum of two sigmoidal curves corresponding topenalty (inhibitory) and reward (stimulatory) con-tributions, respectively. This function was parame-terized against the entire 285 synthetic gene data set,using a simulated annealing algorithm with 10,000independent optimization calculations (see Materi-als and Methods). In addition to obtaining anoptimal parameter set, this ensemble provides aMonte Carlo sampling of near-optimal solutions(Fig. 3, left column). The resulting function (Fig. 3,middle column) accurately calculates experimentallyobserved expression categories for 69% of the dataset, with the remainder being calculated to theirclosest neighboring category (Fig. 4a). The highestand lowest expression levels fit the most precisely(91% and 85%, respectively).The distribution of near-optimal solution values

gives an indication of how well the parameters aredetermined (Fig. 3, left column). The 5′ regional AUcontent (Fig. 3a) and secondary-structure depen-dencies (Fig. 3d) are well determined. The effect of 5′regional nucleotide composition is very pronouncedand centered on a critical point at 53–55% AU

Page 6: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

Fig. 3 (legend on next page)

910 Determinants of Protein Expression in ORFs

Page 7: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

Fig. 4. Correlation between observed and calculatedprotein expression levels. (a) Correlation between calcu-lated and observed expression levels. The frequencies arenormalized to 1 within each predicted category. For 69%of the data calculated and observed, expression categoriesare accurately calculated (diagonal). The remainder isusually off by only one expression level category; (b)distribution of observed protein expression levels (0, noexpression; 1, low expression; 2, medium expression; 3,high expression).

911Determinants of Protein Expression in ORFs

content, above which it is strongly stimulatory andbelow which it is equally strongly inhibitory. Lowsecondary-structure content in this region is stimu-latory and does not become strongly inhibitory untilhigh levels are reached. Features in the 3′ region areless well defined in this data set. Nevertheless, AUcomposition has both a reward and penalty con-tributions above and below ∼57%, respectively, buttheir numerical weights remain ill determined(Fig. 3b). By contrast, 3′ regional secondary structuredoes not appear to play a significant role (Fig. 3f).Contributions of secondary structure in the middleregion are fairly well determined (Fig. 3e): a highstructural content has a modest negative impact,whereas near absence of secondary structure ismoderately favorable. The contributions of the CAIare qualitatively well determined, but uncertaintyregarding the precise weight of penalty and rewardremains (Fig. 3c). The CAI can significantly enhanceexpression levels if above ∼0.8; conversely, if below∼0.5, it is inhibitory, although the precise numericalvalue of this latter effect is not well determined byour sampling.The fits for the dependence on regional nucleotide

composition and the CAI behave as thresholds (Fig. 3,middle column), in which these parameters below acritical value are inhibitory (e.g., 5′ nucleotidecomposition), or do not contribute much (e.g., 3′nucleotide composition), and above which theircontributions plateau. The effect of transitioningthrough a critical value can be very abrupt (e.g., 5′nucleotide composition), or gradual (e.g., 5′ second-ary structure). The overall expression level is the sumof all contributions. For strongly contributory para-meters, such as the 5′ regional nucleotide composi-tion, inhibitory threshold effects can dominate, evenwhen other parameters adopt favorable values. Suchthreshold effects and interplay between parametersare demonstrated by a series of alleles in which the 5′regionalAU contentwas systematically varied (Fig. 5)while maintaining the other five parameters close toconstant, slightly favorable values (Fig. 5a–f). For thettAST alleles 52a–53b (Fig. 5g, top) and the lmTIMalleles 34a–35b (Fig. 5g, bottom), there is a suddenincrease in expression levels on transitioning from

Fig. 3. Parameterization of a mathematical function that calculates protein expression levels from ORF sequence. Thefunction is the sum of six pairs of sigmoids representing reward and penalty contributions of 5′ (a) and 3′ (b) ORF regionalAU composition, the ORF CAI (c), 5′ (d), middle (e), and 3′ (f) ORF regional secondary-structure content. The score of eachcomponent ranges [−200,200]; their sum is mapped onto the protein expression category as b−100→0 (no expression),[−100,0]→1 (low), [0,100]→2 (medium), N100→3 (high). Left column: density plot of the distribution of sigmoids in theensemble of near-optimal solutions. False coloring indicates how many sigmoidal curve segments pass through a region(none, magentabbluebgreenbyellowb red, high). These distributions give an indication of the uncertainty in theparameter set. For instance, although there are many solutions for the 3′ ORF regional composition (b), it is clear that allhave a penalty (lower-left quadrant) and reward (upper-right quadrant) with a critical transition centered at ∼56% (redpeak). Middle column: sigmoids of the parameters set that best fit the data (gray area: penalty score values). Right column:distribution of parameters in the experimental data set (note that for the C-terminal segment, there are 29 alleles withsecondary-structure scores b−500, which are not shown).

Page 8: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

Fig. 5. The effect of varying N-terminal AU content in the presence of (near-) constant other parameters. Eight allelesof ttAST (50a–53b; g, top) and 10 alleles of lmTIM (33a–37b; g, bottom) were constructed in which the 5′ regionalcomposition was varied from 31% to 60%AU content (e), while keeping the other five parameters near constant in a rangewhere they have little effect on the predicted expression score (a–d, f). (a)–(f) show the range of values (red rectangles orcircles) of the six parameters for the 18 alleles, mapped on the scoring function parameterized by the optimal global fit (seeFig. 3). (g) shows the expression levels (blue numbers) of the 18 alleles (identity indicated at the top of each lane; seeSupplementary Table 2 and Supplementary Fig. S1) determined in Coomassie-stained gels. The curves above each laneindicate the mapping of the allelic 5′ regional AU content (shown as percentage at the bottom of each lane) onto thescoring function for this parameter (blue line). Mapping of the allelic values is shown for two critical points: red dots, 55%AU content, obtained from the optimal global fit of all the data (see Fig. 3a, middle); green dots, 53% AU content,corresponding to the lower limit observed in the range of near-optimal fits (see Fig. 3a, left). The latter value exhibits aclear threshold transition for these alleles in these two proteins. In addition to illustrating the effect of transitioningthrough a threshold, these results show that the value of the nucleotide composition critical point is not yet determinedprecisely (2% uncertainty).

912 Determinants of Protein Expression in ORFs

Page 9: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

913Determinants of Protein Expression in ORFs

48.6% to 54.3% (compare ttAST alleles 52a,b and 53a,b, lmTIM alleles 34a and 34b, or lmTIM 35a and 35b)as the 5′ regional composition transitions through itscritical threshold. The four alleles lmTIM 36a–37bwith 57.1% to 60.0% AU content illustrate that if thisthreshold is exceeded, there is not much apparentfurther gain in expression.It is difficult to distinguish between changes in

nucleotide composition and RNA secondary struc-ture, because these are inter-related. However,analysis of the ttAST 52a–53b and the lmTIM 33a–34b alleles, which constitute examples of carefullypaired low- and high-expression alleles in which N-terminal composition was changed while minimiz-ing changes in secondary structure as much aspossible, suggests that nucleotide composition, andnot secondary structure, is primarily responsiblefor the control of protein expression levels by the5′ ORF region. The ttAST alleles 52a,b have lowersecondary-structure content (−82.7, −94.8; see Sup-plementary Material), lower AT content (48.6%),and lower expression level than ttAST 53a,b (−103.3,−117.3; 54.3%). Thus, the expression level changesdramatically (1→3) despite a gain, rather thanloss, in secondary structure. The situation for thepaired low-expression (34b,35b) and high-expres-sion (34a,35a) alleles constructed in lmTIM is lessclear, because a large increase in protein expression(1→3) is accompanied by both a transitioningthrough the nucleotide composition critical region(48.6→54.3%) and a slight loss of secondary-structure content (−63.8,−59.2→−45.8,−49.2).However, the magnitude of the loss of secondary-structure score in the lmTIM pair (∼15) is less thanthe gain in the ttAST pair (∼20), suggesting that theeffects are due to changes in nucleotide composition,not secondary structure. This dominance of compo-sitional effects over secondary-structure content isalso reflected by the contributions of these terms inthe expression function, established using the entireset of synthetic alleles (Fig. 3).

Discussion

Our bioinformatic analysis revealed that bacterialgenes have distinct regional trends in nucleotidecomposition and RNA secondary-structure contentwithin the first and last 35 bases of ORFs, which arepresent regardless of genomic nucleotide composi-tion. Our experimental analysis of synthetic genesshows that protein expression levels are stronglydependent on the presence of high AU content andlow secondary structure in the 5′ region and thatchoice of high-frequency codons contributes to alesser extent. Furthermore, we find that the 3′regional AU content makes a modest, but detectablecontribution to protein expression, an effect notpreviously observed.

The size of the synthetic gene data set and therange of values of the parameters enabled us todevelop a mathematical function that calculatesprotein expression levels from ORF sequences. Thenonlinear character of this equation emphasizes thecomplexity of the multifactorial effects, illustrateswhy it is so difficult to unravel these factors, andfurther emphasizes that correlation between proteinand mRNA levels should be approached withcaution.40 Even with the use of computer algo-rithms, we have not been able to obtain manyexamples in which variation in one parameter iscleanly separated from changes in others. We note afew caveats to our approach. First, it is likely that asmore data become available, the parameterization ofthe equation will change. Second, the ORFOPTcomputer algorithm addresses neither messagetranscription levels, nor mRNA lifetimes, norspecific sequence elements that bind factors thataffect protein expression. Finally, the method issilent on aspects of post-translational processing andphysicochemical properties of proteins that affectsolubility and turnover, as these require optimiza-tion of the amino acid sequence.The universe of mRNA iso-coding sequences is

vast, as illustrated by the 10159 variations that canbe encoded by an average ORF in E. coli. It istherefore remarkable that quantitatively predictiverules that map mRNA ORF nucleotide sequence toprotein expression levels can be obtained inrelatively limited experimental explorations of thissequence space. This achievement indicates that theencoding of the factors determining expressionlevels is based on highly degenerate mRNAsequence features, which can be captured mathe-matically to a first approximation. Quantitativemapping between ORF mRNA sequences andprotein expression enables the development ofcomputer programs for the design of syntheticgenes optimized for heterologous protein expres-sion, which is an important goal for biotechnologyand synthetic biology.13–17 The success of ourapproach is illustrated here by the design of well-expressed genes for L. mexicana triosephosphateisomerase and T. thermophilus asparate aminotrans-ferase, of which the wild-type sequences expressbarely, if at all. The ORFOPT program also has beenused successfully to predict synthetic genesequences that express a variety of proteins athigh levels (M.A. et al., unpublished results).The contributions of themRNAsequence features in

the ORF region affect the three phases of translation:initiation, elongation, and termination.41–43 Initiationinvolves the highly regulated loading of mRNAonto the ribosome.26 Similarly, termination is acarefully orchestrated process, involving recogni-tion of the stop codon by release factors.44 Inaddition to its practical application in the design ofheterologous expression systems, the equation that

Page 10: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

914 Determinants of Protein Expression in ORFs

predicts expression levels from sequence featuresmay provide mechanistic insight into factors thatcontrol translation efficiency. Its functional formand the numerical weights of its terms can bequalitatively interpreted in terms of control logicand the importance of individual factors. Depend-ing on its parameterization (σ, μ, W; see Materialsand Methods), a sigmoidal representation encodesdifferent logic, ranging from a switch (very highsigmoidicity), to linear response (no sigmoidicity),to absence of contribution (zero weight). Thepenalty and reward components of the functioncould be interpreted as representing inhibitory orstimulatory effects.With these notions in mind, the empirical function

linking sequence and protein expression level (Fig. 3,middle column) could be interpreted in terms of adescriptive model for control of prokaryotic proteinexpression levels through modulation of basal ribo-somal activity by sequence features within an ORFmRNA. We distinguish between two effects: passive,inhibiting basal ribosomal activity, and active, stim-ulating ribosomal activity through recruitment ofextrinsic factors or activation of intrinsic ribosomalfunction. The parameterization of the function sug-gests that compliance of codon choice with availabletRNA populations (set by availability of tRNAtranscript,22,23 metabolic control of aminoacylation,8

and which in this study is approximated by the CAI)provides passive control: low-frequency codons areinhibitory, and high-frequency codons exhibit only amild stimulatory effect (Fig. 3c). Control presumablyemerges from the effect of the concentration of theavailable tRNAs on the rate of their complexformation with the ribosome. Secondary structurealso exercises largely passive control, with mildstimulatory effects observed only at low levels ofstructure (Fig. 3d–f). These inhibitory effects pre-sumably arise from the need to unwind secondarystructure as the single-stranded mRNA is threadedthrough the ribosome.41–43 By contrast, the effect ofnucleotide composition at the 5′ end (Fig. 3a) and, toa lesser extent, at the 3′ end (Fig. 3b) appears tohave a strong active component, suggestive ofrecruitment of an extrinsic or intrinsic stimulatoryfunction.The critical contribution of the 5′ ORF end has

been noted also by others.16,23,29–35 The inter-relationships between nucleotide composition, sec-ondary structure, and codon compliance remainopen to debate and are challenging to dissect. Weand others28 find that elevated secondary structurein this region tends to diminish protein expressionbut that this effect does not dominate. It has beenproposed that choice of low-compliance codons inthis region is a dominant effect universally con-served in all domains of life.23 The hypothesis is thatregional low-compliance codons locally slow ribo-some progress, thereby regulating downstream

traffic and preventing downstream abortive trans-lation events arising frommultiple stalled ribosomesor inter-ribosomal collision. However, regionalcodon non-compliance and regional nucleotidecomposition are inter-related, because one affectsthe other. If regional non-compliance is the domi-nant cause, and nucleotide composition a side effect,we should observe regional non-compliance inde-pendent of genomic composition, which, in turn,should result in AT-rich 5′ regions in GC-richorganisms and GC-rich 5′ regions in AT-richorganisms. We observe a different pattern, however(Fig. 1e): regional non-compliance is observed onlyin GC-rich organisms (up to ∼50% genomic ATcontent), whereas elevated regional AT can bedetected even within AT-rich organisms. This sug-gests strongly that it is nucleotide composition that isthe dominant factor and codon non-compliance thatis a side effect. The latter is observed only in GC-richorganisms where maintenance of elevated ATcontent skews codon choice to non-compliance. Wenote, however, that this conclusion is based on theuse of the CAI, which is only a crude estimate ofcodon compliance; a definitive analysis requires theuse of the tRNA adaptation index used in otherstudies22,23 (work in progress).If nucleotide composition is the primary determi-

nant of the stimulatory properties of the 5′ORF and,to a lesser extent, the 3′ ORF regions, whatmolecular mechanism(s) could account for thiseffect? As we noted above, the nucleotide composi-tion contributions appear to have the hallmarks ofan active rather than passive effect and, therefore,are likely to arise from recruitment of an extrinsicfactor or stimulation of intrinsic ribosomal function.One possible mechanism could involve binding ofRNA helicases that catalyze the unfolding ofsecondary structure. Nucleotide composition couldencode semi-specific recognition by such a helicaseactivity, giving rise to threshold effects throughbinding events. For instance, the DEAD-box proteinsuperfamily includes prokaryotic RNA helicasesthat recognize AU-rich sequences45 and, in E. coli,are involved in the RNA degradosome39 or ribo-somal RNA maturation.46 DEAD-box helicases playa role in eukaryotic initiation of translation, but nosuch function has been reported in prokaryotes.45

The ribosome itself contains an mRNA helicaseactivity that acts on a position within 11 bases fromthe codon that is being read.47 We hypothesize thatlocal nucleotide composition could influence thisactivity. If this is the case, the patterns of nucleotidecomposition variance observed in all prokaryotesreflect increased ribosomal helicase activity at thebeginning and end of ORFs, respectively, to enhancepost-initiation threading of the mRNA into theribosome and access of release factors at termina-tion. The involvement of a composition-sensitivehelicase activity (extrinsic or intrinsic) also could

Page 11: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

915Determinants of Protein Expression in ORFs

account for the difficulty in separating out contribu-tions from nucleotide composition and RNA sec-ondary structure. Both factors contribute tosubstrate recognition, and the latter additionallydetermines catalytic efficiency.Translation and mRNA unwinding are tightly

coupled in the elongation phase,48 accounting forthe observed boost in expression levels observed atlow secondary-structure content for the middlesegment. The lack of penalty at intermediatestructural content again may reflect RNA helicaseactivity. The CAI also affects the elongation phase asit tends to reflect the relative concentrations ofavailable tRNA pool.20 In our TnT reactions, thetRNA concentrations relative to each other areprobably similar to in vivo ratios, because a total E.coli tRNA extract is added to the reactions, but theirabsolute concentration is higher than in vivo levels.Even so, we find that favorable CAI values influenceprotein expression levels, but that this effect dropsoff at values below ∼0.8. The lowering of theregional CAI in the N-terminal region of GC-richbacteria presumably reflects the dominance of AUcontent over CAI at initiation, because in thosegenomes, AU-rich codons are less frequent andtherefore have a lower CAI.Regardless of its detailed mechanistic origin, the

sequence of the 5′ORF region plays a dominant rolein determining prokaryotic protein expressionlevels.16,23,29–35 This and other studies23,28 haveproposed different hypotheses for the underlyingmolecular mechanism(s) of this effect, many ofwhich are open to direct experimental testing. Welook forward to learning the answer to thisimportant riddle.

Materials and Methods

Bioinformatics

Annotated sequence files for 816 complete bacterialgenomes were downloaded†. Custom software was devel-oped to calculate nucleotide composition, RNA secondarystructure, and codonadaptation indices. Limits for theORFswere taken from the annotations in the genome sequencefiles. Regional nucleotide composition is calculated as theratio of A andT content in the region relative to the segmentlength. RNA secondary structure is representedby a scoringfunction based on inverted repeats in a region, weighted bytheir base-pairing character, and loop or bulge sizes. Thisscore was calculated by centering a recursive search ofmaximally sized inverted repeats within 100 bp from agiven nucleotide; duplexes are allowed to contain 10-bpbulges, 10-bp gaps, and amaximum 30-base loop. The scorefor such a maximally sized inverted repeat is determinedusing stem–loop base-pairing energies49,50 and assigned to

†http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

each nucleotide encompassed within the calculated stem–loop. The secondary structure score of a region is thesummed score for all stem–loops assigned to each base,normalized by the region length. Codon adaptationindices36 are calculated according to codon tables cons-tructed independently for each genome as the frequencydistribution of codons in that genome. The CAI value for anORF or region is the geometric mean of all the codons insuch segments. Arithmetic means and standard deviationsfor genomic CAI values are calculated over the CAI valuesdetermined for each ORF in a genome. The number ofpossible iso-coding sequences for anORF is calculated as thesum of the logarithm of the number of possible codons ateach position. The mean genomic iso-coding sequencediversity is the arithmetic mean of the diversities of all theORFs.

Computational design of synthetic ORF sequences

A simulated annealing algorithm was used to minimizean objective function capturing sequence features ofinterest within the available degrees of freedom in theith trial Ei =

P vwrvr +P swrsr + cwc, where νwr and

swrare the relative weights assigned regional nucleotidecomposition (νr) and RNA secondary structure (sr),respectively (r: 5′, middle, 3′), and cw and c are the ORFCAI weight and value, respectively. For minimizations inwhich only subsets of parameters were optimized, theweights for the unconstrained parameters were set to zero.Two types of calculations were run: absolute minimizationof the objective function or achievement of a target value.For target value optimization, the objective function was

modified to Ei =P6i=1

wi pi− tið Þ2, where it represents weights,

pi denotes the parameters, and ti represents the targetvalues. Sequence trial configurations were generated byrandomly choosing two iso-codons per trial. At thebeginning of the simulation, a cutoff of 0.45 was appliedto remove low-frequency codons (this is the reason thatsynthetic ORFs with low CAI values were not sampled).Alleles were generated by maintaining a dynamic list of∼100 sequences differing by at least 10 mutations from thecurrent best sequence and each other. A dynamic coolingschedule was used to drive the simulated annealingprogress: to determine whether the ith trial is acceptable,ΔEi=Ei−Ei−1 was calculated and i was accepted either ifΔEi≤0 or in a Boltzmanndecision if pib e

−ΔEi/T, where pi is arandom number [0,1] and T is a control parameter. After1000 trials, the acceptance rate r was assessed and T waschanged if rN0.250, Tn+1=0.8Tn or rb0.225, Tn+ 1=1.3Tn.Final outcomes of such minimizations are critically

dependent on the choice of weights. We used twoapproaches to address this problem. In one method, weassigned weights empirically in a successive number oftrial-and-error calculations. In a second method, wedeveloped a new Boltzmann decision scheme thatcircumvents the issue of weights and enables parameterswith different numerical magnitudes to be combined. Inthis method, we used three independent Boltzmanndecisions for each of the parameter classes, respectively,resulting in a ‘vote’ νi=

νβi+sβi+

cβi where βi={1,0} andcaptures the outcome of a Boltzmann decision for a given

Page 12: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

916 Determinants of Protein Expression in ORFs

parameter class (ν, regional nucleotide composition; c,codon preference; s, regional secondary structure). Unan-imous votes (νi=3) are always accepted; majority votes(νi=2) are accepted half the time, and minority rule (νi=1)is accepted only if the overall acceptance rate has droppedbelow a 5% threshold value. Between 2 and 20 runs wereexecuted in parallel on a Beowulf cluster and merged toconstruct the final set of alleles. All synthetic ORFs wereoptimized within the context of the invariant 5′ and 3′untranslated regions. The resulting sequences were fedinto an automated experimental gene assembly pipeline(see below).

Parameterization of a function that predicts proteinexpression levels from ORF sequence

We developed a function in which an expression scoreis given as a sum of a series of thresholds applied to thecomposition, structure, and codon usage values of asequence E=τν(5′)+τν(3′)+τs(5′)+τs (middle)+τs(3′)+τc,where τν and τs are regional thresholds of nucleotidecomposition and structure, respectively, and τc is the CAIthreshold. A given threshold score for a feature value x isthe sum of two sigmoids H = Wp 1 + e−3 x−lpð Þ=rp� �−1

+Wr 1 + e−3 x−lrð Þ=rr� �−1

where p and r denote parametersfor penalty and reward phases, respectively, W representsweight, μ denotes midpoint, and σ represents sigmoidicityof each curve. The final value of the scoring function is thesum of all six components and is mapped onto theexpression level categories as ≤−100→0 (no expression),[−100,0]→1 (weak expression), [0,100]→2 (medium ex-pression), and N100→3 (high expression). Parameters werefit as a minimization of the sum of the absolute differencesbetween observed and calculated expression categoriesusing a simulated annealing algorithm.

Oligonucleotide synthesis and synthetic geneassembly

Full-length genes (0.65–1.40 kb) encoding a syntheticORF flanked by 5′ (122 or 131 bp) and 3′ (103 or 112 bp)regulatory regions were assembled from oligonucleotides(80–100 bases) synthesized in-house (Mermade 192 DNASynthesizer, BioAutomation) using an automated PCR-mediated gene assembly procedure.19,51 Full-length pro-ducts were verified by agarose gel electrophoresis andreamplified with biotinylated flanking primers thatprovide some protection against endogenous exonucleaseactivity in the subsequent TnT reaction. Oligonucleotidesynthesis and ORF assembly are detailed in the Supple-mentary Information.

‡http://www.genome.duke.edu/cores/proteomics/sample-preparation/

In vitro coupled TnT reactions

We used a TnT system based on the PANOxSP E. coliS30 lysate system.52,53 Lysate was prepared from BL21Star (DE3) E. coli cells (Invitrogen) grown to mid-log phasein shaking culture flasks, rinsed of medium, flash-frozen,thawed, lysed in a French press, centrifuged to removecellular debris, and incubated to facilitate a ‘run-off’ of any

mRNA still bound to ribosomal complexes. The lysate wasdialyzed, centrifuged to remove precipitate, and stored inflash-frozen aliquots. Reactions were initiated by addingbiotinylated linear PCR template (1 μg per 100 μl reaction)to the lysate with magnesium glutamate, ammoniumglutamate, potassium glutamate, ribonucleotide tripho-sphates, folinic acid, total E. coli tRNAs, amino acids,phosphoenolpyruvate, nicotinamide adenine dinucleo-tide, coenzyme A, oxalic acid, putrescine, spermidine,and rifampicin. The components were mixed gently byrepeated pipetting and incubated for 5 h at 30 °C, 500 rpm,in an RTS ProteoMaster (Roche), in reaction tubes sealedwith Air Pore membrane (Qiagen). After incubation,expressed protein was purified by affinity chromatogra-phy (see below). Additional details can be found in theSupplementary Information.

Purification of proteins encoded by synthetic genes

All proteins were constructed with a C-terminalhexahistidine fusion and purified using EZview RedHIS-Select HC Nickel Affinity Gel (Sigma Aldrich). Asuspended gel slurry of 50 μl was washed with 1 mlloading buffer (20 mM Mops, 7.5 mM imidazole, and500 mM NaCl, pH 7.5). Completed TnT reactions (25, 50,or 100 μl) were combined with the affinity gel and 1 mlloading buffer, captured at 4 °C for 1 h rotating end-over-end in a Mini LabRoller (Labnet International), washedwith loading buffer (1 ml twice), and eluted with 100 μlelution buffer (20 mM Mops, 400 mM imidazole, and500 mM NaCl, pH 7.5) incubating for 30 min at 4 °C(rotating end-over-end). Each sample was concentratedusing Vivaspin 500 centrifugal concentrators (SartoriusStedim, 5 kDa molecular mass cutoff; pre-incubated with2 mg/ml bovine serum albumin in phosphate-bufferedsaline buffer for 12–16 h at 4 °C, and washed with waterbefore use) by centrifugation at 13,500g for 10–15 min. Theentire final volume (∼25 μl) was loaded onto one lane ofan SDS-PAGE gradient gel (NuPAGE 4–12% Bis-Tris,Invitrogen); the gel was stained with GelCode Blue StainReagent (Thermo Fisher Scientific). Polyhistidine-taggedGFP template was included in each experiment as apositive expression and purification control; a reactionwithout added DNA template was used as a negativecontrol. The seven condition experiments (Fig. 2 and Fig.S1) for all three scaffolds were tested in at least threeindependent experiments and the majority of the othersynthetic genes were tested in at least two independentexperiments.

Protein identification by mass spectrometry

Liquid chromatography (LC)–tandem mass spectrome-try (MS/MS) was used to confirm protein identity byanalysis of peptides generated from in-gel tryptic digests.Samples were prepared according to the in-gel digestionprotocol‡. Approximately half of the sample from each gelband was analyzed on a nanoAcquity LC and SynaptHDMS system (Waters Corporation) using a 30-min LC

Page 13: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

917Determinants of Protein Expression in ORFs

gradient, with the top three precursor ions from each MSscan selected for MS/MS sequencing. Raw data wereprocessed using Mascot Distiller v2.0 and searchedagainst the Swiss-Prot database (v57.11) using Mascotv2.2 (Matrix Sciences), allowing for fixed modification ofCys (carbamidomethylation) and variable modification ofMet (oxidation). Scaffold (v2.6) software was used toanalyze the data. Sequence coverage obtained from thisanalysis for each of the proteins is shown in Fig. S2.

Determination of mRNA levels

mRNA levelswere determined by addition of 10μCi ofα-labeled rATP (Perkin Elmer) to a TnT reaction. Aliquots(10 μl) were removed and mixed with 100 μl Trizol(Invitrogen), incubated (5 min, room temperature), fol-lowed by addition of 20 μl chloroform (3 min at ambienttemperature), vortexing (15 s), centrifugation at 12,000g toseparate phases (15 min), and aspiration of the RNA-containing aqueous phase, which was subsequently passedthrough a NucAway Spin Column (Applied Biosystems) toremove unincorporated label. The resulting ∼50-μl eluatewas mixed with 200 μl of OptiPhase SuperMix scintillationcocktail (Perkin Elmer) and label incorporation wasmeasured in aMicroBeta Trilux scintillation counter (PerkinElmer). To determine an optimal assay time point, weconstructed a time course for representative poorly andhighly expressing DNA templates for each of the threeproteins.Near-maximal label incorporationwas observed at1 h; this time point was used subsequently to characterizethe RNA levels of the alleles.

Accession numbers

Genbank accession numbers for the genes constructed inthis experiment are provided in Supplementary Table II.Supplementary materials related to this article can be

found online at doi:10.1016/j.jmb.2010.08.010.

Acknowledgements

We thank J. Will Thompson and the DukeUniversity Proteomics Core Facility for proteinidentification by MS and Philippe Marguet, CurtisLayton, and Chris Nicchitta for critical reading ofthe manuscript. This work was supported by theNIH Director’s Pioneer Award (5DPI OD000122).

Conflict of Interest. The authors declare thatthey have no conflict of interest.

References

1. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. &Weissman, J. S. (2009). Genome-wide analysis in vivoof translation with nucleotide resolution using ribo-some profiling. Science, 324, 218–223.

2. Carothers, J. M., Goler, J. A. & Keasling, J. D. (2009).Chemical synthesis using synthetic biology. Curr.Opin. Biotechnol. 20, 498–503.

3. Andrianantoandro, E., Basu, S., Karig, D. K. & Weiss,R. (2006). Synthetic biology: new engineering rules foran emerging discipline. Mol. Syst. Biol. 2, 2006.0028.

4. Jana, S. & Deb, J. K. (2005). Strategies for efficientproduction of heterologous proteins in Escherichia coli.Appl. Microbiol. Biotechnol. 67, 289–298.

5. Winkler, W. C. & Breaker, R. R. (2005). Regulation ofbacterial gene expression by riboswitches. Annu. Rev.Microbiol. 59, 487–517.

6. Osada, Y., Saito, R. & Tomita, M. (1999). Analysis ofbase-pairing potentials between 16S rRNA and 5′UTRfor translation initiation in various prokaryotes.Bioinformatics, 15, 578–581.

7. Kudla, G., Murray, A. W., Tollervey, D. & Plotkin, J. B.(2009). Coding-sequence determinants of gene expres-sion in Escherichia coli. Science, 324, 255–258.

8. Welch, M., Govindarajan, S., Ness, J. E., Villalobos, A.,Gurney, A., Minshull, J. & Gustafsson, C. (2009).Design parameters to control synthetic gene expres-sion in Escherichia coli. PLoS ONE, 4, e7002.

9. Welch, M., Villalobos, A., Gustafsson, C. & Minshull,J. (2009). You're one in a googol: optimizing genes forprotein expression. J.R. Soc. Interface, 6, S467–S476.

10. Mattheakis, L., Vu, L., Sor, F. & Nomura, M. (1989).Retroregulation of the synthesis of ribosomal proteinsL14 and L24 by feedback repressor S8 in Escherichiacoli. Proc. Natl Acad. Sci. USA, 86, 448–452.

11. Jenner, L., Romby, P., Rees, B., Schulze-Briese, C.,Springer, M., Ehresmann, C. et al. (2005). Translationaloperator of mRNA on the ribosome: how repressorproteins exclude ribosomebinding.Science, 308, 120–123.

12. Eddy, S. R. (2001). Non-coding RNA genes and themodern RNA world. Nat. Rev., Genet. 2, 919–929.

13. Puigbo, P., Guzman, E., Romeu, A. & Garcia-Vallve, S.(2007). OPTIMIZER: a web server for optimizing thecodon usage of DNA sequences. Nucleic Acids Res. 35,W126–W131.

14. Wu, G., Dress, L. & Freeland, S. J. (2007). Optimalencoding rules for synthetic genes: the need for acommunity effort. Mol. Syst. Biol. 3, 134.

15. Lorimer, D., Raymond, A., Walchli, J., Mixon, M.,Barrow, A., Wallace, E. et al. (2009). Gene composer:database software for protein construct design, codonengineering, and gene synthesis. BMC Biotechnol. 9, 36.

16. Voges, D., Watzele, M., Nemetz, C., Wizemann, S. &Buchberger, B. (2004). Analyzing and enhancingmRNA translational efficiency in an Escherichia coliin vitro expression system. Biochem. Biophys. Res.Commun. 318, 601–614.

17. Puigbo, P., Romeu, A. & Garcia-Vallve, S. (2008).HEG-DB: a database of predicted highly expressedgenes in prokaryotic complete genomes under trans-lational selection. Nucleic Acids Res. 36, D524–527.

18. Czar, M. J., Anderson, J. C., Bader, J. S. & Peccoud, J.(2009). Gene synthesis demystified. Trends Biotechnol.27, 63–72.

19. Cox, J. C., Lape, J., Sayed, M. A. & Hellinga, H. W.(2007). Protein fabrication automation. Protein Sci. 16,379–390.

20. Hershberg, R. & Petrov, D. A. (2008). Selection oncodon bias. Annu. Rev. Genet. 42, 287–299.

Page 14: Allert2010-Multifactorial_determinants_of_protein_expression_in_prokaryotic_open_reading_frames

918 Determinants of Protein Expression in ORFs

21. Calderone, T. L., Stevens, R. D. & Oas, T. G. (1996).High-level misincorporation of lysine for arginine atAGA codons in a fusion protein expressed inEscherichia coli. J. Mol. Biol. 262, 407–412.

22. dos Reis, M., Savva, R. & Wernisch, L. (2004). Solvingthe riddle of codon usage preferences: a test fortranslational selection.Nucleic Acids Res. 32, 5036–5044.

23. Tuller, T., Carmi, A., Vestsigian, K., Navon, S., Dorfan,Y., Zaborske, J. et al. (2010). An evolutionarilyconserved mechanism for controlling the efficiencyof protein translation. Cell, 141, 344–354.

24. Komar, A. A. (2009). A pause for thought along the co-translational folding pathway. Trends Biochem. Sci. 34,16–24.

25. Kozak, M. (2005). Regulation of translation via mRNAstructure in prokaryotes and eukaryotes. Gene, 361,13–37.

26. Marintchev, A. & Wagner, G. (2004). Translationinitiation: structures, mechanisms and evolution.Q. Rev. Biophys. 37, 197–284.

27. Katz, L. & Burge, C. B. (2003). Widespread selectionfor local RNA secondary structure in coding regionsof bacterial genes. Genome Res. 13, 2042–2051.

28. Tuller, T., Waldman, Y. Y., Kupiec, M. & Ruppin, E.(2010). Translation efficiency is determined by bothcodon bias and folding energy. Proc. Natl Acad. Sci.USA, 107, 3645–3650.

29. Etchegaray, J. P. & Inouye, M. (1999). Translationalenhancement by an element downstream of theinitiation codon in Escherichia coli. J. Biol. Chem. 274,10079–10085.

30. Qing, G., Xia, B. & Inouye, M. (2003). Enhancement oftranslation initiation by A/T-rich sequences down-stream of the initiation codon in Escherichia coli. J. Mol.Microbiol. Biotechnol. 6, 133–144.

31. Moll, I., Huber, M., Grill, S., Sairafi, P., Mueller, F.,Brimacombe, R. et al. (2001). Evidence against aninteraction between the mRNA downstream box and16S rRNA in translation initiation. J. Bacteriol. 183,3499–3505.

32. Sprengart, M. L., Fuchs, E. & Porter, A. G. (1996). Thedownstream box: an efficient and independent trans-lation initiation signal in Escherichia coli. EMBO J. 15,665–674.

33. Rush, G. J. & Steyn, L. M. (2005). Translationenhancement by optimized downstream boxsequences in Escherichia coli and Mycobacteriumsmegmatis. Biotechnol. Lett. 27, 173–179.

34. Zhang, X., Guo, P. & Jing, G. (2003). A vector with thedownstream box of the initiation codon can highlyenhance protein expression inEscherichia coli.Biotechnol.Lett. 25, 755–760.

35. Keum, J. W., Ahn, J. H., Choi, C. Y., Lee, K. H., Kwon,Y. C. & Kim, D. M. (2006). The presence of a commondownstream box enables the simultaneous expressionof multiple proteins in an E. coli extract. Biochem.Biophys. Res. Commun. 350, 562–567.

36. Sharp, P. M. & Li, W. H. (1987). The codon adaptationindex—a measure of directional synonymous codonusage bias, and its potential applications.Nucleic AcidsRes. 15, 1281–1295.

37. Coleman, M. A., Lao, V. H., Segelke, B. W. & Beernink,P. T. (2004). High-throughput, fluorescence-basedscreening for soluble protein expression. J. ProteomeRes. 3, 1024–1032.

38. Lopez, P. J., Marchand, I., Joyce, S. A. & Dreyfus, M.(1999). The C-terminal half of RNase E, whichorganizes the Escherichia coli degradosome, partici-pates in mRNA degradation but not rRNA processingin vivo. Mol. Microbiol. 33, 188–199.

39. Carpousis, A. J. (2007). The RNA degradosome ofEscherichia coli: an mRNA-degrading machine assem-bled on RNase E. Annu. Rev. Microbiol. 61, 71–87.

40. de Sousa Abreu, R., Penalva, L. O., Marcotte, E. M.& Vogel, C. (2009). Global signatures of proteinand mRNA expression levels. Mol. Biosyst. 5,1512–1526.

41. Ramakrishnan, V. (2002). Ribosome structure and themechanism of translation. Cell, 108, 557–572.

42. Bashan, A. & Yonath, A. (2008). Correlating ribosomefunction with high-resolution structures. TrendsMicrobiol. 16, 326–335.

43. Steitz, T. A. (2008). A structural understanding of thedynamic ribosome machine.Nat. Rev., Mol. Cell Biol. 9,242–253.

44. Petry, S., Weixlbaumer, A. & Ramakrishnan, V. (2008).The termination of translation. Curr. Opin. Struct. Biol.18, 70–77.

45. Rocak, S. & Linder, P. (2004). DEAD-box proteins: thedriving forces behind RNA metabolism. Nat. Rev.,Mol. Cell Biol. 5, 232–241.

46. Iost, I. & Dreyfus, M. (2006). DEAD-box RNAhelicases in Escherichia coli. Nucleic Acids Res. 34,4189–4197.

47. Takyar, S., Hickerson, R. P. & Noller, H. F. (2005).mRNA helicase activity of the ribosome. Cell, 120,49–58.

48. Wen, J. D., Lancaster, L., Hodges, C., Zeri, A. C.,Yoshimura, S. H., Noller, H. F. et al. (2008). Followingtranslation by single ribosomes one codon at a time.Nature, 452, 598–603.

49. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N.,Caruthers, M. H., Neilson, T. & Turner, D. H. (1986).Improved free-energy parameters for predictions ofRNA duplex stability. Proc. Natl Acad. Sci. USA, 83,9373–9377.

50. Jaeger, J. A., Turner, D. H. & Zuker, M. (1989).Improved predictions of secondary structures forRNA. Proc. Natl Acad. Sci. USA, 86, 7706–7710.

51. Gao, X., Yo, P., Keith, A., Ragan, T. J. & Harris, T. K.(2003). Thermodynamically balanced inside-out(TBIO) PCR-based gene synthesis: a novel method ofprimer design for high-fidelity assembly of longergene sequences. Nucleic Acids Res. 31, e143.

52. Jewett, M. C. & Swartz, J. R. (2004). Substratereplenishment extends protein synthesis with an invitro translation system designed to mimic thecytoplasm. Biotechnol. Bioeng. 87, 465–472.

53. Jewett, M. C. & Swartz, J. R. (2004). Mimicking theEscherichia coli cytoplasmic environment activateslong-lived and efficient cell-free protein synthesis.Biotechnol. Bioeng. 86, 19–26.