journal of structural biology - cesco bioproductsthe novel structural information generated can then...

13
The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium Rong Xiao, Stephen Anderson, James Aramini, Rachel Belote, William A. Buchwald, Colleen Ciccosanti, Ken Conover, John K. Everett, Keith Hamilton, Yuanpeng Janet Huang, Haleema Janjua, Mei Jiang, Gregory J. Kornhaber, Dong Yup Lee, Jessica Y. Locke, Li-Chung Ma, Melissa Maglaqui, Lei Mao, Saheli Mitra, Dayaban Patel, Paolo Rossi, Seema Sahdev, Seema Sharma, Ritu Shastry, G.V.T. Swapna, Saichu N. Tong, Dongyan Wang, Huang Wang, Li Zhao, Gaetano T. Montelione * , Thomas B. Acton ** Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Robert Wood Johnson Medical School, and Northeast Structural Genomics Consortium, 679 Hoes Lane, Piscataway, NJ 08854, United States article info Article history: Received 4 April 2010 Received in revised form 24 July 2010 Accepted 28 July 2010 Available online 3 August 2010 Keywords: Structural genomics High-throughput protein production Construct optimization Disorder prediction Ligation-independent cloning Multiple Displacement Amplification Laboratory Information Management System Protein Structure Initiative NMR X-ray crystallography T7 Escherichia coli expression system Wheat germ cell-free NMR microprobe screening Parallel protein purification 6X-His tag HDX-MS abstract We describe the core Protein Production Platform of the Northeast Structural Genomics Consortium (NESG) and outline the strategies used for producing high-quality protein samples. The platform is cen- tered on the cloning, expression and purification of 6X-His-tagged proteins using T7-based Escherichia coli systems. The 6X-His tag allows for similar purification procedures for most targets and implementa- tion of high-throughput (HTP) parallel methods. In most cases, the 6X-His-tagged proteins are sufficiently purified (>97% homogeneity) using a HTP two-step purification protocol for most structural studies. Using this platform, the open reading frames of over 16,000 different targeted proteins (or domains) have been cloned as >26,000 constructs. Over the past 10 years, more than 16,000 of these expressed protein, and more than 4400 proteins (or domains) have been purified to homogeneity in tens of milligram quan- tities (see Summary Statistics, http://nesg.org/statistics.html). Using these samples, the NESG has depos- ited more than 900 new protein structures to the Protein Data Bank (PDB). The methods described here are effective in producing eukaryotic and prokaryotic protein samples in E. coli. This paper summarizes some of the updates made to the protein production pipeline in the last 5 years, corresponding to phase 2 of the NIGMS Protein Structure Initiative (PSI-2) project. The NESG Protein Production Platform is suit- able for implementation in a large individual laboratory or by a small group of collaborating investigators. These advanced automated and/or parallel cloning, expression, purification, and biophysical screening technologies are of broad value to the structural biology, functional proteomics, and structural genomics communities. Ó 2010 Elsevier Inc. All rights reserved. 1. Introduction The Northeast Structural Genomics Consortium (NESG) 1 project (http://www.nesg.org), is one of the four National Institutes of Health (NIH)-funded structural genomics Large Scale Centers (LSC) of the National Institute of General Medical Sciences (NIGMS) Protein Structure Initiative (PSI). The primary goal of these struc- ture production centers is to determine the three-dimensional 1047-8477/$ - see front matter Ó 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jsb.2010.07.011 * Corresponding author. ** Corresponding author. E-mail addresses: [email protected] (G.T. Montelione), [email protected] (T.B. Acton). 1 Abbreviations used: 6X-His, hexa-histidine polypeptide sequence tag; 3D, three-dimensional; HCPIN, Human Cancer Pathway Protein Interaction Network; HDX-MS, amide hydrogen deuterium exchange with mass spectrometry detection; HMM, hidden Markov model; HTP, high throughput; LIC, ligation independent cloning; LSC, Large Scale Centers; MALDI-TOF, matrix-assisted laser-desorption-induced time-of-flight; MCS, multiple cloning site; MDA, Multiple Displacement Amplification; MMLV, Moloney Mouse Leukemia Virus; NESG, Northeast Structural Genomics Consortium; NIGMS, National Institute of General Medical Sciences; NMR, nuclear magnetic resonance spectroscopy; PCR, polymerase chain reaction; PDB, Protein Data Bank; PLIMS, Protein Laboratory Information Management System; PSI-2, Protein Structure Initiative-2; SDS–PAGE, sodium dodecylsulfate–polyacrylamide electrophoresis; WGA, Whole Genome Amplification. Journal of Structural Biology 172 (2010) 21–33 Contents lists available at ScienceDirect Journal of Structural Biology journal homepage: www.elsevier.com/locate/yjsbi

Upload: others

Post on 27-Mar-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

Journal of Structural Biology 172 (2010) 21–33

Contents lists available at ScienceDirect

Journal of Structural Biology

journal homepage: www.elsevier .com/locate /y jsbi

The high-throughput protein sample production platform of the NortheastStructural Genomics Consortium

Rong Xiao, Stephen Anderson, James Aramini, Rachel Belote, William A. Buchwald, Colleen Ciccosanti,Ken Conover, John K. Everett, Keith Hamilton, Yuanpeng Janet Huang, Haleema Janjua, Mei Jiang,Gregory J. Kornhaber, Dong Yup Lee, Jessica Y. Locke, Li-Chung Ma, Melissa Maglaqui, Lei Mao, Saheli Mitra,Dayaban Patel, Paolo Rossi, Seema Sahdev, Seema Sharma, Ritu Shastry, G.V.T. Swapna, Saichu N. Tong,Dongyan Wang, Huang Wang, Li Zhao, Gaetano T. Montelione *, Thomas B. Acton **

Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Robert Wood JohnsonMedical School, and Northeast Structural Genomics Consortium, 679 Hoes Lane, Piscataway, NJ 08854, United States

a r t i c l e i n f o a b s t r a c t

Article history:Received 4 April 2010Received in revised form 24 July 2010Accepted 28 July 2010Available online 3 August 2010

Keywords:Structural genomicsHigh-throughput protein productionConstruct optimizationDisorder predictionLigation-independent cloningMultiple Displacement AmplificationLaboratory Information ManagementSystemProtein Structure InitiativeNMRX-ray crystallographyT7 Escherichia coli expression systemWheat germ cell-freeNMR microprobe screeningParallel protein purification6X-His tagHDX-MS

1047-8477/$ - see front matter � 2010 Elsevier Inc. Adoi:10.1016/j.jsb.2010.07.011

* Corresponding author.** Corresponding author.

E-mail addresses: [email protected] (G.T. Mont1 Abbreviations used: 6X-His, hexa-histidine polypep

hydrogen deuterium exchange with mass spectrometCenters; MALDI-TOF, matrix-assisted laser-desorption-Leukemia Virus; NESG, Northeast Structural Genomics Cpolymerase chain reaction; PDB, Protein Data Bank;dodecylsulfate–polyacrylamide electrophoresis; WGA,

We describe the core Protein Production Platform of the Northeast Structural Genomics Consortium(NESG) and outline the strategies used for producing high-quality protein samples. The platform is cen-tered on the cloning, expression and purification of 6X-His-tagged proteins using T7-based Escherichiacoli systems. The 6X-His tag allows for similar purification procedures for most targets and implementa-tion of high-throughput (HTP) parallel methods. In most cases, the 6X-His-tagged proteins are sufficientlypurified (>97% homogeneity) using a HTP two-step purification protocol for most structural studies.Using this platform, the open reading frames of over 16,000 different targeted proteins (or domains) havebeen cloned as >26,000 constructs. Over the past 10 years, more than 16,000 of these expressed protein,and more than 4400 proteins (or domains) have been purified to homogeneity in tens of milligram quan-tities (see Summary Statistics, http://nesg.org/statistics.html). Using these samples, the NESG has depos-ited more than 900 new protein structures to the Protein Data Bank (PDB). The methods described hereare effective in producing eukaryotic and prokaryotic protein samples in E. coli. This paper summarizessome of the updates made to the protein production pipeline in the last 5 years, corresponding to phase2 of the NIGMS Protein Structure Initiative (PSI-2) project. The NESG Protein Production Platform is suit-able for implementation in a large individual laboratory or by a small group of collaborating investigators.These advanced automated and/or parallel cloning, expression, purification, and biophysical screeningtechnologies are of broad value to the structural biology, functional proteomics, and structural genomicscommunities.

� 2010 Elsevier Inc. All rights reserved.

1. Introduction

The Northeast Structural Genomics Consortium (NESG)1 project(http://www.nesg.org), is one of the four National Institutes of

ll rights reserved.

elione), [email protected] (tide sequence tag; 3D, three-dimenry detection; HMM, hidden Markoinduced time-of-flight; MCS, multiponsortium; NIGMS, National Institu

PLIMS, Protein Laboratory InformaWhole Genome Amplification.

Health (NIH)-funded structural genomics Large Scale Centers(LSC) of the National Institute of General Medical Sciences (NIGMS)Protein Structure Initiative (PSI). The primary goal of these struc-ture production centers is to determine the three-dimensional

T.B. Acton).sional; HCPIN, Human Cancer Pathway Protein Interaction Network; HDX-MS, amidev model; HTP, high throughput; LIC, ligation independent cloning; LSC, Large Scale

le cloning site; MDA, Multiple Displacement Amplification; MMLV, Moloney Mousete of General Medical Sciences; NMR, nuclear magnetic resonance spectroscopy; PCR,

tion Management System; PSI-2, Protein Structure Initiative-2; SDS–PAGE, sodium

Page 2: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

Fig. 1. Protein Sample Production Platform currently used at the NESG. Thisdiagram presents a schematic representation of the bioinformatics (purple);cloning, expression, purification, characterization, and sample preparation (green);structure determination (blue); and salvage strategies (yellow) used by the NESGProtein Sample Production Platform. A.S. – aggregation screening, ES – metric ofExpression and Solubility level.

22 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

(3D) atomic-level structures of hundreds of novel proteins and pro-tein domains. The novel structural information generated can thenbe utilized in modeling thousands of additional proteins (or proteindomains). In addition, these centers have a major focus on develop-ment and refinement of new technologies for high-throughput (HTP)protein production, X-ray crystallography, NMR spectroscopy, struc-tural bioinformatics, and related supporting infrastructure. Overall,these centers aim to enrich the biological community by disseminat-ing 3D structural information on important protein domain families,providing access to protein expression systems and protocols forprotein sample preparation, and further enabling research by provid-ing improved technology for the preparation of protein samples.

Nucleic acid-based genomic efforts have the advantage that thebiophysical properties of the macromolecules studied are ratherhomogenous, allowing sample preparation that is highly standard-ized and amenable to high-throughput methods. By contrast, pro-teins often have diverse biophysical properties, making thepreparation of suitable samples more difficult, especially whenconsidering parallel HTP methods. Not surprisingly, one of themost critical issues facing structural genomics is the requirementto provide tens of milligram quantities of soluble, high purity, cor-rectly folded, monodisperse protein samples. Adding additionalcomplexity to this issue is the fact that the NESG Consortium uti-lizes both nuclear magnetic resonance (NMR) and X-ray crystallo-graphic methods for protein structure determination (Montelioneand Anderson, 1999), producing a similar number of structuresby each method. Protein samples suitable for rapid three-dimen-sional (3D) structure determination by NMR generally require13C, 15N, and/or 2H isotope enrichment, while for X-ray crystallog-raphy we generally require selenomethionine labeling. Thereforethe NESG Protein Production Platform must be flexible enough tohandle preparation of protein samples for both crystallization/crystallography and for heteronuclear NMR studies. Consideringthese challenges, one of the major contributions of the NESG isthe development of new technologies in the areas of proteinexpression and purification to deliver protein samples suitablefor both NMR and X-ray crystallography.

Here we describe our HTP cloning, protein expression and pro-tein purification pipeline. This article emphasizes the recent tech-nological advances that have been made during PSI-2, and buildson previous work describing our pipeline (Acton et al., 2005). Thissystem is primarily based on Escherichia coli T7 expression systems,which has to date proven to be the most productive, most efficient,and least expensive method to produce the quantities of proteinrequired for structural studies. The description of this platform in-cludes target selection, construct optimization, ligation-indepen-dent cloning (LIC), analytical scale expression and solubilityscreening, Midi-scale expression, purification and biophysicalcharacterization and large-scale protein sample production(Fig. 1). Protein targets of the NESG project are either full-lengthproteins or domain constructs. Currently, each week over one hun-dred protein targets are cloned and screened for expression, 50–75expression constructs are fermented on a preparative (1–2 liter)scale, and roughly 30–40 targets are purified in tens of milligramquantities for biophysical characterization, including NMR and/orcrystallization screening. This platform is both scalable and porta-ble, and can be readily implemented by traditional structural biol-ogy laboratories, biotechnology industry, and various proteomicsand functional genomics projects.

2. Bioinformatics infrastructure and target curation

Protein targets, either full-length proteins or domain constructs,for structure determination are derived from three sources. Thebulk of targets for the PSI LSCs are selected by a centralized PSI bio-

informatics committee, including bioinformatics scientists nomi-nated by each of the LCSs, and distributed among the four centers(Dessailly et al., 2009). These generally constitute large protein do-main families with numerous members that have not been struc-turally characterized (BIG families), very large protein familieswith limited structural coverage (MEGA families), and domain fam-ilies selected from metagenomic projects (META-families) such asthe human gut microbiome project (Gill et al., 2006). The overallgoal of targeting large protein domain families is to provide thegreatest novel leverage of structure space per target (Nair et al.,2009). Consequently this allows for pan-genomic targeting, takingadvantage of the sequence differences and their concomitant bio-physical characteristics within a domain family to isolate familymembers amenable for structure determination (Liu et al., 2004;Acton et al., 2005; Punta et al., 2009). Each LSC also is responsiblefor targets from a biomedical theme. The NESG pursues proteinsfrom the Human Cancer Pathway Protein Interaction Network(HCPIN) (Huang et al., 2008) that we develop and curate (http://nesg.org:9090/HCPIN/). This is a collection of proteins involved incancer associated signaling pathways and biological processes, to-gether with their associated protein–protein interaction partners.Finally, the biomedical community nominates targets to the centralcommittee, which distributes these Community Nominated Targetsto the various PSI LSCs. Although protein target families are derivedfrom these many sources, the focus of the NESG is on domain fam-ilies represented in eukaryotic proteomes, including families thathave exclusively eukaryotic members (e.g. the Ubiquitin DomainMega family) and families that have both eukaryotic and prokary-otic members (e.g. the Start Domain Mega family).

One of the major goals of structural genomics is to increase theefficiency of structure production. More specifically, in the area ofprotein production, both experimental and bioinformatics studieshave been published describing efforts to identify parametersand procedures that correlate with success, such as high levels ofprotein solubility or clone to PDB deposition rates (Dyson et al.,2004; Goh et al., 2004; Graslund et al., 2008a; Slabinski et al.,2007). We have developed numerous bioinformatics tools for thepurpose of identifying the members of a protein domain family

Page 3: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 23

that are most amenable to protein production and structure deter-mination. It is clear that the variation in protein sequence within afamily can have great effects on its behavior with respect to proteinproduction and biophysical properties. Using our extensive dataset of proteins prepared in a similar fashion, we have identified pri-mary sequence traits that correlate with (i) high levels of proteinexpression (E) and solubility (S) in our bacterial expression sys-tems (PES) (unpublished results), (ii) greater probability of crystalstructure determination based on protein sequence (PXS) (Priceet al., 2009) and (iii) greater probability of amenability to NMRstructure determination (PNMR) (unpublished results). These tools,together with our pan-genomic targeting strategy using our exten-sive list of over 175 Reagent Genomes (fully sequenced archeal,bacterial, and eukaryotic genomes and the corresponding geneticmaterial for cloning) allows us to select several (4–6) proteins fromeach family for protein production by identifying those that aremost likely to succeed.

Although we have made great efforts to enrich our protein pro-duction pipeline with amenable targets, one of the greatestenhancements to our pipeline in PSI-2 is our NESG Construct Opti-mization Software. A highly homogeneous protein sample withminimal numbers of disordered nonnative residues is generally re-quired for successful protein crystallization and structure determi-nation by X-ray crystallography (Sharma et al., 2009). While NMRcan often be used successfully to study even fully disordered pro-teins, disordered segments of proteins can cause them to aggre-gate, and can be deleterious to NMR spectral quality. In addition,many targets are within multidomain proteins, which oftenmisfold in prokaryotic systems (Netzer and Hartl, 1997), manymultidomain proteins are also beyond the size limitations forhigh-throughput NMR structural determination techniques. Tocircumvent these problems domain parsing is often required.Obtaining soluble well-behaved domains with minimal disorderedregions is challenging, and often cannot be accurately predicted.The NESG and others have taken an approach of producing severalalternative constructs varying the termini of a targeted domain toidentify the most amenable sequence (Graslund et al., 2008b;Chikayama et al., 2010). Briefly, the construct optimization soft-ware uses reports from the NESG DisMeta server, a metaserverproviding a consensus analysis of sequence-based disorder predic-tors to predict disordered regions (www-nmr.cabm.rutgers.edu/bioinformatics/disorder), identify predicted secretion signalpeptides, trans-membrane segments, possible metal binding sites,secondary structure, and interdomain disordered linkers. Thesestructural bioinformatics data, together with multiple sequencealignments of homologous proteins and hidden Markov models(HMMs) characteristic of the targeted protein domain families(Dessailly et al., 2009) are used to identify possible structuraldomain boundaries. Based on this information, the software gener-ates nested sets of alternative constructs, for full-length proteins,multidomain constructs, and single domain constructs. Thus for asingle targeted region, we generally design multiple open readingframes varying the N and/or C-terminal sequences. Compared toonly pursuing full-length proteins, these alternative constructsoften possess significantly better expression, solubility and bio-physical behavior, increasing the likelihood of success in crystalli-zation and the efficiency of structure production.

Each of the proposed constructs are reviewed by a bioinformat-ics expert, and targets that pass this review are entered into ourProtein Laboratory Information Management System (PLIMS). ThisJAVA-based Oracle database provides a detailed protein productiondata model, integrating closely with activities in the lab. A web-based application, PLIMS consists of four main modules: (i) targetregistration and management, (ii) molecular biology and proteinexpression, (iii) large-scale fermentation, and (iv) protein purifica-tion. It is designed to capture all the information needed to com-

pletely reproduce the protein sample production process,interfacing where possible with robotics, and utilizing bar codes,PDAs, and wireless technology. Data from PLIMS is then uploadedto the internet-accessible NESG SPINE Structure Production Data-base (Bertone et al., 2001; Goh et al., 2003) to be shared acrossthe consortium and with public databases.

Alternative construct DNA sequences are generated in the PLIMSdatabase in a 96-well format. These sequences are then entered intothe NESG Primer Prim’er software for automated primer design(Everett et al., 2004). This freely available web-based software(http://www.nesg.org/primer_primer) generates vector specificPCR primer sets designed to amplify and insert DNA targets into avector of choice. Usually this vector is part of our ‘‘Multiplex VectorKit’’, a series of vectors with a common multiple cloning site de-signed to minimize the number of nonnative residues while addinga 6X-His tag (Acton et al., 2005). Affinity tags are generally requiredfor high-throughput purification protocols (Sheibani, 1999; Croweet al., 1994) however, large disordered tags found with many com-mercial vector systems can interfere with structural determinationefforts. Although both restriction endonuclease and viral recombi-nation cloning strategies are supported in Primer Prim’er, we designORF-specific primers with vector overlap regions for use with InFu-sion (Clonetech) ligation-independent cloning (LIC). Predominantlywe clone into NESG-modified pET15 or pET21 T7 expression vectorderivatives with N- (MGHHHHHHSH–) or C- (–LEHHHHHH) 6X-Hisaffinity purification tags, respectively. The primer information in96-well format is then entered into PLIMS, which produces the or-der forms for our oligonucleotide vendor.

3. High-throughput cloning for E. coli expression

3.1. Methods for the production of PCR template

The first step in the cloning of our structural genomics targetsinvolves PCR amplification of gene regions targeted in the con-struct design process described above. Oligonucleotide primers de-signed with Primer Prim’er are easily procured from a variety ofvendors at inexpensive rates. However, PCR templates are not eas-ily procured and are often expensive, and in the case of genomicDNA preparations from prokaryotic targets, of limited quantity.Further, our focus on eukaryotic protein families has the addedcomplication that we must use cDNA in most cases in order toclone eukaryotic targets. Here we outline two alternative methodsfor generating template DNA for HTP 96-well PCR reactions.

The number of fully sequenced prokaryotes has increased at arapid rate resulting in the elucidation of the genomic sequencesof over 1000 organisms, with even more in progress. As methodsto predict success in expression, solubility, crystallization, andNMR spectral quality, based on primary sequence, are refined(Price et al., 2009), it becomes possible to increase efficiency byselecting target proteins and domains from large domain familiesthat are most likely to be successful. The protein sequences thatarise from these prokaryotic sequencing projects are a rich sourceof targets that may be amenable to structural determination. How-ever, genomic DNA preparations are commercially available foronly a small fraction (�10%) of the sequenced prokaryotic strains.It is possible to produce genomic DNA by direct extraction fromcultures, however, many strains require specialized media andgrowth conditions which make such a strategy difficult and expen-sive. To circumvent these problems we have implemented WholeGenome Amplification (WGA) by Multiple Displacement Amplifi-cation (MDA) utilizing phi29 DNA polymerase (Lasken, 2007,2009; Kvist et al., 2007), to produce microgram quantities of geno-mic DNA suitable for use as cloning template (Dean et al., 2002).WGA by MDA is routinely used in metagenomic/environmental

Page 4: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

24 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

genome sequencing projects to prepare DNA templates from min-ute quantities of cells (Kvist et al., 2007; Lasken, 2009). As the vastmajority of sequenced bacterial strains are available as inexpensivelyophilized cultures from ATCC (American Type Culture Collec-tion), we routinely perform MDA on a small aliquot of freeze-driedcells to provide genomic DNA suitable for use as PCR template incloning NESG target genes. This high-fidelity technique has provento be extremely robust, and has successfully generated genomicDNA for more than 30 new bacterial and archaeal Reagent Gen-omes, including a number of human gut metagenome species,greatly expanding the range of proteomes that we can target (Ac-ton et al., in preparation).

One major advantage of bacterial targets for HTP cloning is thefact that they do not contain introns in their coding sequences.Therefore, genomic DNA can be used as template for PCR amplify-ing the coding region of a target for subsequent cloning into a bac-terial expression vector. In a high-throughput setting, this also hasthe added advantage that a PCR ‘‘master mix”, containing the geno-mic DNA as template, can be added to multiple reactions, savingtime and robotic liquid handling tips.

Conversely, eukaryotic organisms often have introns in the cod-ing regions of their genes. E. coli does not have the ability to splicemRNA transcripts prohibiting the use of genomic DNA as PCR tem-plate for eukaryotic targets, a cDNA source is then necessary foramplification. There are numerous commercially available sourcesfor cDNA. However, many have significant problems with themajority lacking full-length sequence verification, as such theycan often contain polymorphisms. While there are full-length, fullysequenced cDNA sources such as the ORFeome collaboration clones(Open Biosystems) (Rual et al., 2004; Lamesch et al., 2007), thesereagents are quite costly and these collections are not complete.Further, these individual clone libraries have logistical issues, theymust be archived and rearrayed, before use as PCR template. In or-der to circumvent these problems, we have taken the approach ofproducing cDNA pools from various cell types using commerciallyavailable mRNA preparations (Clontech). In this strategy we usepolyadenylated mRNA from various tissues, cell types, and devel-opmental stages, including a considerable number of tumor cellsand human cell lines, together with oligo dT or random primers,to carry out MMLV-mediated reverse transcriptase reactions (Ac-ton et al., 2005). These cDNA pools are then mixed together andused as a common template that is added to each PCR reactionwith target specific primers much like using bacterial genomicDNA. This greatly increases throughput and allows us to targetgenes that may not be in the available cDNA libraries. This strategyis quite effective. Analysis of our PLIMS database indicates that 88%of our GC-rich (>59% GC content) and 96% of lower GC content RT-PCR amplification products are of the correct size. Although thisapproach may also generate clones with polymorphisms, we findit to be cost effective and more amenable to HTP in comparisonto cloning from commercially available cDNA sources. Recentlywe have expanded this strategy from mainly human targets to in-clude Bos taurus, Mus musculus, Rattus norvegicus and Aribidopsisthaliana among others.

3.2. Ligation-independent cloning (LIC) and automated vectorconstruction with the Qiagen BioRobot 8000

The first step in the HTP production of proteins is the construc-tion of vectors for expression of the target proteins. The NESG ini-tially developed HTP approaches to cloning utilizing classicalrestriction endonuclease/ligase-dependent methods in combina-tion with our Multiplex Cloning Vector Set implemented in 96-wellformat using a BioRobot 8000 (Acton et al., 2005). The vector sys-tem we created was designed to minimize the number of nonna-tive residues in the open reading frame while adding a 6X-His

tag. Using this robust strategy we have cloned nearly 7000 targetprotein (or domains). Ligation-independent cloning systems (LIC)are generally far more efficient, less time consuming and requireless technical skill than ligase-dependent cloning (Aslanidis andde Jong, 1990; Haun and Moss, 1992). However, our view at thestart of the project was that although the LIC technologies werepromising, the technology was not developed enough in 2000 tomeet the needs of a structural genomics project. For example, mostof the early technologies resulted in the addition of a large numberof nonnative residues to a protein coding sequence, which is notdesirable for crystallization or NMR studies.

Although we have had great success with our classical system ofcloning, LIC systems always held promise for our HTP applications.Recent advances, such as the InFusion cloning system (Clonetech)have negated the previous drawbacks. During PSI-2 we adaptedthe InFusion strategy to our HTP cloning pipeline. InFusion cloningonly requires the addition of a 15 base pair tail to each of the genespecific PCR primers for a given target ORF; these base pairs arecomplimentary to the 50 and 30 regions of the vector multicloningsite, respectively (Zhu et al., 2007). After PCR amplification, theORF DNA fragment now containing the region of vector overlapis incubated with the vector and the InFusion enzyme for 30 minand directly transformed into bacterial competent cells, resultingin a protein expression clone. LIC competent vector is producedby restriction endonuclease digestion in a nearly identical manneras described for ligase-dependent cloning (Acton et al., 2005).Briefly digestion with NdeI is followed by XhoI digestion, agarosegel purification, gel extraction and finally the concentration is nor-malized to 8 ng/ll. Further vector treatment against self-ligation isnot necessary since the Infusion enzyme does not have ligase activ-ity, and ligation by host enzymes appears inefficient with the min-ute overhangs produced by restriction digest. The greatestadvantage of the InFusion method is the substantial decrease inthe number of cloning steps and the overall high efficiency of clon-ing. The restriction digest steps, long overnight ligation reactions,and several purification steps are no longer necessary. This resultsin a dramatic time savings, allowing the same number of techni-cians to nearly double their cloning output. In addition, this strat-egy is completely compatible with our Multiplex Cloning VectorSet (Acton et al., 2005), using the same exact vectors and the samestrategy for minimizing nonnative residues. The removal of therestriction digestions steps also allows those ORFs with the mostfavored restriction sites internal to their coding sequence to becloned while minimizing nonnative residues. Our modificationsto the InFusion cloning system also allows this strategy to be costefficient and actually below the cost of our ligase-dependent sys-tem. Using this new method we have cloned over 20,000 con-structs of some 9000 unique protein targets (multiple alternativeconstructs per target) into pET expression vectors, a dramatic in-crease in our previous rate of cloning.

We have automated each step of our vector construction strat-egy using a BioRobot 8000 to allow high-throughput cloning in a96-well manner. Fig. 2 outlines each of these steps of vector con-struction. Steps shown in blue typeface are automated while thosein red are semiautomated, requiring some manual manipulations.A detailed protocol of the entire process can be downloaded(http://www-nmr.cabm.rutgers.edu/labdocuments/proteinprod/index.htm) and has been previously described (Acton et al., 2005).Each automated step is controlled by a custom Qiasoft 4.1 programdeveloped in house. Initially, 50 lM concentrations for forwardand reverse primers for each specific ORF (identical wells on twoseparate 96-well blocks) are placed on the BioRobot. From a sepa-rate position, the eight-channel pipette head transfers an appropri-ate PCR reaction mix to each well in a 96-well PCR plate [dNTPs,Advantage HF2 high-fidelity polymerase and buffer (Clontech),template DNA, and nuclease free H2O]. The BioRobot then transfers

Page 5: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

PCR Reaction

PCR Purification

Tra

nsfo

rmat

ion

Col

ony

PC

R

Lig

atio

nIn

depe

nden

tC

loni

ng

96 Expression Vectors

PCR samples

Bind

Wash

Elute

Pure PCR products

Vacuum

Vacuum

Vacuum

Overnight cultures

ResuspendLyse Transfer

Filter

BindWash

Elute

TurboFilter 96

QIAprep 96

Fig. 2. Schematic of the cloning process using the Qiagen BioRobot 8000. Each stepin the cloning strategy is indicated. Blue type denotes steps that are completelyautomated, and red type indicates steps that require some manual input.Procedures of Qiagen-based protocols were modified, including Qiaquick Purifica-tion and DNA Mini-Prep protocols. However, most have been designed in the NESGProtein Production laboratory. A more detailed description of the robotic cloningprocedure, as well as the automated protocols are provided elsewhere (Acton et al.,2005).

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 25

100 pmol of the appropriate forward and reverse primers from theprimer blocks into the corresponding well for each target in thePCR plate. A variety of Applied Biosystems thermocyclers are usedfor amplification with 35 total cycles. Each cycle contains a 10 s94 �C melting step, a 20 s annealing step (50–55 �C), and a 3 min68 �C elongation step. An annealing temperature step increaseafter 10 rounds of amplification is included taking advantage ofthe increased stability derived from the added recombination sitesbase pairs (Acton et al., 2005).

Our expansion of Reagent Genomes during PSI-2 has also in-creased the number of GC-rich genomes. As the GC content in-creases, PCR amplification becomes more problematic. In order tocircumvent this problem, alternative thermostable polymerasesand buffer conditions (such as the addition of DMSO) must be uti-lized. Care must be taken to adjust buffer and annealing tempera-ture conditions to maximize fidelity while increasing the likelihoodof obtaining amplification product. GC-rich templates are often aproblem with eukaryotic genes and although we have great suc-cess with the Advantage GC 2 Polymerase (Clontech), higher errorrates will occur.

PCR products are visualized and separated on a 2% agarose gel,followed by Alpha Imager (Cell Biosciences) documentation, andentry into the PLIMS data management system. DNA fragmentsof the correct size are excised from the gel with a SafeXtractorand relocated into the appropriate well of a 96-well S-Block (Qia-gen). Using reagents from the Qiagen Gel Extraction Kit and a QIA-quick 96-well column PCR Cleanup plate, an automated 96-well gelextraction is performed on the BioRobot 8000. The resulting puri-fied PCR products are then subjected to LIC cloning into pETexpression vectors, as described above. Following the InFusion en-zyme activity, the resected and paired DNA fragments (vector andinsert) are transformed into E. coli cells, using a 24-well format ro-botic transformation procedure. Briefly, a single microliter LICproduct is transferred to the corresponding well of a fresh 96-wellPCR plate prechilled at 0 �C on the robot deck. Each well of thisplate contains 10 ll of XL-10 ultracompetent cells (Agilent). A

transformation procedure is then carried out on the robot deckkeeping the PCR plate at 0 �C until a manual heat shock. SOC(100 ll) is added to each well, and the plate is incubated at 37 �Cfor 1 h. The entire content of each well is transferred to a corre-sponding well in one of four 24-well blocks. The robot’s platformshaker spreads the mix via the 5–10 (3-mm-diameter) glass beadsover the 2 ml of Luria broth (LB) medium/Agar with ampicillin ineach well. Following overnight incubation at 37 �C, two coloniesper ORF are harvested for colony PCR, using primers flanking themultiple cloning site (MCS). The results are visualized by agarosegel electrophoresis, documented into PLIMS, and the correct clonesare subcultured overnight. Plasmid DNA is isolated using a com-pletely automated Qiagen 96-well DNA mini-prep procedure andboth the cultures and DNA constructs are archived in an NESG Re-agent Repository.

4. Protein expression, solubility, and biophysicalcharacterization

4.1. Analytical scale expression

The goal of analytical scale expression is to measure theexpression and solubility level of each construct. Depending onthe source of the protein target, roughly 30–50% of the clones willexpress soluble protein at the level needed for large-scale (1–3liters) fermentation in shake flasks, and purification. With thisattrition rate, preparative-scale fermentation of each clone isnot feasible. We have therefore developed a plate-based strategyto evaluate expression (E) and solubility (S) in a HTP fashion,while maintaining the highly aerated growth conditions foundin later fermentation efforts. Fig. 3 outlines this process startingwith transformation into the codon-enhanced BL21(DE3)pMgKstrain using a robotic transformation protocol and 24-well plates.Following overnight growth, individual colonies are inoculatedinto the corresponding well of a 96-well block containing 0.5 mlof LB per well. The pre-culture is incubated for 6 h at 37 �C, andpreserving well assignment, subcultured robotically into a fresh96-well block containing 0.5 ml of MJ9 minimal media (Janssonet al., 1996) for overnight growth. Growth in this same minimalmedia will be utilized in preparative-scale fermentations for iso-tope or selenomethionine enrichment. We have found thatgrowth under minimal media conditions differs significantly fromrich media, often affecting expression and solubility behavior. TheBioRobot performs a 1:20 dilution of the saturated growth intoone of four 24-square-well blocks (10 ml maximum volume/well)containing 2 ml of MJ9 media, preserving well assignment. Eachblock is sealed, covered with Airpore tape (Qiagen), and grownto mid-log phase (2–3 h growth, 0.5–1.0 OD600 units) with vigor-ous shaking at 37 �C. Expression is then induced with 1 mM IPTG,the temperature is shifted to 17 �C, and the cultures are grownovernight with vigorous shaking. The low temperature incubationoften aids in producing soluble proteins (Shirano and Shibata,1990), while the vigorous shaking with gas permeable tape allowsfor greater aeration rates like those that we obtain in our Midi-scale fermentor (described below) or large-scale fermentation inbaffled flasks. Following overnight induction, cells are harvestedby centrifugation, the pellets are resuspended in lysis buffer(50 mM NaH2PO4, 300 mM NaCl, 10 mM 2-mercaptoethanol)and robotically transferred to a 96-well PCR plate. A 96-probesonicator (Misonix) is used for cell disruption. Total and solubleportions of the cell lysate are visualized by SDS–PAGE. Expression(E) and solubility (S) are scored, each on a scale of 0 (none) to 5(max); i.e. the E � S (or ES value) ranges from 0 to 25. All data isdocumented in the PLIMS system.

Page 6: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

Transformation

Har

vest

Plat

eC

entr

ifuge

Overnight Culture96 Well Transfer (MJ9)

96-Well Culture

24-well blocks96 to 24 Well Transfer

O/N

@17

o C

Induction

37o C

96-Well Plate

24-Well Block

2.2 ml S-Block

2.2 ml S-Block37o C

37o C

BL21

LB

96 Probe Sonicator

Tot Sol Tot Sol Tot Sol Tot Sol Tot Sol

SR62 SR63 SR64 SR65 SR68

GNF Midi-Scale/Large-Scale FermentationOD600 ~0.4-0.8 IPTG

Plate Centrifuge

Fig. 3. High-throughput analytical scale protein expression screening using robotic methods. This schematic shows the step-by-step procedure used for small-scaleexpression screening. Completely automated steps are shown in blue, and partially automated steps are shown in red. Briefly, initial cultures are grown in 2.2 ml 96-well S-blocks (Qiagen), followed by subculturing in 24-well blocks (Qiagen). Following overnight incubation the cultures are transferred into two separate S-blocks (1 ml perrespective well) and harvested by centrifugation (3000 � g, 10 min). The media is discarded and the cell pellet is resuspended in 100 ll of lysis buffer and transferred to a 96-well round bottom plate (Greiner). Following sonication a 30 ll aliquot of the total cellular lysate (Tot) is transferred to a new plate. The remainder is centrifuged for 10 min at3000 � g, and a 30 ll aliquot of the supernatant (Sol) is transferred to a new plate. Equal amounts of Tot and Sol are added to adjacent wells for SDS–PAGE analysis.

26 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

4.2. Midi-scale protein production and characterization

Although all expression constructs with high expression andsolubility levels (e.g. ES > 11) can be scaled-up on a preparativescale, a large fraction of the resulting samples turn out to be aggre-gated or even unfolded following preparative purification. Asshown in Table 1, retrospective analysis of our earlier extensivedata set (>1500 purified proteins) demonstrates that crystalliza-tion success rates are dramatically increased more than 10-foldfor monodisperse protein samples in comparison with those poly-disperse or aggregated (Price et al., 2009). Based on these results,and in order to maximize efficiency, we have developed a HTPMidi-scale Protein Production Pipeline, allowing production of hun-dreds-of-microgram quantities of protein, sufficient to characterizethe biophysical properties of protein constructs before investing inlarge-scale expression and purification (Fig. 4). This system utilizes(i) a 96-tube GNF fermentor (Genomics Institute of the NovartisResearch Foundation) with O2 aeration allowing for high cell den-sity protein expression at 60 ml scale; (ii) a His MultiTrap HP 96-

Table 1Analysis of crystal hits from NESG protein samples with various monodispersity (2001–20

Year Monodisperse Predominantly monodisperse Mostly

2001 1/2 (50.0%)2002 24/82 (29.3%) 0/4 (0.0%)2003 14/52 (26.9%) 1/3 (33.3%) 1/1 (102004 42/112 (37.5%) 6/19 (31.6%) 0/11 (02005 37/148 (25.0%) 2/31 (6.5%) 1/9 (112006 37/223 (16.6%) 14/41 (34.1%) 2/32 (62007 47/277 (17.0%) 7/57 (12.3%) 0/26 (0

Total 202/896 (22.5%) 30/155 (19.4%) 4/79 (5

Proteins with crystal hits/proteins provided for crystallization screening (% crystal hits)Monodisperse: >90% monodispersity.Predominantly monodisperse: >80% but <90% monodispersity.Mostly polydisperse: >50% but <80% monodispersity and <3 peaks.Polydisperse: <50% monodispersity with >3 peaks.Indeterminate: protein not in void volume (Vo), but obscured by ring-down from Vo.Aggregated: protein in Vo.

well plate (GE Healthcare) for Ni-affinity protein purification;and (iii) Zeba™ 96-well desalting spin plate (Thermo Scientific)for buffer exchange. Typical yields of 0.2–1.0 mg of protein per60 ml fermentation are achieved, with 96 fermentations carriedout in parallel. These quantities of purified protein are sufficientfor a series of analytical protein chemistry steps including: aggre-gation screening by analytical gel filtration with static light scatter-ing (Acton et al., 2005), homogeneity analysis using Calipermicrofluidics, target validation by MALDI-TOF mass spectrometry,concentration determination by a NanoDrop ND-8000 spectropho-tometer, and NMR screening using a 1.7-mm micro cryo NMRprobe (35 lL sample volume). Identification of aggregated/polydis-perse proteins prior to scale-up allows us to screen multiple con-structs of a target in order to find those most likely to succeed incrystallization and/or NMR experiments. NMR screening (Rossiet al., 2010), requiring 100–300 lg samples of protein, allows spec-tral evaluation by 1D 1H NMR prior to isotopic enrichment. Further,the Midi-scale process avoids scale-up of intractable protein tar-gets, greatly increasing project efficiency.

07).

polydisperse Polydisperse Indeterminate Aggregated

0.0%) 0/1 (0.0%).0%) 0/110 (0.0%) 0/35 (0.0%).1%) 0/24 (0.0%) 0/20 (0.0%) 0/27 (0.0%).3%) 2/30 (6.7%) 1/46 (2.2%) 1/25 (4.0%).0%) 0/19 (0.0%) 1/69 (1.4%) 1/70 (1.4%)

.1%) 2/183 (1.1%) 2/135 (1.5%) 2/158 (1.3%)

.

Page 7: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

GNF system(60 ml)

Purification using96-well plate withNi-NTA superflow

buffer exchangeusing 96-well

desalting plate

ND 8000(NanoDrop)

Lab Chip 90(Caliper)

MALDI-TOF (Appl. Biosys)

HPLC 1200 Series (Agilent)MiniDawn+Optilab (Wyatt)

NMR Microprobe Screening

Concentration Purity MW Aggregation Screening

(0.2—1.0 mg, 4-10 µg/µl)

Supernatant in 96-well plate

35 µl 1 µl 2 µl 1 µl 15 µl

Fig. 4. Midi-scale 96 sample protein expression, purification, and characterization. This system utilizes (i) a 96-tube GNF fermentor (Genomics Institute of the NovartisResearch Foundation) with O2 aeration at 60 ml scale; (ii) a His MultiTrap HP 96-well plate (GE Healthcare) for Ni-affinity protein purification; and (iii) Zeba™ 96-welldesalting spin plate (Thermo Scientific) for buffer exchange. Analytical protein chemistry steps include aggregation screening by analytical gel filtration with static lightscattering, homogeneity analysis using Caliper LabChip� 90 system, target validation by MALDI-TOF mass spectrometry, concentration determination by a NanoDrop ND-8000 Spectrophotometer, and NMR screening using a 1.7-mm micro cryo NMR probe.

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 27

4.3. Midi-scale fermentation with GNF fermentor system

To produce sufficient quantities of protein for biophysicalcharacterization we have recently adapted a GNF 96-well fer-mentor (Genomics Institute of the Novartis Research Foundation)to our Midi-scale pipeline. Using rich TB media (Peti et al., 2005),we routinely reach cell densities in the range of 15–20 OD600

units. This correlates to a quarter of the final cell mass obtainedfrom 1 l of our large-scale protein expression in minimal media,which roughly produces to 3–5 OD600 units. Briefly this proce-dure starts with placing 500 ll of TB media with ampicillinand kanamycin into each well of a 96-well block. Expressionclones scored with high (>11) ES values are robotically trans-ferred from their plate-based glycerol stocks into a PLIMS-direc-ted unique well. Following an overnight incubation, the entirecontents of each well are then transferred to a 100 ml test tubein the corresponding position of the GNF fermentor. Each tubecontains 57 ml of TB and anti-foam, the air intake manifold is in-serted into the rack of 96-tubes and placed in a water bath pre-heated to 37 �C. Using the manifold and its canulae, 100% oxygenis distributed to each well at a flow rate of �3.5 cfm. This pro-vides oxygen for growth as well as agitation for mixing the cul-ture. We have found that the dual functioning canulaenecessitate a high percentage of oxygen addition for the greatestyield, in turn requiring adequate system ventilation for safety.When OD600 reaches 5–6 units, IPTG is added for a final concen-tration of 1 mM. Concurrently, the water bath temperature is de-creased to 17 �C using a refrigerated water circulator (VWRScientific). Following 16 h of incubation at this temperatureand aeration with 100% oxygen, an aliquot is taken from eachwell to assay final cell density and for SDS–PAGE analysis ofexpression and solubility levels, and each is transferred to a la-beled 50 ml conical tube, and centrifuged. The resulting data isdocumented in the PLIMS database.

4.4. Ni-affinity protein purification using His MultiTrap HP 96-wellplate

Cell pellets from each culture are resuspended in lysis buffercontaining 1� cell lytic B, 500 mg/ml lysozyme (freshly prepared),100 units/ml RNAse, 100 units/ml DNAse, and 40 mM imidazole.Following a shaking incubation at 37 �C for 30 min, cell debris iscleared by centrifugation at 3000 rpm for 20 min. Two millilitersof each resulting supernatant is transferred to an empty 2.2-mldeep-well plate (Qiagen S-block). A Liquidator96 (Rainin) is usedto transfer 400 ll from each well to the corresponding well of aHis MultiTrap HP 96-well plate (GE Healthcare) for Ni-affinity pro-tein purification. The plate is centrifuged for 4 min at 100 � g, theflow through is discarded, and the process repeated four moretimes to load the entire contents of each well. Each well in theNi-IMAC plate is washed three times with 500 lL of lysis buffercontaining 40 mM imidazole (pH 7.5). Proteins are next eluted byadding 75 ll of lysis buffer containing 300 mM imidazole (pH7.5) to each well, the plate is then incubated at room temperaturefor 5 min and centrifuged at 100 � g for 4 min. The Ni-affinity puri-fied proteins are then immediately transferred to a Zeba™ 96-welldesalting spin plate (Thermo Scientific) for buffer exchange intoappropriate buffers for biophysical characterization.

4.5. Biophysical characterization

4.5.1. Homogeneity analysis using Caliper LabChip� 90 systemTo assay the purity of proteins from the Midi-scale purification

we have incorporated the use of a LabChip� 90 system (Caliper).This microfluidic device uses the same electrophoresis separationprinciple as SDS–PAGE (Bousse et al., 2001). However, the Lab-Chip� 90 system has higher sensitivity, lower volume (1–2 ll)requirements, 96-well format compatibility with the BioRobot8000, and is less time consuming (�90 min per plate). These make

Page 8: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

28 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

it ideal for our Midi-scale purification and characterization plat-form. Briefly, samples are prepared in a 96-well plate by mixing2 ll of protein sample with 7 ll of denaturing buffer. Followingheat denaturation (5 min @ 95 �C), 35 ll of water is added to eachwell. The LabChip 90 automation system then loads the Protein Ex-press chip, following separation and detection the software reportsthe size, relative concentration and purity of the proteins. Althoughthis system provides high-quality results, lower molecular weightproteins (<12 kDa) cannot be accurately analyzed with this system.All data is archived in our PLIMS database.

4.5.2. Target validation by MALDI-TOF mass spectrometrySamples are prepared by mixing 1 ll of the protein sample from

each well with 10 ll of sinapinic acid (SA) matrix solution (10 mg/ml SA in 50% acetonitrile/50% 0.1% TFA). Spectra are collected foreach protein spot, corresponding to a well, on a MALDI-TOF/TOF(ABI-MDS SCIEX 4800) in single TOF mode. The spectrum of eachwell is compared to the expected size of the purified protein; spe-cies differing from their expected mass by greater than 500 Dalikely represent invalid targets, and are further investigated in or-der to validate the protein sequence.

4.5.3. Aggregation screening by analytical gel filtration with static lightscattering

It is now well established that proteins that are monodispersein solution are more likely to produce crystals during screening tri-als than polydisperse or aggregated samples (Klock et al., 2008;Ferre-D’Amare and Burley, 1994, 1997; Price et al., 2009). Analyti-cal gel filtration followed by multi-angle static light scattering(SEC-LS) is an extremely sensitive method for detecting the distri-bution of oligomers and/or aggregates in a protein sample. Briefly,an Agilent 1200 series HPLC system with an automated 96-wellsample changer is used with a Shodex KW-802.5 HPLC size-exclu-sion column to separate the protein species in solution. A mini-DAWN TREOS detector (Wyatt technologies) simultaneouslymeasures light scattering at three different angles (45�, 90�, and135�). Refractive index is also measured using an Optilab rEXRefractometer (Wyatt Technology). Together, the analysis of thisdata provides the shape-independent weight-average molecularmass of each species in the gel filtration effluent, and their relativedistributions. As shown in the top panel of Fig. 5, the light scatter-

Strip Chart – H

DimerOligomer

Aggregated

Valu50

0.70.60.50.40.30.20.10.4

0.3

0.2

0.1

0.0

Det

ecto

r: A

UX1

D

etec

tor:

2

Fig. 5. Aggregation screening using analytical gel filtration with static light scattering. Dak = 690 nm and at 30 �C on a sample of target HR3580C. The elution profile as detectedillustrated.

ing trace for NESG target HR3580C indicates peaks correspondingto monomer, dimer, higher oligomers, and aggregates of the pro-tein. The bottom panel traces the refractive index and indicatesthat the majority of mass is contained as a monomer. Further dataanalysis indicates that roughly 75% of the mass is monomeric, withsignificant mass in other species. This suggests that further con-struct optimization or other ‘‘salvage” efforts are required beforepromotion to large-scale fermentation and purification.

4.5.4. Concentration determination by a NanoDrop ND-8000spectrophotometer

Traditional spectrophotometry requires placing samples intocuvettes or capillaries. This is impractical due to the limited samplevolumes generated by the Midi-scale system. The NanoDrop™8000 spectrophotometer enables the quantification of samples involumes as low as 0.5–2 ll without dilution. Using an eight-chan-nel pipette to transfer the purified proteins from the 96-well plateto a linear array (96-well spacing) of pedestals allows the measure-ment of 96 samples in less than 6 min. The protein concentration ineach well is calculated automatically using its respective extinctioncoefficient. The accurate protein concentrations derived from thisassay are used for the light scattering data analysis, sample prepa-ration for NMR screening, and for calculating process yield. All ofthe information generated in this step is recorded in the Spinedatabase.

4.5.5. Microprobe NMR screeningRecent advancements in NMR microprobe technology have

greatly decreased the amount of protein necessary for study. Typ-ically, only 10–200 micrograms of protein in a volume of 35 ll issufficient for screening with a Bruker 600 MHz and TXI 1.7 mmMicroCryoprobe. Our microscale protein NMR sample screeningpipeline has been discussed elsewhere (Zhang et al., 2008; Rossiet al., 2010). In the context of the Midi-scale expression and puri-fication procedure, there are a few changes. The 96 purified pro-teins are buffer exchanged into NMR buffer (typically, 20 mMMES, 200 mM NaCl, 10 mM DTT, 5 mM CaCl2 at pH 6.5) using aZeba™ 96-well desalting spin plate (Thermo Scientific) and ali-quots are transferred to 1.7 mm SampleJet Tubes (Bruker) using aGilson 96 liquid-handler. The rich TB broth does not allow for iso-tope enrichment, therefore only 1D 1H NMR spectra can be pre-

R3580C_001_01

Monomer

me (mL)10 15 20

ta was collected on a miniDAWN Light Scattering instrument (Wyatt Technology) atby static light scattering at 90� (LS) (red-trace) and refractive index (blue-trace) is

Page 9: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 29

formed. However, this screen can detect the dispersion of amideprotons and upfield-shifted methyl protons that are indicative ofaromatic and methyl stacking (folded protein core). Proteins exhib-iting these traits, well folded and disperse amide protons, are morethan likely amenable for structure determination by NMR and canbe scaled-up for fermentation and purification with isotopeenrichment.

5. Preparative-scale protein sample production

Proteins with high expression and solubility levels, high mono-dispersity, and/or good 1D 1H NMR spectra are promoted for large-scale expression, purification, and sample preparation.

5.1. Preparative-scale fermentation

Although recent technical advances have in some cases allowedstructural determination with as little as 75 lg of protein (Araminiet al., 2007), the amount of protein required for crystallizationscreening and/or structure determination by NMR is typically 5–50 mg, with greater than 95% purity. Our process for preparative-scale (large-scale) protein expression has been designed to opti-mize conditions with respect to yield, cost, throughput, and the dif-ferent structural determination approaches. A strategy based onfermentation in 2-l baffled Furnbach flasks was chosen becauseof its simplicity, the low cost of the associated equipment suchas shakers, and ease of parallelization (Acton et al., 2005). In addi-tion, NMR structural determination requires enrichment with 15N,13C, and/or 2H isotopes. Conversely, high-throughput X-ray crystal-lography of proteins is most efficient using single (SAD) and multi-ple anomalous diffraction (MAD) methods (Hendrickson, 1991)requiring selenomethionine substituted protein samples. In orderto achieve this we have developed a fermentation system basedon growth with MJ9 minimal media (Jansson et al., 1996). This al-lows both isotopic (i.e. 15N, 13C, and/or 2H) enrichment or seleno-methionine labeling while achieving adequate cell density andprotein expression levels for structural biology studies.

Briefly, protein expression constructs that pass analytical scalecharacterization are identified through the PLIMS database, andtheir appropriate glycerol stock plate and well position reported(each plate has a unique bar code identification). An aliquot istransferred to 500 ll of LB with ampicillin and kanamycin andincubated for six hours at 37 �C. This pre-culture (40 ll) is thenused to inoculate a 250 ml flask containing 40 ml of MJ9 minimalmedia, and incubated overnight at 37 �C. For producing isotope-en-riched proteins for NMR (e.g. U-13C, U-15N-enriched proteins), theentire volume of overnight culture is then used to inoculate a 2-lbaffled flask containing 1.0 l of MJ9 supplemented with uniformly(U)-13C glucose and (U)-15NH4 salts as the sole source of carbonand nitrogen. For X-ray crystallography, non-isotope-enriched car-bon and ammonia sources are used. In both cases, the cultures areincubated at 37 �C until the OD600 reaches of 0.8–1.0 units, equili-brated to 17 �C, and induced with IPTG (1 mM final concentration).As a slight modification, in selenomethionine labeling, induction isdone 15 min after addition of several amino acids to the medium todown regulate methionine synthesis (lysine, phenylalanine, andthreonine at 100 mg/l, isoleucine, leucine, and valine at 50 mg/l,L-selenomethionine at 60 mg/l) (Doublie et al., 1996). The use ofa methionine auxotroph is often used for selenomethionine incor-poration (Walden, 2010). However, our strategy, repressing methi-onine synthesis, routinely results in 75–80% selenomethioninesubstitution and allows for the same expression host to be utilizedfor both NMR and X-ray crystallography sample production. Incu-bation with vigorous shaking in a 17 �C room continues overnightfollowed by harvesting through centrifugation. An aliquot of cells

at harvest is used for determining final cell density and for SDS–PAGE analysis of expression and solubility. An aliquot is also takenand sequence analyzed for quality control. During PSI-2, we ac-quired an Avanti centrifuge (Beckman) with the Harvestliner bagsystem, these centrifuge bags allow for storage in minimal space,as well as ease of cell resuspension in subsequent steps. All datais uploaded in the PLIMS database; select information useful forsharing across the NESG consortium and/or with the public dat-abases is transferred to our project-wide SPINE database (Bertoneet al., 2001; Goh et al., 2003).

5.2. Large-scale parallel protein purification using ÄKTAxpress systems

For both X-ray crystallography and NMR structural studies, it isimperative that the protein samples are highly homogeneous. Theneed to produce protein samples of sufficient purity while retain-ing high throughput is challenging. For preparing samples foreither NMR or X-ray crystallography, the centrifuge bags arethawed on ice and 30 ml of lysis buffer containing protease inhib-itors (Complete, Mini, EDTA-free, Roche) are used to resuspend thecells. The bag contents are then transferred to a metal sonicationcup and sonicated in an ice water bath for 5 � 60 s cycles (10 son/10 s off). The supernatant is cleared by centrifugation at27,000 � g for 30 min, followed by filtering through a 0.2 lm filter.The supernatant is then loaded onto an ÄKTAxpress system (GEHealthcare) and a two-step automated purification protocol is per-formed, comprised of a Ni-affinity column (HisTrap HP, 5 ml), and agel filtration column (Superdex 75 26/60, GE Healthcare) in a linearseries using the preinstalled default settings (AF-GF). Briefly, the6X-His-tagged proteins are eluted from the HisTrap column usingfive column volumes of elution buffer (50 mM Tris–HCl, 500 mMNaCl, 500 mM imidazole, 1 mM TCEP, pH 7.5) at 4 ml/min. The pro-teins are automatically detected by monitoring absorbance at280 nm, and fractions above the designated threshold (majorpeaks) are collected into internal storage loops. The major peaksare then automatically injected onto the Superdex 75 gel filtrationcolumn equilibrated with low salt buffer (20 mM Tris–HCl,100 mM NaCl, 5 mM DTT, pH 7.5) or Standard NMR Buffers (Ta-ble 2). Resulting protein fractions above the designated absorbancethreshold are collected into 2 ml 96-well blocks and the purifica-tion trace for each protein is archived into the Spine database.The ÄKTAxpress system is modular in design with four HisTrapHP columns and one size-exclusion column per module allowingfour separate two-step purifications in less than twelve hours.Overall, we have found this system to be extremely robust.

5.3. Sample preparation

The fractions produced on the ÄKTAxpress are analyzed bySDS–PAGE and pooled followed by concentration using Amiconultrafiltration concentrators (Millipore). The preparation is thensubjected to a series of quality control and analytical protein chem-istry steps including aggregation screening by analytical gel filtra-tion with static light scattering, homogeneity analysis using SDS–PAGE, molecular weight validation by MALDI-TOF mass spectrom-etry, and concentration determination by a NanoDrop ND-8000Spectrophotometer. This data is then archived into the SPINEdatabase for use by researchers throughout the NESG.

For NMR sample preparation, the fractions produced on the ÄK-TAxpress are analyzed by SDS–PAGE and pooled, followed by con-centration using Amicon ultrafiltration concentrators (Millipore).All samples are spiked with 50 mM DSS (4,4-dimethyl-4-silapen-tane-1-sulfonic acid) as an internal reference, 1� Complete Prote-ase cocktail (Roche) and 10% 2H2O. For NMR microprobe screening,aliquots (8 or 35 ll) are then transferred to 1.0-mm or 1.7-mmSampleJet Tubes (Bruker), respectively, using a Gilson 96-well

Page 10: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

Table 2Buffers used for NMR buffer optimization screening.

Buffer ID pH Recipe

MJ001 6.5 20 mM MESb, 100 mM NaCl, 5 mM CaCl2, 10 mM DTTa, 0.02% NaN3, protease inhibitorf 1�, 10% D2OMJ002 5.5 20 mM NH4OAc, 100 mM NaCl, 5 mM CaCl2, 10 mM DTT, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ003 4.5 20 mM NH4OAc, 100 mM NaCl, 5 mM CaCl2, 10 mM DTT, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ004 5.0 50 mM NH4OAc, 10 mM DTT, 50 mM arginine, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ005 5.0 50 mM NH4OAc, 10 mM DTT, 5% CH3CN, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ006 6.0 50 mM MES, 10 mM DTT, 50 mM arginine, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ007 6.0 50 mM MES, 10 mM DTT, 5% CH3CN, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ008 6.5 25 mM Na2PO4, 450 mM NaCl, 10 mM DTT, 20 mM ZnSO4, protease inhibitor 1�, 0.02% NaN3, 10% D2OMJ009 6.5 20 mM MES, 100 mM NaCl, 5% CH3CN, 10 mM DTT, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ010 6.5 20 mM MES, 100 mM NaCl, 50 mM Arginine, 10 mM DTT, 0.02% NaN3, protease inhibitor 1�, 10% D2OMJ011 6.5 20 mM MES, 100 mM NaCl, 1% Zwitterc, 10 mM DTT, 0.02% NaN3, 10% D2OMJ012 6.5 20 mM MES, 100 mM NaCl, 50 mM ZnSO4, 10 mM DTT, 0.02% NaN3, protease inhibitor 1�, 10% D2O

a DTT: Dithiothreitol.b MES: 2-(N-morpholino)ethanesulfonic acid.c Zwitter: ZWITTERAGENT� 3–12 detergent cat. 963015 (CALBIOCHEM).d TCEP: tris(2-carboxyethyl)phosphine.e Tris: tris(hydroxymethyl)aminomethane.f Protease inhibitor: Protease inhibitor cocktail tablets cat. 11836170001 (ROCHE).

30 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

liquid-handler. Samples destined for structure determination aretransferred into a 5 mm Shigemi tube (BP50) with a volume of300 ll, and stored at 4 �C.

For preparation of crystallization screening samples, selenome-thionine labeled proteins are concentrated to �10 mg/ml in lowsalt buffer (10 mM Tris–HCl, pH 7.5, 100 mM NaCl, 5 mM DTT).Protein samples are aliquoted in 100 ll portions, flash-frozen in li-quid nitrogen, and stored at �80 �C. These samples are then usedfor crystallization screening (Luft et al., 2003).

In addition to archiving the data generated during the purifica-tion procedure, the SPINE database also serves to direct shipmentsof protein samples to NESG researchers outside of the protein pro-duction lab. The majority of crystal screening and structure deter-mination is performed outside of the protein production lab andNMR samples are also shipped for structural determination by var-ious NESG researchers. The SPINE database coordinates this effortwith bar code based registration of shipment tubes and automati-cally tracks shipments through the FedEx database.

6. Protein salvage strategies

For proteins providing marginal quality (e.g. ‘‘Promising’’) HSQCspectra or crystal screening hits that cannot be optimized to pro-vide diffraction quality crystals, several ‘‘salvage’’ processes havebeen developed to provide more tractable protein samples. Someof the most effective strategies include sample buffer optimizationusing microprobe NMR screening (Rossi et al., 2010) and furtherconstruct optimization using amide hydrogen deuterium exchangewith mass spectrometry detection (HDX-MS) (Sharma et al., 2009).

6.1. Buffer optimization

Proteins are identified for buffer optimization during the initialscreening process carried out with a standard NMR buffer at pH 6.5(or pH 4.5) (Table 2). If a protein sample is deemed adequate forstructure determination but unstable such that prohibitive precip-itation would occur during the data acquisition time period, or pro-vided marginal quality HSQC spectra, then the sample is directedto buffer optimization. Briefly, a purified protein for a given targetis exchanged into twelve buffer conditions (Table 2) using a Zeba™96-well desalting spin plate (Thermo Scientific) and loaded intoseparate corresponding NMR microprobe tubes. The tubes arescored for precipitation following a set time interval equal to theaverage data acquisition time for an NMR study. This is followedby NMR screening. This identifies the most stable buffer conditions

and future samples of this protein are prepared in the identifiedbuffer. A more detailed description of sample preparation for buf-fer optimization has been previously described (Rossi et al., 2010).

6.2. Construct optimization using amide hydrogen deuteriumexchange with mass spectrometry detection (HDX-MS)

The disorder prediction methods described above using the Dis-Meta server have improved the efficiency of our production pipe-line. In some instances, however, these predictions do notreliably identify the disordered regions of the protein. In thesecases, NMR screening can identify that some regions of the proteinare disordered, even without determining the resonance assign-ments and location of the disordered regions. In order to refinesuch constructs, we have implemented and automated the HDX-MS procedure (Sharma et al., 2009; Englander, 2006; Woods andHamuro, 2001) for the experimental identification of the bound-aries of disordered protein segments. Once identified, alternateconstructs are designed to delete these disordered regions, andreintroduced into the protein production pipeline for recloning,protein purification and attempts for structural determination. Asan example, in a recent pilot study on a small set of targets fromthe NESG (Sharma et al., 2009), we demonstrated the value of usingHDX-MS to design truncated constructs yielding NMR spectra thatare more amenable for NMR structure determination andcrystallization.

6.3. Alternative expression systems

6.3.1. Wheat germ cell-free protein expressionA recent analysis of expression and solubility data derived from

the NESG pipeline attests to the difficulty of producing eukaryoticproteins in E. coli. Whereas 49% of bacterial target domains weresolubly expressed, only 32% of eukaryotic target domains were sol-ubly expressed in bacteria. Previous studies have shown thateukaryotic cell-free systems may permit successful production ofproteins that undergo proteolysis or accumulate in inclusionbodies during bacterial expression (Vinarov et al., 2006). To furtherexplore and develop this technology, we have modified the Prome-ga TnT wheat germ cell-free (WGCF) vector to allow ligase inde-pendent cloning of the PCR products used in our normal cloningpipeline. A pilot study (Zhao et al., 2010) using this system with66 non-secreted human targets that were problematic in the pro-karyotic expression system produced very promising results. Fol-lowing wheat germ cell-free expression we have found that 9 out

Page 11: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 31

of 13 bacterially expressed but insoluble proteins were solubly ex-pressed in WGCF. In total, 34 of the 66 non-secreted human targets(52%) were solubly expressed in WGCF. The major drawback to thissystem is the fact that only small quantities of protein can be pro-duced. However, recent advances in NMR microprobe technologyhave made it possible to determine structures with the roughly100 lg levels of protein routinely produced in this system (Araminiet al., 2007; Rossi et al., 2010).

6.3.2. Chaperone enhanced E. coli expression strainsThe NESG has also developed E. coli strains in which one or

more E. coli chaperones (including GroEL/ES, trigger factor, DnaK/J, GrpE, and ClpB) can be induced during recombinant proteinexpression. These systems exploit the pACYCDuet vector system(Novagen), in which multiple T7lac promoters control expressionof the chaperones. These vectors are compatible with our modifiedpET15 and pET21 vectors, the workhorse plasmids of the NESG,and can co-exist within BL21(DE3) in a stable fashion. We havefound that these chaperones can aid in producing soluble proteinexpression. As a case study, protein target ER58 (COAE_ECOLI)has limited solubility when expressed in our normal T7 system.However, co-expression with trigger factor in a pDuet systemgreatly enhances solubility. Using this approach resulted in thestructure of protein target ER58 (PBD ID: 1spv).

7. Future perspectives

The attrition rate in protein production of human and othereukaryotic targets is considerably higher and continues througheach step in the pipeline. However, we view eukaryotic protein tar-gets as very high value in spite of the added difficulty. This will alsobe a driving force for technology development going forward.Although challenges exist, the current NESG Protein ProductionPipeline has proven very robust for producing some eukaryoticproteins in a form amenable to 3D structure determination. Todate, the NESG has deposited into the Protein Data Bank (PDB)some 80 eukaryotic protein structures, including some 50 humanprotein structures. This represents approximately 10% of the totalNESG structure count of over 900 PDB submissions.

Although eukaryotic targets have higher attrition rates thanprokaryotic targets we have made considerable progress as shownin Fig. 6. During PSI-2 we have cloned over 1200 human proteindomains as 2850 constructs (2–10 constructs per domain basedon the number of predicted disordered regions in the target).Roughly 4% of cloned human domains have progressed to struc-tural determination, and many more of these samples will yieldstructures in the coming months. This success rate greatly eclipsesthe rate we found with over 1000 full-length human proteins (<1%)in PSI-1. A large factor in this increased efficiency is no doubt a re-

Fig. 6. The percentage of PDB depositions relative to cloned targets from various origins aX-ray crystallography. Target classes are from left to right Prokaryotic (Prok), Eukaryo(3.30.530.40), MCSG Salvage (small soluble proteins that failed to crystallize), BIG-Fam(3.20.20.70), Polysaccaride synthesis, (3.90.550.10), Oxidoreductase (3.20.20.70) and VP3indicated in parenthesis where appropriate.

sult of the disorder-based construct optimization software. Asshown in the bar graph in Fig. 6, it appears that NMR has a signif-icantly higher success rate than X-ray crystallography in producingstructures from these optimized human constructs. Conversely,many other protein families such as the meta domains and Tol-Bmega-family among others are more successful as X-ray crystallog-raphy targets. This is consistent with previous reports indicatingthe complementary nature of the structural determination ap-proaches (Snyder et al., 2005; Yee et al., 2005). Although we havegreatly increased efficiency in producing protein samples of humandomains amenable for structural determination, prokaryotic tar-gets continue to yield higher success rates (�6% ‘‘In PDB”/cloned).

Realizing the need for producing eukaryotic proteins in eukary-otic hosts, the NESG is investing in many promising technologiesfor eukaryotic protein sample production, such as WGCF coupledwith NMR microprobe technology, as described above. Unfortu-nately, the yield of proteins from WGCF is often not sufficient forX-ray crystallography studies in a relatively cost effective manner.Accordingly, we are also exploring other technologies that producethe larger amounts of eukaryotic proteins necessary for crystalliza-tion studies. NESG researchers have explored a Pichia pastorisexpression system that has been successful in producing solublesecreted human proteins. Fermentation yields from this organismcan often be in the tens of milligram range. Another promisingeukaryotic expression system we are exploring is a humanHEK293T cell-based expression systems in conjunction with newbioreactor technology. This technology involves large-scale pro-duction of recombinant secreted proteins using the ‘‘BelloCell”oscillating bioreactor, one-liter the surface area of over 20 1-l rollerbottles (Wang et al., 2006; Ho et al., 2004). Using this technology,NESG researchers have produce 3–4 mg samples of secretedglycosylated human proteins, amounts suitable for crystallizationstudies, while greatly reducing media cost.

Going forward we will continue to develop the NESG ProteinProduction Pipeline with the goal of increased efficiency andbroader applicability to eukaryotic proteins and protein com-plexes. There are several areas that can be improved as technologyis developed. One area is increased use of total gene synthesis.Although prohibitively expensive to use as a primary source ofPCR template today, it continues to decline in cost. Gene optimiza-tion for expression in E. coli has in our hands resulted in increasedexpression levels. In many cases it is the difference between noexpression with the natural sequence to high levels of expressionwith the codon optimized synthetic genes. Although we typicallyemploy rare codon-enhanced strains of BL21, not all genes can berescued by this strategy. Total gene synthesis will no doubt playa large role in biology research going forward. Additionally, we willdevelop eukaryotic host systems, more robust vectors, and im-proved purification procedures for the eukaryotic targets. Often,

nd target classes. The graph is further broken down into targets solved by NMR andtic (Euk), Human, Metagenomic, Ubiqutitin, OB-Fold (2.40.50.140), Start Domainsily, P-loop (3.40.50.300), NADP-binding (3.40.50.720), Tol-B (2.120.10.30), Aldolase9 (3.40.50.150). CATH superfamily identification numbers (Greene et al., 2007) are

Page 12: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

32 R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33

the lower expression rates and more limited solubility of eukary-otic proteins leads to protein purifications that are less homoge-nous following our current high-throughput procedures, andvarious approaches are being explored to improve the purificationprocess for these targets.

8. Conclusions

This paper illustrates the overall sample production and screen-ing technologies of the NESG. It is both scalable and efficient in re-gard to cost (initial setup of equipment and supplies) and time. Wehave outlined our strategies and platform for HTP production ofhigh-quality protein samples using E. coli expression systems.The platform contains powerful bioinformatics tools (software,database, and servers), HTP ligation-independent cloning using aBioRobot 8000, HTP Midi-scale fermentation for biophysical char-acterization, and multi-module large-scale protein purificationsystem using ÄKTAxpress systems, among other notable technol-ogy. Detailed protocols can be found at (http://www-nmr.cabm.rutgers.edu/labdocuments/proteinprod/index.htm) andare freely available.

Unlike many other reported protein production pipelines, thissystem is easily scalable, adding more equipment or personnel willeasily increase output. Perhaps even more important, this systemwas engineered to work with commercially available equipmentand reagents, such that it can be duplicated in any structural biol-ogy lab or core facility. Much of the technology designed for thisproject will likely prove useful for other structural biologists, suchas the disorder-based construct design, robotic cloning and analyt-ical expression.

With the low cost of primers and relatively inexpensive Qiagenlab automation the barrier to this strategy is low. One of the maingoals of this paper to inform the community about our technologyand to make it accessible to other researchers. Overall, this pipelinehas allowed us to clone and express nearly 17,000 different proteintargets, purify over 4400 proteins in tens of milligram quantities,and deposit over 900 new protein structures to the PDB over thepast 10 years (http://nesg.org/statistics.html). Our current produc-tion rates are about 30–40 purified protein targets in tens of milli-gram quantities per week. This enables us to achieve success inproducing not only enough protein samples but also 3D structures.We hope that these improved automated and/or parallel cloning,expression, protein production, and biophysical screening technol-ogies will of value to genomic biologists, biochemists, and struc-tural biologists.

Acknowledgments

We thank Profs. C. Arrowsmith, J. Hunt, M. Gerstein, M. Inouye,J. Marcotrigiano, and L. Tong, along with all the members of theNESG Consortium, for valuable advice in the development of theNESG Protein Sample Production Platform. This work was sup-ported by a grant from the National Institute of General MedicalSciences Protein Structure Initiative U54-GM074958 (to G.T.M.).

References

Acton, T.B., Gunsalus, K.C., Xiao, R., Ma, L.C., Aramini, J., Baran, M.C., Chiang, Y.W.,Climent, T., Cooper, B., Denissova, N.G., Douglas, S.M., Everett, J.K., Ho, C.K.,Macapagal, D., Rajan, P.K., Shastry, R., Shih, L.Y., Swapna, G.V., Wilson, M., Wu,M., Gerstein, M., Inouye, M., Hunt, J.F., Montelione, G.T., 2005. Robotic cloningand Protein Production Platform of the Northeast Structural GenomicsConsortium. Methods Enzymol. 394, 210–243.

Acton, T.B., Lee, D.Y., Wang, H., Montelione, G.T., in preparation. An alternativemethod for generating genomic DNA PCR template and expansion of ‘ReagentGenomes’.

Aramini, J.M., Rossi, P., Anklin, C., Xiao, R., Montelione, G.T., 2007. Microgram-scaleprotein structure determination by NMR. Nat. Methods 4 (6), 491–493.

Aslanidis, C., de Jong, P.J., 1990. Ligation-independent cloning of PCR products (LIC-PCR). Nucleic Acids Res. 18 (20), 6069–6074.

Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D., Yee, A., Edwards, A.M.,Arrowsmith, C.H., Montelione, G.T., Gerstein, M., 2001. SPINE: an integratedtracking database and data mining approach for identifying feasible targets inhigh-throughput structural proteomics. Nucleic Acids Res. 29 (13), 2884–2898.

Bousse, L., Mouradian, S., Minalla, A., Yee, H., Williams, K., Dubrow, R., 2001. Proteinsizing on a microchip. Anal. Chem. 73 (6), 1207–1212.

Chikayama, E., Kurotani, A., Tanaka, T., Yabuki, T., Miyazaki, S., Yokoyama, S.,Kuroda, Y., 2010. Mathematical model for empirically optimizing large scaleproduction of soluble protein domains. BMC Bioinformat. 11, 113.

Crowe, J., Dobeli, H., Gentz, R., Hochuli, E., Stuber, D., Henco, K., 1994. 6XHis-Ni-NTAchromatography as a superior technique in recombinant protein expression/purification. Methods Mol. Biol. 31, 371–387.

Dean, F.B., Hosono, S., Fang, L., Wu, X., Faruqi, A.F., Bray-Ward, P., Sun, Z., Zong, Q.,Du, Y., Du, J., Driscoll, M., Song, W., Kingsmore, S.F., Egholm, M., Lasken, R.S.,2002. Comprehensive human genome amplification using multipledisplacement amplification. Proc. Natl. Acad. Sci. USA 99 (8), 5261–5266.

Dessailly, B.H., Nair, R., Jaroszewski, L., Fajardo, J.E., Kouranov, A., Lee, D., Fiser, A.,Godzik, A., Rost, B., Orengo, C., 2009. PSI-2: structural genomics to cover proteindomain family space. Structure 17 (6), 869–881.

Doublie, S., Kapp, U., Aberg, A., Brown, K., Strub, K., Cusack, S., 1996. Crystallizationand preliminary X-ray analysis of the 9 kDa protein of the mouse signalrecognition particle and the selenomethionyl-SRP9. FEBS Lett. 384 (3), 219–221.

Dyson, M.R., Shadbolt, S.P., Vincent, K.J., Perera, R.L., McCafferty, J., 2004. Productionof soluble mammalian proteins in Escherichia coli: identification of proteinfeatures that correlate with successful expression. BMC Biotechnol. 4, 32.

Englander, S.W., 2006. Hydrogen exchange and mass spectrometry: a historicalperspective. J. Am. Soc. Mass Spectrom. 17 (11), 1481–1489.

Everett, J.K., Acton, T.B., Montelione, G.T., 2004. Primer Prim’r: a web based serverfor automated primer design. J. Funct. Struct. Genomics 5 (1–2), 13–21.

Ferre-D’Amare, A.R., Burley, S.K., 1994. Use of dynamic light scattering to assesscrystallizability of macromolecules and macromolecular assemblies. Structure2 (5), 357–359.

Ferre-D’Amare, A.R., Burley, S.K., 1997. Dynamic light scattering in evaluatingcrystallizability of macromolecule. Methods in Enzymology, vol. 276. AcademicPress, New York, pp. 157–166.

Gill, S.R., Pop, M., Deboy, R.T., Eckburg, P.B., Turnbaugh, P.J., Samuel, B.S., Gordon, J.I.,Relman, D.A., Fraser-Liggett, C.M., Nelson, K.E., 2006. Metagenomic analysis ofthe human distal gut microbiome. Science 312 (5778), 1355–1359.

Goh, C.S., Lan, N., Douglas, S.M., Wu, B., Echols, N., Smith, A., Milburn, D.,Montelione, G.T., Zhao, H., Gerstein, M., 2004. Mining the structural genomicspipeline: identification of protein properties that affect high-throughputexperimental analysis. J. Mol. Biol. 336 (1), 115–130.

Goh, C.S., Lan, N., Echols, N., Douglas, S.M., Milburn, D., Bertone, P., Xiao, R., Ma, L.C.,Zheng, D., Wunderlich, Z., Acton, T., Montelione, G.T., Gerstein, M., 2003. SPINE2: a system for collaborative structural proteomics within a federated databaseframework. Nucleic Acids Res. 31 (11), 2833–2838.

Graslund, S., Nordlund, P., Weigelt, J., Hallberg, B.M., Bray, J., Gileadi, O., Knapp, S.,Oppermann, U., Arrowsmith, C., Hui, R., Ming, J., dhe-Paganon, S., Park, H.W.,Savchenko, A., Yee, A., Edwards, A., Vincentelli, R., Cambillau, C., Kim, R., Kim,S.H., Rao, Z., Shi, Y., Terwilliger, T.C., Kim, C.Y., Hung, L.W., Waldo, G.S., Peleg, Y.,Albeck, S., Unger, T., Dym, O., Prilusky, J., Sussman, J.L., Stevens, R.C., Lesley, S.A.,Wilson, I.A., Joachimiak, A., Collart, F., Dementieva, I., Donnelly, M.I.,Eschenfeldt, W.H., Kim, Y., Stols, L., Wu, R., Zhou, M., Burley, S.K., Emtage, J.S.,Sauder, J.M., Thompson, D., Bain, K., Luz, J., Gheyi, T., Zhang, F., Atwell, S., Almo,S.C., Bonanno, J.B., Fiser, A., Swaminathan, S., Studier, F.W., Chance, M.R., Sali, A.,Acton, T.B., Xiao, R., Zhao, L., Ma, L.C., Hunt, J.F., Tong, L., Cunningham, K., Inouye,M., Anderson, S., Janjua, H., Shastry, R., Ho, C.K., Wang, D., Wang, H., Jiang, M.,Montelione, G.T., Stuart, D.I., Owens, R.J., Daenke, S., Schutz, A., Heinemann, U.,Yokoyama, S., Bussow, K., Gunsalus, K.C., 2008a. Protein production andpurification. Nat. Methods 5 (2), 135–146.

Graslund, S., Sagemark, J., Berglund, H., Dahlgren, L.G., Flores, A., Hammarstrom, M.,Johansson, I., Kotenyova, T., Nilsson, M., Nordlund, P., Weigelt, J., 2008b. The useof systematic N- and C-terminal deletions to promote production and structuralstudies of recombinant proteins. Protein Exp. Purif. 58 (2), 210–221.

Greene, L.H., Lewis, T.E., Addou, S., Cuff, A., Dallman, T., Dibley, M., Redfern, O., Pearl,F., Nambudiry, R., Reid, A., Sillitoe, I., Yeats, C., Thornton, J.M., Orengo, C.A., 2007.The CATH domain structure database: new protocols and classification levelsgive a more comprehensive resource for exploring evolution. Nucleic Acids Res.35 (Database issue), D291–D297.

Haun, R.S., Moss, J., 1992. Ligation-independent cloning of glutathione S-transferasefusion genes for expression in Escherichia coli. Gene 112 (1), 37–43.

Hendrickson, W.A., 1991. Determination of macromolecular structures fromanomalous diffraction of synchrotron radiation. Science 254 (5028), 51–58.

Ho, L., Greene, C.L., Schmidt, A.W., Huang, L.H., 2004. Cultivation of HEK 293 cell lineand production of a member of the superfamily of G-protein coupled receptorsfor drug discovery applications using a highly efficient novel bioreactor.Cytotechnology 45 (3), 117–123.

Huang, Y.J., Hang, D., Lu, L.J., Tong, L., Gerstein, M.B., Montelione, G.T., 2008.Targeting the human cancer pathway protein interaction network by structuralgenomics. Mol. Cell Proteomics 7 (10), 2048–2060.

Jansson, M., Li, Y.-C., Jendenberg, L., Anderson, S., Montelione, G.T., 1996. High-levelproduction of uniformly 15N- and 13C-enriched fusion proteins in Escherichiacoli. J. Biomol. NMR 7, 131–141.

Page 13: Journal of Structural Biology - CESCO BioProductsThe novel structural information generated can then be utilized in modeling thousands of additional proteins (or protein domains)

R. Xiao et al. / Journal of Structural Biology 172 (2010) 21–33 33

Klock, H.E., Koesema, E.J., Knuth, M.W., Lesley, S.A., 2008. Combining thepolymerase incomplete primer extension method for cloning andmutagenesis with microscreening to accelerate structural genomics efforts.Proteins 71 (2), 982–994.

Kvist, T., Ahring, B.K., Lasken, R.S., Westermann, P., 2007. Specific single-cellisolation and genomic amplification of uncultured microorganisms. Appl.Microbiol. Biotechnol. 74 (4), 926–935.

Lamesch, P., Li, N., Milstein, S., Fan, C., Hao, T., Szabo, G., Hu, Z., Venkatesan, K.,Bethel, G., Martin, P., Rogers, J., Lawlor, S., McLaren, S., Dricot, A., Borick, H.,Cusick, M.E., Vandenhaute, J., Dunham, I., Hill, D.E., Vidal, M., 2007. HORFeomev3.1: a resource of human open reading frames representing over 10,000human genes. Genomics 89 (3), 307–315.

Lasken, R.S., 2007. Single-cell genomic sequencing using Multiple DisplacementAmplification. Curr. Opin. Microbiol. 10 (5), 510–516.

Lasken, R.S., 2009. Genomic DNA amplification by the multiple displacementamplification (MDA) method. Biochem. Soc. Trans. 37 (Pt 2), 450–453.

Liu, J., Hegyi, H., Acton, T.B., Montelione, G.T., Rost, B., 2004. Automatic targetselection for structural genomics on eukaryotes. Proteins 56 (2), 188–200.

Luft, J.R., Collins, R.J., Fehrman, N.A., Lauricella, A.M., Veatch, C.K., DeTitta, G.T., 2003.A deliberate approach to screening for initial crystallization conditions ofbiological macromolecules. J. Struct. Biol. 142 (1), 170–179.

Montelione, G.T., Anderson, S., 1999. Structural genomics: keystone for a HumanProteome Project. Nat. Struct. Biol. 6 (1), 11–12.

Nair, R., Liu, J., Soong, T.T., Acton, T.B., Everett, J.K., Kouranov, A., Fiser, A., Godzik, A.,Jaroszewski, L., Orengo, C., Montelione, G.T., Rost, B., 2009. Structural genomicsis the largest contributor of novel structural leverage. J. Struct. Funct. Genomics10 (2), 181–191.

Netzer, W.J., Hartl, F.U., 1997. Recombination of protein domains facilitated by co-translational folding in eukaryotes. Nature 388 (6640), 343–349.

Peti, W., Page, R., Moy, K., O’Neil-Johnson, M., Wilson, I.A., Stevens, R.C., Wuthrich,K., 2005. Towards miniaturization of a structural genomics pipeline usingmicro-expression and microcoil NMR. J. Struct. Funct. Genomics 6 (4), 259–267.

Price 2nd, W.N., Chen, Y., Handelman, S.K., Neely, H., Manor, P., Karlin, R., Nair, R.,Liu, J., Baran, M., Everett, J., Tong, S.N., Forouhar, F., Swaminathan, S.S., Acton, T.,Xiao, R., Luft, J.R., Lauricella, A., DeTitta, G.T., Rost, B., Montelione, G.T., Hunt, J.F.,2009. Understanding the physical properties that control protein crystallizationby analysis of large-scale experimental data. Nat. Biotechnol. 27 (1), 51–57.

Punta, M., Love, J., Handelman, S., Hunt, J.F., Shapiro, L., Hendrickson, W.A., Rost, B.,2009. Structural genomics target selection for the New York consortium onmembrane protein structure. J. Struct. Funct. Genomics 10 (4), 255–268.

Rossi, P., Swapna, G.V., Huang, Y.J., Aramini, J.M., Anklin, C., Conover, K., Hamilton,K., Xiao, R., Acton, T.B., Ertekin, A., Everett, J.K., Montelione, G.T., 2010. Amicroscale protein NMR sample screening pipeline. J. Biomol. NMR 46 (1), 11–22.

Rual, J.F., Hirozane-Kishikawa, T., Hao, T., Bertin, N., Li, S., Dricot, A., Li, N.,Rosenberg, J., Lamesch, P., Vidalain, P.O., Clingingsmith, T.R., Hartley, J.L.,Esposito, D., Cheo, D., Moore, T., Simmons, B., Sequerra, R., Bosak, S., Doucette-Stamm, L., Le Peuch, C., Vandenhaute, J., Cusick, M.E., Albala, J.S., Hill, D.E., Vidal,

M., . Human ORFeome version 1.1: a platform for reverse proteomics. GenomeRes. 14 (10B), 2128–2135.

Sharma, S., Zheng, H., Huang, Y.J., Ertekin, A., Hamuro, Y., Rossi, P., Tejero, R., Acton,T.B., Xiao, R., Jiang, M., Zhao, L., Ma, L.C., Swapna, G.V., Aramini, J.M., Montelione,G.T., 2009. Construct optimization for protein NMR structure analysis usingamide hydrogen/deuterium exchange mass spectrometry. Proteins 76 (4), 882–894.

Sheibani, N., 1999. Prokaryotic gene fusion expression systems and their use instructural and functional studies of proteins. Prep. Biochem. Biotechnol. 29 (1),77–90.

Shirano, Y., Shibata, D., 1990. Low temperature cultivation of Escherichia colicarrying a rice lipoxygenase L-2 cDNA produces a soluble and active enzyme ata high level. FEBS Lett. 271 (1–2), 128–130.

Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I.A., Lesley, S.A., Godzik, A.,2007. XtalPred: a web server for prediction of protein crystallizability.Bioinformatics 23 (24), 3403–3405.

Snyder, D.A., Chen, Y., Denissova, N.G., Acton, T., Aramini, J.M., Ciano, M., Karlin, R.,Liu, J., Manor, P., Rajan, P.A., Rossi, P., Swapna, G.V., Xiao, R., Rost, B., Hunt, J.,Montelione, G.T., 2005. Comparisons of NMR spectral quality and success incrystallization demonstrate that NMR and X-ray crystallography arecomplementary methods for small protein structure determination. J. Am.Chem. Soc. 127 (47), 16505–16511.

Vinarov, D.A., Newman, C.L. Loushin, Markley, J.L., 2006. Wheat germ cell-freeplatform for eukaryotic protein production. FEBS J. 273 (18), 4160–4169.

Walden, H., 2010. Selenium incorporation using recombinant techniques. ActaCrystallogr. D: Biol. Crystallogr. 66 (Pt 4), 352–357.

Wang, I.K., Hsieh, S.Y., Chang, K.M., Wang, Y.C., Chu, A., Shaw, S.Y., Ou, J.J., Ho, L.,2006. A novel control scheme for inducing angiostatin-human IgG fusionprotein production using recombinant CHO cells in a oscillating bioreactor. J.Biotechnol. 121 (3), 418–428.

Woods Jr., V.L., Hamuro, Y., 2001. High resolution, high-throughput amidedeuterium exchange-mass spectrometry (DXMS) determination of proteinbinding site structure and dynamics: utility in pharmaceutical design. J. Cell.Biochem. Suppl. 37, 89–98.

Yee, A.A., Savchenko, A., Ignachenko, A., Lukin, J., Xu, X., Skarina, T., Evdokimova, E.,Liu, C.S., Semesi, A., Guido, V., Edwards, A.M., Arrowsmith, C.H., 2005. NMR andX-ray crystallography, complementary tools in structural proteomics of smallproteins. J. Am. Chem. Soc. 127 (47), 16512–16517.

Zhang, Q., Horst, R., Geralt, M., Ma, X., Hong, W.X., Finn, M.G., Stevens, R.C.,Wuthrich, K., 2008. Microscale NMR screening of new detergents for membraneprotein structural biology. J. Am. Chem. Soc. 130 (23), 7357–7363.

Zhao, L., Zhao, K., Hurst, R., Slater, M., Acton, T.B., Swapna, G.V.T., Shastri, R.,Kornhaber, G.J., Montelione, G.T., 2010. Engineering of a wheat germ expressionsystem to provide compatibility with a high throughput pET-based cloningplatform. J. Struct. Funct. Genomics. doi:10.1007/s10969-010-9093-8.

Zhu, B., Cai, G., Hall, E.O., Freeman, G.J., 2007. In-fusion assembly: seamlessengineering of multidomain fusion proteins, modular vectors, and mutations.Biotechniques 43 (3), 354–359.