Trends in Genetics, October 2013


Editor
Rhiannon Macrae

Portfolio Manager
Milka Kostic

Journal Manager
Basil Nyaku

Journal Administrators
Ria Otten and Patrick Scheffmann

Advisory Editorial Board
K.V. Anderson, New York, USA
A. Clark, Ithaca, USA
G. Fink, Cambridge, USA
S. Gasser, Geneva, Switzerland
D. Goldstein, Durham, USA
L. Guarente, Cambridge, USA
Y. Hayashizaki, Yokohama, Japan
S. Henikoff, Seattle, USA
J. Hodgkin, Oxford, UK
H.R. Horvitz, Cambridge, USA
L. Hurst, Bath, UK
E. Koonin, Bethesda, USA
E. Meyerowitz, Pasadena, USA
S. Moreno, Salamanca, Spain
A. Nieto, Alicante, Spain
C. Scazzocchio, Orsay, France and London, UK
D. Tautz, Plön, Germany
O. Voinnet, Strasburg, France
J. Wysocka, Stanford, California

Editorial Enquiries
Trends in Genetics
Cell Press
600 Technology Square, 5th floor
Cambridge MA 02139, USA
Tel: +1 617 397 2818
Fax: +1 617 397 2810
E-mail: [email protected]

Cover: In this special issue of Trends in Genetics, we turn the lens on ourselves. The articles this month focus on human genetics, with topics ranging from resources and methods to make the most of the explosion of sequencing data to evolutionary questions about mutation rates and how selection acts through pregnancy. Cover image: iStock\KameleonMedia.

October 2013 Volume 29, Number 10 pp. 555–608

Special Issue: Human Genetics

Editorial

555 Inherited uncertainty
Rhiannon Macrae

Science & Society

556 Genome sequencing for healthy individuals
Saskia C. Sanderson

Spotlight

559 LongevityMap: a database of human genetic variants associated with longevity
Arie Budovsky, Thomas Craig, Jingwei Wang, Robi Tacutu, Attila Csordas, Joana Lourenço, Vadim E. Fraifeld, and João Pedro de Magalhães

Opinions

561 The role of gene conversion in preserving rearrangement hotspots in the human genome
Jeffrey A. Fawcett and Hideki Innan

569 Human housekeeping genes, revisited
Eli Eisenberg and Erez Y. Levanon

Reviews and Feature Review

575 Properties and rates of germline mutations in humans
Catarina D. Campbell and Evan E. Eichler

585 Many ways to die, one way to arrive: how selection acts through pregnancy
Elizabeth A. Brown, Maryellen Ruvolo, and Pardis C. Sabeti

593 Finding the lost treasures in exome sequencing data
David C. Samuels, Leng Han, Jiang Li, Sheng Quanghu, Travis A. Clark, Yu Shyr, and Yan Guo

600 The role of AUTS2 in neurodevelopment and human evolution
Nir Oksenberg and Nadav Ahituv


Inherited uncertainty

Rhiannon Macrae

My college physics textbook contained an anecdote about a physics professor who used to joke that instead of giving a seminar as part of their thesis defense, students should instead demonstrate their faith in physical principles by walking over a bed of hot coals. The trick is to get your feet wet first (hence, many people walk across dewy grass before stepping on to the coals), and the moisture will create an insulating vapor barrier through a phenomenon called the Leidenfrost effect, protecting your bare skin from the heat of the coals. If walking across hot coals is the ultimate test of a physicist’s faith in the laws of the universe, the equivalent for a geneticist is having a baby (Figure 1).

Although it was not until Gregor Mendel presented his work in 1865 that inheritance was formally quantitated, humans innately understood the concept of heredity well before then. Perhaps the most pervasive evidence of this comes from breeding programs dating back to prehistoric times, in which animals or plants with desirable traits were selectively bred. Plato wrote about extending these ideas to humans, and history is full of examples of known familial diseases, such as hemophilia. The development of molecular genetics transformed these observations into a mechanistic understanding of the hereditary material, and now with the advent of genomic technologies, a full picture of inheritance is beginning to emerge. Efforts are underway to identify the genetic changes underlying every known Mendelian disorder (http://mendelian.org/) and much work has been done to demonstrate associations between genetic variants and human traits (e.g., the GIANT consortium). It is easy to see in these systematic approaches a future of predictable genetic outcomes.

The reality of the uncertainty in what lies in an individual’s DNA, however, announces itself along with the news of pregnancy. Although prenatal genetic screening is now routinely offered for some diseases, such as cystic fibrosis carrier testing or trisomy screening, thousands of known causal variants go untested, despite the feasibility of noninvasive fetal genome sequencing. Even with this new technology, the unknown variants and the dreaded ‘variants of unknown significance’ continue to pose challenges to our understanding of the genotype–phenotype relation. I suspect most expecting parents do not phrase their fears in those terms, but I would venture that most if not all are hoping not so much for a boy or a girl, but for a healthy baby. Luckily for the parents (and the human race), this wish is often granted, allowing parents to refocus all their energy on raising their healthy baby, arriving at another classic debate in genetics – nature versus nurture.

For indeed, your DNA is not your fate. Our prehistoric ancestors knew that even crops planted from the hardiest and most productive parents would fail in a drought. A catalog of all the disease-associated variants in the human genome would still only provide probabilities of outcomes in many cases, and it is difficult to imagine an algorithm sophisticated enough to consider all of the gene × environment interactions that could influence those probabilities. Add in epigenetics, and it begins to feel as though we know less about inheritance than Mendel did.

Nevertheless, we continue to put our faith in the processes that guide evolution and bring new lives into the world. It would be nice if there was a simple trick to ensure success, but for all the advice new parents receive, there is no equivalent to the suggestion to get your feet wet before walking across hot coals. Physicists are currently exploring the limits of the universe, but geneticists are still expanding the limits of what is knowable. In this Special Issue on human genetics, authors tackle this question from a variety of angles, from describing resources and methods for probing the human genome to discussing how evolution has shaped our species. As we go to press, my husband and I will be completing the 9-month pilot phase of our own human genetics project. Preliminary data indicate that it’s a healthy girl.

Figure 1. An ultrasound image at 12 weeks of pregnancy. Courtesy of Wolfgang Moroder.

Corresponding author: Macrae, R. ([email protected]).

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.07.007 Trends in Genetics, October 2013, Vol. 29, No. 10


Genome sequencing for healthy individuals

Saskia C. Sanderson

Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Genome sequencing of healthy individuals has the potential to lead to improved well-being and disease prevention, but numerous challenges remain that must be addressed to realize these benefits and, importantly, these benefits must be equitable across society.

Sequencing people, not only patients

Over the past few years, several seemingly healthy individuals have had their genomes sequenced, analyzed, and published in peer-reviewed scientific journals. These include scientist Mike Snyder at Stanford [1], eight other individuals at Stanford [2], and participants in the Personal Genome Project at Harvard [3]. There is considerable hope that whole-genome sequencing (WGS) in healthy individuals will lead to great advances in disease prevention and improved well-being [4]. However, numerous challenges and concerns exist, including the costs of analyzing and interpreting WGS data as well as the potential for adverse outcomes such as confusion, anxiety, inappropriate referrals, and overutilization of health services [5–7]. Although more research is required to evaluate these pros and cons, if implemented fairly there is a great potential for WGS to improve the lives of people regardless of whether or not they currently appear healthy.

The promise: improved health and well-being

Sequencing the first human genome took 15 years and $3 billion. Today, a human genome can be sequenced for ~$3000 in a few days, and costs are expected to continue to fall. Although WGS is currently used primarily for clinical diagnostic and research purposes, WGS in seemingly healthy individuals has the promise to empower them to take greater control of their lives, and to take action to prevent diseases earlier and more effectively. In the future, WGS may provide healthy individuals with carrier information relevant to reproductive decision-making and pharmacogenomic information to inform drug prescribing and dosage. It may also identify people who appear healthy – but who have rare variants that greatly increase their risk of cancer or a cardiac event [8], or combinations of common variants that modestly increase their risk of common, complex diseases such as type 2 diabetes [2] or psychiatric conditions such as bipolar disorder. This may enable doctors to intervene with medications or procedures, and/or motivate individuals to make risk-reducing changes themselves, such as losing weight, quitting smoking, reducing stress, improving medication adherence, or increasing screening. There is significant commercial as well as academic and public health interest in capitalizing on these potential advantages.

The challenges along the way

There are also significant challenges to applying WGS in the context of healthy individuals. WGS for a healthy individual is an open-ended investigation: the sheer volume of data that could potentially be informative is currently overwhelming [9]. The nature of the data challenges current notions of what can be guaranteed regarding confidentiality and privacy [10]. Other policy aspects, such as those related to discrimination and insurance [7], as well as logistical issues including storage of such vast amounts of data [5] and access within electronic healthcare records [4], must also be considered.

The volume of data produced poses particular challenges regarding analysis and interpretation [7]. Today, it takes many person-hours to curate, analyze and interpret the thousands of variants arising from WGS that may be significant for a healthy individual. Vast amounts of work are involved in translating the raw data into comprehensive but easy-to-understand results that can confidently be communicated back to the individual. Although the ACMG provides guidelines regarding the return of incidental findings in clinical settings [11], deciding where to draw the line between known pathogenic and suspected pathogenic variants is a major barrier to rapidly interpreting WGS data for healthy individuals. It is likely to be some time before analysis and interpretation pipelines are fully automated and user interfaces enabling individuals to access results in meaningful ways are developed and widely adopted.

Ethical considerations, including the implications for family members [7], also pose important challenges for WGS for healthy individuals. Crucially, the question of the appropriate age at which to consider introducing WGS needs to be addressed. This was highlighted by the ACMG guidelines, which recommended returning incidental findings about specific, high-penetrance variants regardless of age [11], sparking considerable debate. The notion of children or adolescents having their genomes sequenced, particularly without an immediate clinical need, is ethically challenging and raises important questions around assent and consent. However, the value of waiting until adulthood before implementing WGS is also debatable.

In addition, healthcare providers are unprepared for the deluge of genomic data that WGS produces: they typically have minimal understanding of genomics and lack confidence in their ability to interpret genomic information for their patients. Some genomics education efforts for healthcare providers are underway, but more are urgently needed.

Corresponding author: Sanderson, S.C. ([email protected]).

New models of consent and return of results are needed

As Biesecker emphasized, WGS ‘is a resource, not a test’ [12]. This is particularly true for healthy individuals. In the future, WGS results will not be offered at a single moment in time. Instead, the individual or clinician will interrogate the data in different ways over time depending on life-stage, circumstances, and evolving genomics knowledge. This has implications for consent and counseling because it poses a challenge to how informed consent is conceptualized. To make informed decisions about WGS, individuals should be helped to understand the potential risks, benefits, and uncertainties of WGS, and think fully through how potential results would make them think, feel, and act. However, this is virtually impossible when WGS results could pertain to any disease or trait in the world, and the interpretation of the results will continue to evolve with ongoing research. Patient expectations about the potential outcomes of WGS must be realistically set both during informed consent and via public education initiatives.

In addition to consent, models for the return of results will need to be modified. Traditional genetic counseling models involve hours of in-person education and support from already overstretched genetic counselors [5], which is clearly unsustainable in this new context. Novel multi-media approaches to patient education are needed to help patients make informed decisions about WGS [13], particularly when there is no primary phenotype of immediate concern. In addition, whether individual preferences regarding return of specific WGS results should be taken into account remains an open question. On the one hand, the ACMG suggests that it is impractical to incorporate patient preferences regarding incidental findings into the WGS process [11]. On the other, some investigators are already building novel, dynamic, multi-media tools to assess and incorporate patient preferences into WGS pipelines [13] (http://www.my46.org).

Will WGS affect behaviors and emotions?

Although early studies found little evidence that genetic risk information influenced individual health behaviors such as quitting smoking [14], these ‘proof-of-principle’ studies tested for single variants of low penetrance, and it is therefore not surprising that there was little impact upon individual perceptions of disease threat or subsequent motivation to change behavior, given the small effects on disease risk and the lack of objective clinical benefit that could be achieved from this knowledge. Our understanding of genomic influences on disease is rapidly increasing, however, and current investigations in which complex, multi-scale personal information about healthy individuals is generated based on WGS information integrated with multiple other ‘omics’ data [1,2] bear little resemblance to those early studies in which individuals were tested for one single-nucleotide polymorphism (SNP) or variant of similarly low penetrance [14], or selection of SNP-based risk scores.

Similarly, early studies did not find significant emotional impacts from personal genomic information [15]. However, again, these were not based on WGS, and there is far greater potential for WGS to produce unanticipated results that may be valued by one individual, but completely devastating to another. The potential for emotional harm from WGS should not be underestimated – nor should it be overstated. One trial funded by the US National Institutes of Health (NIH), the MedSeq Project (http://www.genomes2people.org/the-medseq-project), is beginning to explore these issues. More evidence from randomized trials with larger samples of diverse populations is needed before conclusions about behavioral and emotional effects of WGS on healthy individuals can be drawn.

Given the limited evidence-base today, the loud skepticism regarding the potential for genomic information to succeed in motivating people to make health-protective behavioral changes where other efforts have failed is understandable. Behavior change is unquestionably hard, but this should propel us to continue exploring whether WGS together with other emerging self-monitoring and big data applications will help change behaviors. It is imperative that we do this in an ethically responsible way that minimizes the potential for harms. The jury is still out, and the behavioral and emotional effects of personal WGS information remain to be seen.

Equitable access for all

Most healthy individuals who have had their genomes sequenced to date are early adopters, scientists experimenting on themselves, or people with the means and resources to obtain WGS through initiatives such as the Illumina Understand Your Genome conferences (http://www.understandyourgenome.com). This self-experimentation is valuable while pipelines are still being built and challenges regarding results communication are still being tackled. Simultaneous efforts are needed, however, to ensure that WGS does not contribute to the already wide health disparities across society. The declining costs of WGS will undoubtedly be pivotal, as will efforts already underway to broaden genomics research to include underrepresented populations. Furthermore, explicit efforts are needed to ensure that informed consent procedures are accessible and appropriate for people with lower literacy levels, patient education materials are developed that are accessible and understandable, results are communicated in ways that are easy to understand by people across a spectrum of educational attainment, and WGS is accessible to individuals from all walks of life, not only those with the greatest resources. Only then will the promise of WGS be truly realized.

Acknowledgments

I am deeply indebted to Barbara Biesecker, Robert Green, Muin Khoury, Eric Schadt, Jo Waller, and Ron Zimmern for their valuable feedback on an earlier draft of this article.

References

1 Chen, R. et al. (2012) Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307

2 Patel, C.J. et al. (2013) Whole genome sequencing in support of wellness and health maintenance. Genome Med. 5, 58

3 Angrist, M. (2009) Eyes wide open: the personal genome project, citizen science and veracity in informed consent. Pers. Med. 6, 691–699

4 Burn, J. (2013) Should we sequence everyone’s genome? Yes. BMJ 346, 3133

5 Brunham, L.R. and Hayden, M.R. (2012) Whole-genome sequencing: the new standard of care? Science 336, 1112–1113

6 Flinter, F. (2013) Should we sequence everyone’s genome? No. BMJ 346, 3132

7 Ormond, K.E. et al. (2010) Challenges in the clinical application of whole-genome sequencing. Lancet 375, 1749–1751

8 Evans, J.P. et al. (2013) We screen newborns, don’t we? Realizing the promise of public health genomics. Genet. Med. 15, 332–334

9 Cassa, C.A. et al. (2012) Disclosing pathogenic genetic variants to research participants: quantifying an emerging ethical responsibility. Genome Res. 22, 421–428

10 Schadt, E.E. (2012) The changing privacy landscape in the era of big data. Mol. Syst. Biol. 8, 612

11 Green, R.C. et al. (2013) ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574

12 Biesecker, L.G. (2012) Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet. Med. 14, 393–398

13 Yu, J.H. et al. (2013) Self-guided management of exome and whole-genome sequencing results: changing the results return model. Genet. Med.

14 Marteau, T.M. et al. (2010) Effects of communicating DNA-based disease risk estimates on risk-reducing behaviours. Cochrane Database Syst. Rev. 10, CD007275

15 Bloss, C.S. et al. (2011) Effect of direct-to-consumer genomewide profiling to assess disease risk. N. Engl. J. Med. 364, 524–534

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.

http://dx.doi.org/10.1016/j.tig.2013.08.005 Trends in Genetics, October 2013, Vol. 29, No. 10


LongevityMap: a database of human genetic variants associated with longevity

Arie Budovsky (1,2)*, Thomas Craig (3)*, Jingwei Wang (3)*, Robi Tacutu (3), Attila Csordas (4), Joana Lourenço (3), Vadim E. Fraifeld (1), and João Pedro de Magalhães (3)*

1 The Shraga Segal Department of Microbiology, Immunology and Genetics, Center for Multidisciplinary Research on Aging, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

2 Judea Regional Research and Development Center, Carmel 90404, Israel

3 Integrative Genomics of Ageing Group, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK

4 European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Understanding the genetic basis of human longevity remains a challenge but could lead to life-extending interventions and better treatments for age-related diseases. Toward this end we developed the LongevityMap (http://genomics.senescence.info/longevity/), the first database of genes, loci, and variants studied in the context of human longevity and healthy ageing. We describe here its content and interface, and discuss how it can help to unravel the genetics of human longevity.

Given the worldwide ageing of the population, studying the genetics of human longevity is of widespread importance [1,2]. Longevity is moderately heritable in humans (~25%), with increasing heritability with age [1], and exceptional longevity and healthy ageing in humans is an inherited phenotype [3]. Hundreds of longevity association studies have been performed in recent years and some genes associated with human longevity may be suitable targets for drug development [4]. Nonetheless, the heritability of human longevity remains largely unexplained in part due to the complexity of this phenotypic trait [1]. Thanks to advances in next-generation sequencing and genome-wide approaches, the capacity of longevity association studies is increasing. The growing amounts of data being generated also increase the complexity of the data analysis and the difficulty of placing findings in context of previous studies. We created the LongevityMap (http://genomics.senescence.info/longevity/), the first catalogue of human genetic variants associated with longevity, to serve as a reference to help researchers navigate the rising tide of data related to human longevity.

The LongevityMap is a new addition to our already highly successful collection of online databases and tools on the biology and genetics of ageing, the Human Ageing Genomic Resources (http://genomics.senescence.info/) [5]. GenAge, our existing database of ageing-related genes, focuses mostly on genes modulating longevity in model organisms plus the few genes associated with human progeroid syndromes [5], and thus there is an unmet need for a database of human genetic variants associated with longevity. As such, we followed the high standards and rigorous procedures of GenAge to develop the LongevityMap. Briefly, all entries in the LongevityMap were manually curated from the literature. Studies were selected following an in-depth literature survey. The LongevityMap is an inclusive database in which both large and small studies are included; different types of study are featured, from cross-sectional studies to studies of extreme longevity (e.g., centenarians). However, studies focused on cohorts of unhealthy individuals at baseline, such as cancer patients, were excluded. Details on study design are provided for each entry, including a brief description of the type of study, population ethnicity, sample size, age of probands and controls, and any gender bias. Negative results are also integrated in the LongevityMap to provide visitors with as much information as possible regarding each gene, variant, and locus previously studied in the context of longevity. Each entry refers to a specific observation from a study. This means that studies, and large-scale studies in particular, can have multiple entries in the LongevityMap, reflecting different results and observations. Each entry also includes a brief description of the major conclusions. Entries are flagged regarding whether results were statistically significant or not, though many studies have marginal or indicative results that require a brief explanation of the findings. Our policy concerning controversial and subjective results is to detail the facts concerning the controversy and let users form their own opinions. A link to the primary publication in PubMed is always included in each entry.

Corresponding author: de Magalhaes, J.P. ([email protected]).

Keywords: ageing; genetics; GWAS; humans; lifespan; polymorphisms.

* These authors contributed equally to this work.

We developed an intuitive, user-friendly interface for the LongevityMap that allows users to query genes, variants (including by reference SNP ID number), studies, and cytogenetic locations (Figure 1A). Users can browse/filter the data by association (i.e., significant or non-significant), population, and chromosome. For each single nucleotide polymorphism (SNP) and gene, additional annotation was retrieved from the US National Center for Biotechnology Information (NCBI) databases dbSNP and RefSeq [6] to provide further information on genes associated with SNPs and gene function, respectively. Homologues in model organisms were obtained from the InParanoid database [7]. Links are widely implemented to allow users to identify quickly other entries related to a given study, gene, or variant. In fact, each gene in the LongevityMap has a gene-centric page that aggregates and condenses the information on the database taken from different studies. In addition, the LongevityMap is fully integrated with our other ageing-related databases to provide users with selected, relevant information. In particular, crosslinks to GenAge are included to indicate genes associated with progeroid syndromes and those with homologues in model organisms known to modulate ageing/longevity. If appropriate, links to other major databases, such as Ensembl, Swiss-Prot, dbSNP, HapMap, and NCBI Entrez, are included for each entry. At time of writing, the LongevityMap includes data from 246 studies, featuring 751 different genes and 1987 variants (Figure 1B). Similarly to our other ageing-related databases, the LongevityMap is freely available online under a Creative Commons Attribution license. The full dataset is available for download and third-party use. It is our hope that the LongevityMap will serve as a novel database to help researchers decipher the genetics of human longevity.
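As an illustration of the entry-per-observation design and the browse/filter options described above, the following is a minimal sketch in Python. It is not the LongevityMap schema or download format: the `Entry` fields, the `filter_entries` and `gene_summary` helpers, and every value in `ENTRIES` (gene names, refSNP IDs, populations, PubMed IDs) are hypothetical placeholders chosen only to mirror the attributes named in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    """One observation from one study (hypothetical schema, not the real one)."""
    gene: str
    variant: Optional[str]  # refSNP-style ID, or None for a gene-level entry
    population: str
    chromosome: str
    significant: bool       # flag: statistically significant or not
    pubmed_id: int          # link to the primary publication (made-up IDs below)


def filter_entries(
    entries: Iterable[Entry],
    *,
    significant: Optional[bool] = None,
    population: Optional[str] = None,
    chromosome: Optional[str] = None,
) -> List[Entry]:
    """Keep entries that match every criterion that is not None."""
    kept = []
    for e in entries:
        if significant is not None and e.significant != significant:
            continue
        if population is not None and e.population != population:
            continue
        if chromosome is not None and e.chromosome != chromosome:
            continue
        kept.append(e)
    return kept


def gene_summary(entries: Iterable[Entry]) -> Dict[str, Dict[str, int]]:
    """Gene-centric view: count significant and non-significant entries per gene."""
    summary = defaultdict(lambda: {"significant": 0, "non_significant": 0})
    for e in entries:
        key = "significant" if e.significant else "non_significant"
        summary[e.gene][key] += 1
    return {gene: dict(counts) for gene, counts in summary.items()}


# Placeholder data: a study can contribute several entries, and negative
# results are kept alongside positive ones.
ENTRIES = [
    Entry("GENE_A", "rs0000001", "Population X", "1", True, 1),
    Entry("GENE_A", "rs0000001", "Population Y", "1", False, 2),
    Entry("GENE_A", None, "Population X", "1", True, 3),
    Entry("GENE_B", "rs0000002", "Population X", "19", False, 4),
]

if __name__ == "__main__":
    print(filter_entries(ENTRIES, significant=True, population="Population X"))
    print(gene_summary(ENTRIES))
```

Keeping one record per observation, rather than one per study or per gene, is what lets conflicting results for the same variant sit side by side; the gene-centric summary is then just an aggregation over those records.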

Acknowledgements

The authors wish to thank Joana Costa, Daniel Wuttke, and Alex Freitas for helping to collate data and for comments and suggestions. This work was funded by a Wellcome Trust grant (ME050495MES) to J.P.M. This work was also funded in part by the European Union Framework Program (FP) 7 Health Research Grant number HEALTH-F4-2008-202047 (to V.E.F.) and the Israel Ministry of Science and Technology (to A.B.). J.P.M. is also grateful for support from the Ellison Medical Foundation and R.T. is supported by a Marie Curie Intra-European Fellowship within FP7.

References

1 Christensen, K. et al. (2006) The quest for genetic determinants of human longevity: challenges and insights. Nat. Rev. Genet. 7, 436–448

2 Chung, W.H. et al. (2010) The role of genetic variants in human longevity. Ageing Res. Rev. 9 (Suppl. 1), S67–S78

3 Atzmon, G. et al. (2005) Biological evidence for inheritance of exceptional longevity. Mech. Ageing Dev. 126, 341–345

4 de Magalhaes, J.P. et al. (2012) Genome–environment interactions that modulate aging: powerful targets for drug discovery. Pharmacol. Rev. 64, 88–101

5 Tacutu, R. et al. (2013) Human ageing genomic resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033

6 NCBI Resource Coordinators (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41, D8–D20

7 Ostlund, G. et al. (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196–D203

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.

http://dx.doi.org/10.1016/j.tig.2013.08.003 Trends in Genetics, October 2013, Vol. 29, No. 10

Figure 1. LongevityMap home page which showcases the design and layout of the website as well as its multiple search options and links (A); old couple picture by Jonel Hanopol. Types and amount of data in the LongevityMap (B).

(B)
Type of data                                          Number
Entries significantly associated with longevity      249
Entries not significantly associated with longevity  255
Total entries                                         504
Genes                                                 751
Variants                                              1987 (1832 with a refSNP number)
Studies                                               246


The role of gene conversion in preserving rearrangement hotspots in the human genome

Jeffrey A. Fawcett and Hideki Innan

Graduate University for Advanced Studies, Hayama, Kanagawa 240-0193, Japan

Hotspots of non-allelic homologous recombination (NAHR) have a crucial role in creating genetic diversity and are also associated with dozens of genomic disorders. Recent studies suggest that many human NAHR hotspots have been preserved throughout the evolution of primates. NAHR hotspots are likely to remain active as long as the segmental duplications (SDs) promoting NAHR retain sufficient similarity. Here, we propose an evolutionary model of SDs that incorporates the effect of gene conversion and compare it with a null model that assumes SDs evolve independently without gene conversion. The gene conversion model predicts a much longer lifespan of NAHR hotspots compared with the null model. We show that the literature on copy number variants (CNVs) and genomic disorders, and also the results of additional analysis of CNVs, are all more consistent with the gene conversion model.

Many rearrangement hotspots are shared across species

Recombination is a major mutational mechanism that has a crucial role in producing genetic diversity. Because of its potential impact on important phenotypes, including diseases, much attention has been paid to recombination, whether it is allelic or nonallelic [1,2]. To understand the interaction between recombination and phenotypes, it is important to know how different parts of the genome differ in the rate at which recombination occurs. Recent genome-wide surveys demonstrated that the distribution of the recombination rate across the genome is far from uniform. Instead, there are several hotspots where recombination occurs at a much higher rate than in the rest of the genome [3,4]. This applies to both allelic and nonallelic recombination [5]. Given that these hotspots are especially important in producing genetic diversity, a good understanding of their characteristics should be extremely valuable.

Evolutionary approaches provide a means to investigate how these hotspots arose and have been maintained throughout evolution, which might enable us to better predict regions that affect the phenotype. A recent interesting finding is that most allelic recombination hotspots detected in the human genome do not exist in the chimpanzee genome, indicating a rapid turnover of hotspots [6,7]. This rapid turnover is at least partly because hotspots are largely determined by the fast-evolving PR domain-containing 9 (Prdm9) gene. This gene encodes a protein that contains several zinc finger domains and is able to bind motifs that are overrepresented in recombination hotspots [8]. Single mutations in Prdm9 or its binding motif can be sufficient to alter the recombination activity [9–11]. This means that hotspots are determined by human-specific factors, which ultimately raises the question of whether studying the genomes of other primate species would be useful in understanding the role of recombination in shaping the pattern of genetic diversity in the human genome.

The situation seems to be different for hotspots of nonallelic recombination, the major cause of genomic rearrangements such as duplications, deletions, and inversions. Recent studies of CNVs in various primate species have shown that CNV hotspots are often shared across species, even between human and macaque [12–15]. This suggests that nonallelic recombination hotspots have a longer lifespan than do hotspots of allelic recombination. This is related to the key mechanism of nonallelic recombination, that is, NAHR. Highly similar homologous sequences, or segmental duplications (SDs), serve as substrates for NAHR, which causes the duplication or deletion of the intervening region (or inversion in the case of inverted SDs) (Figure 1A). Although nonallelic recombination pathways other than NAHR also have a large role in generating CNVs [16,17], it is thought that NAHR hotspots remain active for a longer period of time and are largely responsible for generating recurrent rearrangements. For the sake of clarity, here we define NAHR hotspots as SD pairs that are initiating recurrent NAHR. Therefore, each new duplication creates a new potential hotspot even if they occur in neighboring regions that could be considered as the same fragile region, sometimes making a complicated nested structure of multiple duplications. We also assume that a long (e.g., >200 bp) stretch of perfect identity shared between the SD pair is crucial for the maintenance of the hotspot. NAHR can sometimes occur even when the perfect match is short, and the rate may also be influenced by other factors (e.g., distance between the SDs or recombinogenic sequence motifs) [3,18,19]. However, a long identical stretch is known to enhance greatly the efficiency of

Opinion

0168-9525/$ – see front matter

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.07.002

Corresponding author: Innan, H. ([email protected]).

Keywords: gene conversion; non-allelic homologous recombination; rearrangement hotspot; segmental duplication; copy number variant.

Trends in Genetics, October 2013, Vol. 29, No. 10 561


NAHR [18,20,21], which is predicted to be crucial for repeatedly generating rearrangements over a long period of time. Thus, whereas allelic recombination hotspots are largely determined by the PRDM9 motif and a small number of mutations are sufficient to cause turnovers, NAHR hotspots will potentially remain active as long as the SD contains a subregion with sufficient similarity and length. Indeed, CNV hotspots are enriched for SDs [12,14], and it has been suggested that the long-term evolution of hotspots is determined by the birth-and-death process of matching pairs of SDs [22].

An important question then is how long is the expected lifespan of an individual NAHR hotspot. We consider two evolutionary models that give different predictions regarding the lifespan of hotspots. The first is the turnover model (Figure 1B), which assumes that SDs accumulate mutations independently. According to this model, the divergence between the SDs increases in proportion to time and the SDs lose their ability to initiate NAHR as they become too divergent. Consequently, the hotspots are subject to a rapid turnover, and new SDs must constantly arise for the genome to maintain a certain number of hotspots. Thus, the turnover model predicts that hotspots would be shared only among closely related species and not between distantly related species, as has been previously suggested [22]. In the case of primates, the model predicts that it would be unlikely for hotspots to remain active for more than 25 million years, or since the divergence of human and macaque (Box 1). Therefore, the turnover model might not be sufficient to explain recent findings where several CNV hotspots are shared between human and macaque [13,15].

A model incorporating gene conversion better explains the evolution of CNV hotspots

An alternative, which we propose here, is the gene conversion model (Figure 1C). This model predicts the long-term preservation of hotspots and is supported both theoretically and empirically. The model takes into account the effect of gene conversion, a recombinational mechanism that can retard the divergence between SDs. Ongoing gene conversion results in the SDs maintaining high similarity for a long period of time. There is increasing evidence for gene conversion between SDs in various species, including humans [23,24]. It is easy to imagine that gene conversion would provide an ideal substrate for NAHR, as has been previously suggested [25]. The gene conversion model predicts that a larger number of older SDs would be associated with the current hotspots compared with the turnover model (Box 1). The potential role of gene conversion in preserving hotspots has been suggested by several case studies [25–28]. An extreme case is the polymorphic inversion on the human chromosome Xq28 region containing the filamin A (FLNA) and emerin (EMD) loci that is probably caused by NAHR between inverted duplicates. It was found that this pair of inverted duplicates is shared by various eutherian lineages and that these duplicates have recurrently caused inversions in independent lineages (at least ten times since the origin of eutherians) [27]. The sequence identity between the duplicates was found to be high in each species. Based on these observations, it was suggested that gene conversion has been homogenizing the duplicates, thus preserving the activity as a hotspot, for at least 100 million years. Another study [13], which identified several macaque CNVs, suggests that this model is applicable to some CNV hotspots in primates. Three CNV regions were identified that were shared between human and macaque where the flanking matching SD pairs in both species were clearly orthologous. In all three cases, the paralogous copies were more closely related to each other than to the orthologous copies. This indicates that gene conversion has been maintaining high

[Figure 1 image not reproduced. Panel labels: (A); (B) Active hotspot, Divergence, No more NAHR, New hotspot; (C) Active hotspot, Gene conversion, Still active.]

Figure 1. Diagram of non-allelic homologous recombination (NAHR) hotspots and two models of their evolution. (A) Illustration of NAHR between tandem segmental duplications (SDs; green arrows) that results in the duplication or deletion of the intervening region (the outcome would be an inversion if the SDs are in inverted orientation). Two models could explain the evolution of NAHR hotspots. (B) The turnover model assumes that the two SD copies diverge in proportion to time and, thus, quickly become unable to initiate NAHR. Therefore, new hotspots must constantly arise for a certain number of hotspots to remain in the genome. (C) The gene conversion model considers the effect of paralogous gene conversion, which maintains the similarity between the two copies. Therefore, the SD is able to initiate NAHR for a much longer period of time.


similarity, thereby preserving the ability to initiate NAHR in both lineages for more than 25 million years [13].

Based on further analyses of primate CNV hotspots, we show here that most SD-associated CNV hotspots are more consistent with the gene conversion model than with the turnover model. We examined a previously published data set [15], which contains CNVs identified by previous large-scale population surveys [29,30], and identified 79 cases where both ends of the CNV regions (i.e., breakpoints) lie within matching SD pairs reported in the segmental duplications database [31,32]. We assume that these CNVs were likely formed by NAHR between the flanking SDs. We first looked at the average nucleotide divergence over the entire region. The divergence was higher than the average human–chimpanzee divergence (approximately 1.3%) for almost all SDs and higher than the average human–macaque divergence (approximately 6%) for approximately one-third of the SDs (Figure 2A).

The actual ages of the SDs could be even older because gene conversion retards their divergence. Indeed, if we look at the spatial distribution of the divergence, most of the 79 SDs show a nonuniform distribution and contain identical stretches that are significantly longer than expected (70/79 at P <0.05; 43/70 at P <0.0001). Figure 2 clearly shows that the longest identical stretches of the observed data are much longer than those of the null data with the same level of divergence.
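The randomization test described in the legend of Figure 2 (diverged positions redistributed at random across the SD, 10 000 times) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the representation of an SD pair as an alignment length plus a list of differing positions, and the (hits + 1)/(n + 1) estimator of the P value are all assumptions made for the example.

```python
import random


def longest_identical_stretch(length, diff_positions):
    """Longest run of matching sites in a pairwise alignment of `length`
    sites, given the 0-based positions at which the two copies differ."""
    longest = 0
    previous = -1  # sentinel: one site before the start of the alignment
    for pos in sorted(diff_positions):
        longest = max(longest, pos - previous - 1)
        previous = pos
    # Run between the last difference and the end of the alignment.
    return max(longest, length - previous - 1)


def stretch_p_value(length, diff_positions, n_random=10_000, seed=0):
    """Compare the observed longest identical stretch with the longest
    stretch seen when the same number of differences is placed at random."""
    rng = random.Random(seed)
    observed = longest_identical_stretch(length, diff_positions)
    n_diff = len(diff_positions)
    hits = 0
    for _ in range(n_random):
        shuffled = rng.sample(range(length), n_diff)
        if longest_identical_stretch(length, shuffled) >= observed:
            hits += 1
    # Assumed estimator: adding 1 keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_random + 1)


if __name__ == "__main__":
    # Hypothetical 2-kb SD pair whose 40 differences all sit at one end.
    observed, p = stretch_p_value(2000, list(range(40)), n_random=1000)
    print(observed, p)
```

Holding the number of differences fixed means the null data have the same overall divergence as the observed pair, so only the spatial arrangement of the differences is tested.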

Gene conversion is the most likely mechanism responsible for creating these unexpectedly long stretches of perfect identity within the SDs (see Box 2 for a detailed discussion on the divergence process of SDs undergoing gene conversion). The action of gene conversion between the matching SD pairs can be better demonstrated by a comparative genomics approach where SD sequences of multiple species are compared [23,33]. Consider an SD pair in human, Xh and Yh, and their orthologs in chimpanzee, Xc and Yc. Gene conversion will create sites where Xh and

Box 1. The lifespan of NAHR hotspots under the turnover model and the gene conversion model

How long are NAHR hotspots expected to remain in the genome? The gene conversion model predicts that hotspots will remain active for a longer period of time compared with the null turnover model. We illustrate this using a simple computation. The time period is measured by the probability that the SD pair retains an identical stretch of ≥200 bp. Under the turnover model, we consider three different lengths of the SD (1, 10, and 100 kb). Although the requirement of ≥200-bp perfect identity is a simplified assumption, this computation provides an approximation of how long a hotspot should remain active and how gene conversion affects its longevity. We note that using different length requirements and changing the values of the parameters shown in Figure I do not affect the overall pattern.

As shown in Figure I (red, green, and blue lines for 1, 10, and 100 kb, respectively), the probability quickly drops, especially when the length of the SD is short. A hotspot as old as the human–chimpanzee divergence is still likely to be active (unless short), whereas a hotspot as old as the human–macaque divergence (approximately 25-million years old) is highly unlikely to be active (even for an SD as long as 100 kb) (Figure I). Thus, the lifespan of a hotspot in primates is likely to be between 5 and 25 million years under the turnover model with no gene conversion.

The situation dramatically changes under the gene conversion model. We added the effect of gene conversion using three different gene conversion rates for the case of a 10-kb SD (shown by green-dashed lines in Figure I). Including the effect of gene conversion increases the probability that NAHR will still occur after a given amount of time, especially when the rate of gene conversion is high. The rate of gene conversion should be highly variable because it is determined by several factors [60]. Thus, gene conversion can substantially increase the longevity of an NAHR hotspot.

[Figure I plot not reproduced. x-axis: Time (million years), 0 to 30, with vertical lines labeled Chimp, Orangutan, and Macaque; y-axis: Probability, 0.0 to 1.0. Curves: 1 kb, c = 0; 10 kb, c = 0; 100 kb, c = 0; 10 kb, c = 5 × 10^-8; 10 kb, c = 3 × 10^-8; 10 kb, c = 1 × 10^-8.]

Figure I. The probability that a given segmental duplication (SD) pair of 1 kb, 10 kb, and 100 kb (red, green, and blue lines, respectively) will retain an identical stretch of ≥200 bp based on 10 000 simulation runs. The expected probability was calculated by a simulation following the model in [61]. The model assumes random accumulation of point mutations at a rate of 10^-9/site/generation and that gene conversion occurs at a given rate c per site (see [61] for details). The red, green, and blue solid lines represent simulation results of SDs of 1 kb, 10 kb, and 100 kb when c = 0, and the green-dashed lines represent results of a 10-kb SD when c = {1,3,5} × 10^-8 with an average tract length of 1 kb (1/Q = 0.1 in [61]) representing low, intermediate, and high gene conversion rates. The vertical gray lines approximately correspond to the divergence between human and chimpanzee, orangutan, and macaque.
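The c = 0 (turnover) case of Box 1 can be approximated with a short simulation. The sketch below is not the model of [61] and is not expected to reproduce Figure I quantitatively. It assumes that every site differs independently with probability 1 - exp(-2 × mu × time), ignores back and parallel mutations, and leaves out gene conversion entirely. The function names and the choice to measure mu and time in the same arbitrary unit are assumptions made for the example.

```python
import math
import random


def retains_identical_stretch(sd_length, p_diff, min_stretch, rng):
    """True if sd_length sites, each differing independently with probability
    p_diff, contain a run of at least min_stretch identical sites."""
    if sd_length < min_stretch:
        return False
    if p_diff <= 0.0:
        return True
    if p_diff >= 1.0:
        return False
    log_q = math.log1p(-p_diff)  # log of the per-site identity probability
    position = 0  # start of the current run of identical sites
    while position < sd_length:
        # Length of the identical run before the next difference (geometric).
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        run = min(int(math.log(u) / log_q), sd_length - position)
        if run >= min_stretch:
            return True
        position += run + 1  # skip the run and the differing site after it
    return False


def retention_probability(sd_length, time, mu=1e-9, min_stretch=200,
                          n_runs=10_000, seed=0):
    """Fraction of simulated SD pairs that keep a >= min_stretch identical run.
    mu is per site per unit of `time` (assumed units; no gene conversion)."""
    rng = random.Random(seed)
    # Either copy may mutate, hence the factor of 2.
    p_diff = 1.0 - math.exp(-2.0 * mu * time)
    kept = sum(retains_identical_stretch(sd_length, p_diff, min_stretch, rng)
               for _ in range(n_runs))
    return kept / n_runs


if __name__ == "__main__":
    for sd_length in (1_000, 10_000, 100_000):
        for time in (5e6, 25e6):
            print(sd_length, time,
                  retention_probability(sd_length, time, n_runs=1_000))
```

Sampling run lengths from a geometric distribution avoids looping over every site, which keeps the 100-kb case fast in pure Python. Adding gene conversion would require tracking conversion tracts as in [61] and is deliberately left out.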


Yh share the same nucleotide and Xc and Yc share another nucleotide. Although strong purifying selection can also create regions of low divergence, significant clustering of such sites cannot be explained by selection and is considered a strong signature of gene conversion [33,34]. Despite the genomic regions containing SDs often being poorly sequenced and/or assembled in nonhuman species, we were able to identify both copies of the SDs in the genome of another primate species for 35 out of the 79 cases. In almost all of those cases (34/35), we found regions that showed strong signatures of gene conversion. These results suggest that gene conversion and the retention of regions of perfect identity are common features of SD pairs in CNV regions, which directly results in the long-term preservation of the CNV hotspots detected by population surveys, that is, common CNVs.
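The site pattern described above (Xh and Yh sharing one nucleotide, Xc and Yc sharing another) can be located in a four-way alignment with a few lines of code. The sketch below is illustrative only: the function name and the assumption that the four sequences are already aligned to equal length are not from the article, and the statistical test for clustering of such sites [33,34] is not implemented.

```python
def conversion_signature_sites(xh, yh, xc, yc):
    """Return 0-based alignment columns where the two human paralogs share one
    nucleotide and the two chimpanzee paralogs share a different one."""
    if len({len(xh), len(yh), len(xc), len(yc)}) != 1:
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    sites = []
    for i, column in enumerate(zip(xh.upper(), yh.upper(),
                                   xc.upper(), yc.upper())):
        # Skip columns with gaps or ambiguous bases.
        if not all(base in valid for base in column):
            continue
        a, b, c, d = column
        if a == b and c == d and a != c:
            sites.append(i)
    return sites


if __name__ == "__main__":
    # Hypothetical toy alignment: only column 2 shows the pattern.
    print(conversion_signature_sites("ACGTA", "ACGTA", "ACTTA", "ACTTC"))
```

The returned positions are only candidate sites. As the text notes, it is the significant clustering of such sites, not their mere presence, that is taken as a signature of gene conversion.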

The gene conversion model also applies to regions associated with genomic disorders

Does this typical pattern also apply to CNVs that cause genomic disorders, whose frequencies are often too low to be detected by a population survey? According to the literature, the answer seems to be yes. Dozens of ‘known’ disorders are often caused by NAHR between SDs (also referred to as low copy repeats) [17,35–37]. For 14 of them, we were able to identify unambiguously SDs containing NAHR breakpoints in the current human genome assembly (Table 1). These included two well-studied cases where both copies of the matching SD pair have been identified in other primate genomes and the action of gene conversion has been documented. One is the deletion of the azoospermia factor a (AZFa) locus on chromosome Y that is associated with male infertility (Table 1, #1). This locus is flanked by direct repeats and both copies are present in the orthologous regions of chimpanzee and gorilla [25]. The rearrangement breakpoints map to two specific regions within the duplicates. One region shows 1285 bp of perfect identity and the other contains one single mismatch over 1609 bp, despite some other regions showing <90% identity. Strong signatures of gene conversion were reported in these two breakpoint regions [25,38]. The other example is the coagulation factor VIII (F8) locus, which contains two pairs of inverted repeats (Table 1, #2). Inversion between either pair causes hemophilia A. Despite originating before the divergence of human and African green monkey (and, thus, macaque), both pairs exhibit >99% identity [26]. It is interesting to note that hemophilia A caused by the inversion of the same region due to NAHR has also been reported in dog, although it is not clear whether the inversion is mediated by repeats ancestral to human and dog [39].

In addition to these two cases, we found five cases in which the orthologous copies of the matching SD pairs could be identified in at least one of the chimpanzee, orangutan, or macaque genomes (Table 1, #3–7). Each SD pair exhibited evidence of gene conversion. One interesting case is the SD pair associated with Incontinentia Pigmenti (Table 1, #7), a severe X-linked disorder that is lethal in males. The main cause of this disease is a genomic deletion that eliminates exons 4–10 of the inhibitor of kappa light polypeptide gene enhancer in B-cells, kinase gamma (NEMO/IKBKG) gene, which is located on Xq28. This deletion is caused by NAHR between two identical MER67B repeated sequences of 878 bp, one

[Figure 2 plots not reproduced. Panel (A), "Observed distribution": x-axis Nucleotide divergence (0.00 to 0.15); y-axis Observed longest identical stretch (bp), 0 to 2500; key: P ≥ 0.05, P < 0.05, P < 0.01, P < 0.0001. Panel (B), "Null distribution": x-axis Nucleotide divergence (0.00 to 0.15); y-axis Expected longest identical stretch (bp), 0 to 2500. Vertical lines in both panels are labeled Chimp, Orangutan, and Macaque.]

Figure 2. The probability for observing the longest identical stretch present in the segmental duplications (SDs) flanking the copy number variants (CNVs). (A) The observed longest identical stretch (bp) within each SD pair flanking a CNV region is plotted against the divergence level. The significance of the observed length for each SD was evaluated by creating 10 000 random patterns of divergence where the diverged nucleotide positions are distributed randomly across the entire SD, and are shown as filled squares, triangles, and circles when significant (P <0.05, <0.01, and <0.0001, respectively), and by open circles when not significant. (B) Typical distribution of the longest identical stretch in the randomized data used for evaluating the significance in (A). Only some of the data are shown to demonstrate the point. The vertical gray lines show the time corresponding to the average genome-wide nucleotide divergence between human and chimpanzee, orangutan, and macaque [62].


located in intron 3 and the other located downstream of the last exon of NEMO [40,41]. Both copies were present in the orthologous regions of the genomes of chimpanzee, orangutan, and macaque. The two copies show >99% similarity in all species and exhibit strong signatures of gene conversion. This indicates that gene conversion has maintained the genomic configuration that predisposes carriers to severe disorders (at least in humans) for more than 25 million years. This Xq28 region contains several other extreme examples of extensive homogenization of ancient duplicates within approximately 1 Mb. The F8 locus associated with hemophilia A [26] and the inverted repeats at the FLNA–EMD locus [27] (both discussed above), as well as the red- and green-opsin gene duplicates undergoing frequent gene conversion [42], are all in this region. Thus, the rate of gene conversion could be elevated in this region.

Several other genomic disorders, such as Williams–Beuren syndrome, Smith–Magenis syndrome, neurofibromatosis type 1 (NF1), and DiGeorge/velocardiofacial syndrome (Table 1, #8–11), are caused by NAHR between SDs that are present in multiple copies in other primate genomes [43–48]. These reports are based on fluorescent in situ hybridization (FISH), and the ages of the exact copies involved in NAHR in humans are not clear. Nevertheless, strong signatures of gene conversion around the breakpoint regions of the SDs have been reported for all four cases [49–52]. For instance, many of the breakpoints of NAHR associated with NF1 map to a region within the 51-kb SD that shows elevated sequence identity, probably due to gene conversion, including a 700-bp identical stretch [50]. Also, several polymorphic sites shared by both SD copies, which are strong signatures of gene conversion, were detected around the breakpoint region of the SDs

Box 2. Divergence pattern of a segmental duplication undergoing gene conversion

How do SDs evolve when gene conversion frequently occurs? Following a duplication event, the divergence will remain at a low equilibrium as long as gene conversion is ongoing (see [61] for details). The accumulation of mutations or large indels will result in the termination of gene conversion and the increase of divergence in that region, whereas concerted evolution will continue in other regions. Regions undergoing gene conversion within the SD will decrease as time proceeds (Figure I). Future work will be needed to reveal the process that determines which region within the SD retains high similarity. One possibility is that any region within the SD can potentially retain high similarity because indels and point mutations accumulate randomly across the SD. Therefore, the ongoing or termination of gene conversion will occur randomly across the SD. Under this scenario, if we consider an SD pair that is shared among species, we would also expect that gene conversion would be ongoing in different regions of the SDs in each species (Figure IA). Note that when multiple species are compared, the homogenized regions will not be distributed completely randomly because of their shared evolutionary history.

We can also imagine an alternative scenario where specific regions undergo homogenization for a long period of time. If the same specific region of the two copies is under selective constraint, the divergence will remain low within that region, which will make it more likely for gene conversion to occur. Also, gene conversion might be favored in a specific region if the retention of high similarity of that region has some functional benefit. The rate of gene conversion could also be elevated locally due to, for example, the DNA structure or the presence of certain motifs. Under this nonrandom scenario, gene conversion might continue to occur at the same specific region in different species even long after their divergence (Figure IB).

[Figure I schematic not reproduced. Panels (A) and (B) each show the SD pair in Human, Chimp, and Orangutan.]

Figure I. Illustration of how duplicates diverge in the presence of gene conversion. The green bars represent regions within the segmental duplications (SDs) that are undergoing gene conversion. Regions undergoing gene conversion gradually decrease due to large indels or the accumulation of mutations. (A) Scenario where the termination of gene conversion occurs randomly throughout the SD. Regions undergoing gene conversion in each species differ, although they are not entirely independent due to their shared history. (B) Scenario where selection favors ongoing gene conversion in specific regions (blue bar) due to some functional constraint. The continuation and termination of gene conversion is not random, and the same region likely retains high similarity in each species.


associated with DiGeorge/velocardiofacial syndrome [51]. Thus, although we could not confirm the presence of both SD copies in other primate genomes for seven cases, including these four (Table 1, #8–14), possibly because these regions are repetitive and poorly assembled in other species, it is likely that gene conversion is involved in preserving the hotspots. In summary, the examples discussed here clearly show that the gene conversion model applies to SDs associated with genomic disorders, even though the rearrangements are pathological.

Concluding remarks

Here, we have shown that most SD-associated CNV hotspots have been preserved for a long period of time, much longer than hotspots of allelic recombination. Gene conversion appears to have a key role in the preservation by maintaining long stretches (e.g., several hundred bases) of perfect identity within SD pairs that can serve as substrates for NAHR. This has implications in disease, because the preservation often increases the risk of pathological rearrangements. The preservation should be determined by the balance between factors that cause the preservation (e.g., rate of gene conversion or selection favoring the preservation) and the reduction of fitness caused by the preservation (e.g., rate of NAHR or severity of the resulting disorder). Although the maintenance of stretches of high similarity by gene conversion might be promoted by selection due to a functional constraint in some cases, it is unlikely that all the homogenized regions are functional. Rather, given that most of the breakpoints in Table 1 map to repeat regions, functional constraint may not be the major contributor to the preservation. This is consistent with the observation that regions within the SDs being homogenized are different in each primate species. Thus, it seems most likely that CNV hotspots, in general, are preserved as a byproduct of gene conversion that occurs at a high enough rate to override their negative consequences. Future work involving comparative analysis of sequences from multiple species and careful modeling of the divergence process of the SDs considering the effect of gene conversion and selection should be valuable for better understanding the different factors, including selection, that are responsible for the preservation of CNV hotspots (Box 2).

The preservation of rearrangement hotspots might have had a key role in the adaptive evolution of humans. Recent studies have identified several regions within the human genome that comprise mosaic structures of duplication subunits (duplicons) as a result of recurrent duplications-within-duplications. In particular, several ‘core duplicons’ that have duplicated several times throughout evolution and are shared across multiple duplication blocks are known to contain primate-specific genes undergoing positive selection [37,53,54]. Another recent study showed that CNV regions shared among human, chimpanzee, and macaque (CNV hotspots) were significantly likely to overlap with genic regions [15]. This is in stark contrast with human-specific CNV regions, which are generally depleted of genes. Furthermore, many of the genes that overlap with CNV hotspots are evolving under positive selection, and some are evolving under balancing selection in humans [15]. It has been suggested that the genomic plasticity in these hotspot regions has provided the mutational flexibility for the residing genes to adapt to changing selective pressures [15,37,55]. If so, we further suggest that gene conversion has had an important role in maintaining

Table 1. The presence of duplicates flanking human genomic disorder regions in other species and the occurrence of gene conversion

No. | Locus | Candidate genes (a) | Associated phenotypes | Evolutionary origin (b) | Gene conversion (c) | Refs
#1 | Yq11 | AZFa | Male infertility | Gorilla | + | [25,38]
#2 | Xq28 | F8 | Hemophilia A (d) | African green monkey | + | [26]
#3 | 5q35 | NSD1 | Sotos syndrome | Orangutan (macaque) | ++ | [63,64]
#4 | 15q24 | MAN2C1, CYP11A1, STRA6 | Growth retardation and microcephaly | Orangutan | ++ | [65]
#5 | 16p11 | MAPK3, MAZ, DOC2A, SEZ6L2, HIRIP3 | Autism | Chimp | ++ | [66,67]
#6 | 17p11 | PMP22 | Charcot-Marie-Tooth type 1A | Chimp | ++ | [68–70]
#7 | Xq28 | NEMO | Incontinentia pigmenti | Macaque | ++ | [40,41]
#8 | 7q11 | GTF2I | Williams–Beuren syndrome | (macaque, gibbon) | + | [43,44,52]
#9 | 17p11 | RAI1 | Smith–Magenis syndrome | (macaque) | + | [45,49]
#10 | 17q11 | NF1 | NF1 | (gorilla) | + | [46,50]
#11 | 22q12 | BCR, USP18, GGT | DiGeorge/velocardiofacial syndrome | (macaque) | + | [47,48,51,71]
#12 | 2q13 | NPHP1 | Familial juvenile nephronophthisis | ND | – | [72]
#13 | 10q22-23 | NRG3, GRID1, BMPR1, SNCG, GLUD1 | Cognitive and behavioral abnormalities | ND | – | [73]
#14 | 17q23 | TBX2, TBX4 | Developmental delay and heart defects | ND | – | [74]

(a) Abbreviations: BCR, breakpoint cluster region; BMPR1, bone morphogenetic protein receptor 1; CYP11A1, cytochrome P450, family 11, subfamily A, polypeptide 1; DOC2A, double C2-like domains, alpha; GGT, gamma-glutamyl transferase; GLUD1, glutamate dehydrogenase 1; GRID1, glutamate receptor, ionotropic, delta 1; GTF2I, general transcription factor II i; HIRIP3, HIRA interacting protein 3; MAN2C1, mannosidase, alpha, class 2C, member 1; MAPK3, mitogen-activated protein kinase 3; MAZ, MYC-associated zinc finger protein; NPHP1, nephronophthisis 1; NRG3, neuregulin 3; NSD1, nuclear receptor binding SET domain protein 1; PMP22, peripheral myelin protein 22; RAI1, retinoic acid induced 1; SEZ6L2, seizure related 6 homolog (mouse)-like 2; SNCG, synuclein, gamma; STRA6, stimulated by retinoic acid 6; TBX, T-box; USP18, ubiquitin specific peptidase 18.

(b) The most distant species from human in which the duplicates were confirmed to be present based on genomic sequences are listed. Those not based on genomic sequences (e.g. FISH signals) are shown in brackets. Those identified in this study are in bold. ‘ND’ denotes those where the presence of both copies could not be confirmed in the genome of chimpanzee, orangutan, or macaque.

(c) + indicates duplicates where gene conversion has likely occurred; ++ indicates those that are based on this study.

(d) Caused by inversion due to NAHR between inverted duplicates. The remaining disorders are all caused by deletions due to NAHR between duplicates in direct orientation.


genomic plasticity, which most likely contributed to the adaptive evolution of the human lineage.

Almost all the duplicates we examined here showed evidence of gene conversion. This might seem at odds with previous studies that detected gene conversion in only approximately 10–15% of human duplicated gene pairs [56,57]. However, these studies did not focus on duplicates of low divergence (e.g., <5% divergence) that are either young or undergoing extensive gene conversion. We predict the fraction of recently duplicated sequences containing regions still undergoing gene conversion to be substantially higher. Indeed, a study analyzing 30 multiple alignments of human duplicated sequences of <4% nucleotide divergence found evidence of sequence exchange due to gene conversion or unequal crossing over in all 30 alignments [58]. A recent population survey of CNVs in multicopy gene families also reported several cases of gene conversion [59]. Thus, there could be a large number of nearly identical regions undergoing gene conversion within the genome, especially in SDs that are located close to each other. These regions could be acting as rearrangement hotspots that are yet to be identified.

The accumulating genomic data from human populations and other primate species should enable us to identify such regions undergoing gene conversion. This should be a powerful approach to detect potential hotspots of genetic disorders that are difficult to detect due to their low frequencies in the human population. In this respect, we note that many hotspot regions are likely to be missed by low-coverage genomes or resequencing studies because they are often highly repetitive. Thus, more high-quality reference genomes from nonhuman primates and also multiple human individuals in the future should be valuable in understanding perhaps the most important genomic regions in terms of human disease and human evolution.

Acknowledgments

We thank K. Teshima for technical help. This work is supported by a grant from Japan Society for the Promotion of Science (JSPS) to H.I. J.A.F. is a JSPS postdoctoral fellow.

References

1 Coop, G. and Przeworski, M. (2007) An evolutionary view of human recombination. Nat. Rev. Genet. 8, 23–34
2 Webster, M.T. and Hurst, L.D. (2012) Direct and indirect consequences of meiotic recombination: implications for genome evolution. Trends Genet. 28, 101–109
3 Myers, S. et al. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324
4 Ptak, S.E. et al. (2005) Fine-scale recombination patterns differ between chimpanzees and humans. Nat. Genet. 37, 429–434
5 Myers, S. et al. (2008) A common sequence motif associated with recombination hot spots and genome instability in humans. Nat. Genet. 40, 1124–1129
6 Winckler, W. et al. (2005) Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308, 107–111
7 Auton, A. et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science 336, 193–198
8 Ponting, C.P. (2011) What are the genomic drivers of the rapid evolution of PRDM9? Trends Genet. 27, 165–171
9 Baudat, F. et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327, 836–840
10 Myers, S. et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327, 876–879
11 Parvanov, E.D. et al. (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327, 835
12 Perry, G.H. et al. (2008) Copy number variation and evolution in humans and chimpanzees. Genome Res. 18, 1698–1710
13 Lee, A.S. et al. (2008) Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum. Mol. Genet. 17, 1127–1136
14 Gazave, E. et al. (2011) Copy number variation analysis in the great apes reveals species-specific patterns of structural variation. Genome Res. 21, 1626–1639
15 Gokcumen, O. et al. (2011) Refinement of primate copy number variation hotspots identifies candidate genomic regions evolving under positive selection. Genome Biol. 12, R52
16 Conrad, D.F. et al. (2010) Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391
17 Liu, P. et al. (2012) Mechanisms for recurrent and complex human genomic rearrangements. Curr. Opin. Genet. Dev. 22, 211–220
18 Waldman, A.S. (2008) Ensuring the fidelity of recombination in mammalian chromosomes. Bioessays 30, 1163–1171
19 Liu, P. et al. (2011) Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588
20 Jinks-Robertson, S. et al. (1993) Substrate length requirements for efficient mitotic recombination in Saccharomyces cerevisiae. Mol. Cell. Biol. 13, 3937–3950
21 Reiter, L.T. et al. (1998) Human meiotic recombination products revealed by sequencing a hotspot for homologous strand exchange in multiple HNPP deletion patients. Am. J. Hum. Genet. 62, 1023–1033
22 Alekseyev, M.A. and Pevzner, P.A. (2010) Comparative genomics reveals birth and death of fragile regions in mammalian evolution. Genome Biol. 11, R117
23 Gao, L-Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science 306, 1367–1370
24 Chen, J-M. et al. (2011) Gene conversion in human genetic disease. Genes 1, 550–663
25 Hurles, M.E. et al. (2004) Origins of chromosomal rearrangement hotspots in the human genome: evidence from the AZFa deletion hotspots. Genome Biol. 5, R55
26 Bagnall, R.D. et al. (2005) Gene conversion and evolution of Xq28 duplicons involved in recurring inversions causing severe hemophilia A. Genome Res. 15, 214–223
27 Caceres, M. et al. (2007) A recurrent inversion on the eutherian X chromosome. Proc. Natl. Acad. Sci. U.S.A. 104, 18571–18576
28 Zody, M.C. et al. (2008) Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1083
29 Conrad, D.F. et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712
30 Park, H. et al. (2010) Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat. Genet. 42, 400–405
31 Bailey, J.A. et al. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017
32 She, X. et al. (2004) Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431, 927–930
33 Osada, N. and Innan, H. (2008) Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet. 4, e1000305
34 Fawcett, J.A. and Innan, H. (2011) Neutral and non-neutral evolution of duplicated genes with gene conversion. Genes 2, 191–209
35 Stankiewicz, P. and Lupski, J.R. (2002) Molecular-evolutionary mechanisms for genomic disorders. Curr. Opin. Genet. Dev. 12, 312–319
36 Mefford, H.C. and Eichler, E.E. (2009) Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204
37 Marques-Bonet, T. et al. (2009) The origins and impact of primate segmental duplications. Trends Genet. 25, 443–454
38 Bosch, E. et al. (2004) Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14, 835–844
39 Lozier, J.N. et al. (2002) The Chapel Hill hemophilia A dog colony exhibits a factor VIII gene inversion. Proc. Natl. Acad. Sci. U.S.A. 99, 12991–12996


40 Smahi, A. et al. (2000) Genomic rearrangement in NEMO impairs NF-κB activation and is a cause of incontinentia pigmenti. Nature 405, 466–472
41 Aradhya, S. et al. (2001) A recurrent deletion in the ubiquitously expressed NEMO (IKK-γ) gene accounts for the vast majority of incontinentia pigmenti mutations. Hum. Mol. Genet. 10, 2171–2179
42 Zhao, Z. et al. (1998) Frequent gene conversion between human red and green opsin genes. J. Mol. Evol. 46, 494–496
43 DeSilva, U. et al. (1999) Comparative mapping of the region of human chromosome 7 deleted in Williams syndrome. Genome Res. 9, 428–436
44 Antonell, A. et al. (2005) Evolutionary mechanisms shaping the genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23. Genome Res. 15, 1179–1188
45 Park, S-S. et al. (2002) Structure and evolution of the Smith-Magenis syndrome repeat gene clusters, SMS-REPs. Genome Res. 12, 729–738
46 De Raedt, T. et al. (2004) Genomic organization and evolution of the NF1 microdeletion region. Genomics 84, 346–360
47 Shaikh, T.H. et al. (2000) Chromosome 22-specific low copy repeats and the 22q11.2 deletion syndrome: genomic organization and deletion endpoint analysis. Hum. Mol. Genet. 9, 489–501
48 Bailey, J.A. et al. (2002) Human-specific duplication and mosaic transcripts: the recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70, 83–100
49 Bi, W. et al. (2003) Reciprocal crossovers and a positional preference for strand exchange in recombination events resulting in deletion or duplication of chromosome 17p11.2. Am. J. Hum. Genet. 73, 1302–1315
50 Forbes, S.H. et al. (2004) Genomic context of paralogous recombination hotspots mediating recurrent NF1 region microdeletion. Genes Chromosomes Cancer 41, 12–25
51 Pavlicek, A. et al. (2005) Traffic of genetic information between segmental duplications flanking the typical 22q11.2 deletion in velo-cardio-facial syndrome/DiGeorge syndrome. Genome Res. 15, 1487–1495
52 Bayes, M. et al. (2003) Mutational mechanisms of Williams-Beuren syndrome deletions. Am. J. Hum. Genet. 73, 131–151
53 Johnson, M.E. et al. (2006) Recurrent duplication-driven transposition of DNA during hominoid evolution. Proc. Natl. Acad. Sci. U.S.A. 103, 17626–17631
54 Jiang, Z. et al. (2007) Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368
55 Iskow, R.C. et al. (2012) Exploring the role of copy number variants in human adaptation. Trends Genet. 28, 245–257
56 McGrath, C.L. et al. (2009) Minimal effect of ectopic gene conversion among recent duplicates in four mammalian genomes. Genetics 182, 615–622
57 Ezawa, K. et al. (2010) Evolutionary pattern of gene homogenization between primate-specific paralogs after human and macaque speciation using the 4-2-4 method. Mol. Biol. Evol. 27, 2152–2171
58 Jackson, M.S. et al. (2005) Evidence for widespread reticulate evolution within human duplicons. Am. J. Hum. Genet. 77, 824–840
59 Sudmant, P.H. et al. (2010) Diversity of human copy number variation and multicopy genes. Science 330, 641–646
60 Mansai, S.P. et al. (2011) The rate and tract length of gene conversion. Genes 2, 313–331
61 Teshima, K.M. and Innan, H. (2004) The effect of gene conversion on the divergence between duplicated genes. Genetics 166, 1553–1560
62 Scally, A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483, 169–175
63 Visser, R. et al. (2005) Identification of a 3.0-kb major recombination hotspot in patients with Sotos syndrome who carry a common 1.9-Mb microdeletion. Am. J. Hum. Genet. 76, 52–67
64 Kurotaki, N. et al. (2005) Sotos syndrome common deletion is mediated by directly oriented subunits within inverted Sos-REP low-copy repeats. Hum. Mol. Genet. 14, 535–542
65 Sharp, A.J. et al. (2007) Characterization of a recurrent 15q24 microdeletion syndrome. Hum. Mol. Genet. 16, 567–572
66 Kumar, R.A. et al. (2008) Recurrent 16p11.2 microdeletions in autism. Hum. Mol. Genet. 17, 628–638
67 Weiss, L.A. et al. (2008) Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med. 358, 667–675
68 Kiyosawa, H. and Chance, P.F. (1996) Primate origin of the CMT1A-REP repeat and analysis of a putative transposon-associated recombinational hotspot. Hum. Mol. Genet. 5, 745–753
69 Hurles, M.E. (2001) Gene conversion homogenizes the CMT1A paralogous repeats. BMC Genomics 2, 11
70 Lindsay, S.J. et al. (2006) A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. Am. J. Hum. Genet. 79, 890–902
71 Shaikh, T.H. et al. (2007) Low copy repeats mediate distal chromosome 22q11.2 deletions: sequence analysis predicts breakpoint mechanisms. Genome Res. 17, 482–491
72 Saunier, S. et al. (2000) Characterization of the NPHP1 locus: mutational mechanism involved in deletions in familial juvenile nephronophthisis. Am. J. Hum. Genet. 66, 778–789
73 Balciuniene, J. et al. (2007) Recurrent 10q22-q23 deletions: a genomic disorder on 10q associated with cognitive and behavioral abnormalities. Am. J. Hum. Genet. 80, 938–947
74 Ballif, B.C. et al. (2010) Identification of a recurrent microdeletion at 17q23.1q23.2 flanked by segmental duplications associated with heart defects and limb abnormalities. Am. J. Hum. Genet. 86, 454–461


Human housekeeping genes, revisited

Eli Eisenberg1 and Erez Y. Levanon2

1 Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel Aviv 69978, Israel
2 Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 52900, Israel

Housekeeping genes are involved in basic cell maintenance and, therefore, are expected to maintain constant expression levels in all cells and conditions. Identification of these genes facilitates exposure of the underlying cellular infrastructure and increases understanding of various structural genomic features. In addition, housekeeping genes are instrumental for calibration in many biotechnological applications and genomic studies. Advances in our ability to measure RNA expression have resulted in a gradual increase in the number of identified housekeeping genes. Here, we describe housekeeping gene detection in the era of massively parallel sequencing and RNA-seq. We emphasize the importance of expression at a constant level and provide a list of 3804 human genes that are expressed uniformly across a panel of tissues. Several exceptionally uniform genes are singled out for future experimental use, such as RT-PCR control genes. Finally, we discuss both the ways in which current technology can overcome some of the obstacles encountered in the past and several as yet unmet challenges.

The concept of housekeeping genes

Housekeeping genes are genes that are required for the maintenance of basal cellular functions that are essential for the existence of a cell, regardless of its specific role in the tissue or organism. Thus, they are expected to be expressed in all cells of an organism under normal conditions, irrespective of tissue type, developmental stage, cell cycle state, or external signal. From a fundamental point of view, full characterization of the minimal set of genes required to sustain life is of special interest [1,2]. In addition, housekeeping genes are widely used as internal controls for experimental as well as computational studies [3–7]. Furthermore, many studies have highlighted unique genomic and evolutionary features of this special group of genes. For example, housekeeping genes were shown to have shorter introns and exons [8–11], a different repetitive sequence environment [enriched in short interspersed elements (SINEs) and depleted in long interspersed elements (LINEs)] [12,13], more simple sequence repeats in the 5′ untranslated region (UTR) [14], lower conservation of the promoter sequence [15], and lower potential for nucleosome formation in the 5′ region of these genes [16]. Protein products of housekeeping genes are enriched in some domain families [17]. These studies shed light on general aspects of gene structure and evolution.

Early detection schemes for housekeeping genes

The notion of housekeeping genes has been in use in the literature for nearly 40 years. In particular, several mammalian genes have been used widely as internal controls in experimental expression studies, such as glyceraldehyde-3-phosphate dehydrogenase (GAPDH), tubulins, cyclophilin, albumin, actins, 18S rRNA or 28S rRNA. Yet, only at the turn of the 21st century, with the advancement of transcriptome profiling technology, did it become possible to identify, systematically, a set of housekeeping genes. These first attempts used large-scale expression data [18–20] or, more often, microarray profiling to look at the expression levels of many genes across a panel of tissue samples. Typically, they resulted in lists of hundreds to thousands of genes [8,19–25], many more than the dozen or so commonly used control genes.

Generally, the many lists produced show a considerable level of consistency. Typically, the intersection of any two of them yields approximately 50% coverage [8,24,26], suggesting that the sets are enriched in housekeeping genes but still lacking in specificity and selectivity. This could be partly attributed to the limited number of tissues examined in each separate analysis and the differences between the tissues across analyses. However, it is likely that technological limitations affecting the underlying data have contributed much to the quality and reproducibility of the results.

In particular, first-generation microarray technology is known to have had many problematic nonspecific probes [27]. Even the improved versions of microarrays are typically assumed to achieve only an approximately twofold accuracy in expression level measurement, and they are limited in their dynamic range. These inaccuracies could have large effects on deciding whether a gene is expressed (regardless of the rather arbitrary expression cutoff used to determine which probe set is ‘expressed’).

A second, more fundamental, issue relates to the very definition of housekeeping genes. Should one look for genes merely being expressed in all tissues, or should the gene also be expressed at a constant level across tissues? Early studies generally adopted the first definition and, in fact, GAPDH and other popular housekeeping genes for experimental controls have been found to vary considerably across tissues [3,28–30]. This choice was the pragmatic one to make, because it enabled the use of the binary present or absent calls of the microarray and rendered normalization unnecessary. However, this approach has two shortcomings. First, measurement errors and stochastic noise make it difficult to distinguish genes absent from the sample from those weakly expressed. Second, and more importantly, it was later appreciated that a large part of the genome is expressed at a low basal level in all tissues [31]. Thus, most genes are expressed at some background level in all tissues. In light of this observation, and to make the concept of housekeeping genes more useful, one should either modify the definition of housekeeping genes to ‘genes that are expressed above some cutoff level’, which necessarily introduces an arbitrary parameter explicitly, or rather adopt the second option above and look for genes that are expressed at a constant level across all normal tissues.

Corresponding author: Eisenberg, E. ([email protected]).
Keywords: housekeeping genes; RNA-seq; gene expression patterns; internal control; next generation sequencing.
0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.05.010
Opinion. Trends in Genetics, October 2013, Vol. 29, No. 10, p. 569.

Introducing an expression cutoff requires a quantitative comparison of expression levels of different genes in the same sample. This is known to be a complex problem, due to questions of bias in PCR amplification, different probe affinities, and so on. Furthermore, normalizing the values obtained from different experiments is also a nontrivial challenge. Early microarray studies generally used linear normalization, setting the mean expression level, or the trimmed mean, constant. Later, the more sophisticated quantile normalization was introduced [32]. These and other normalization procedures generally assume similar expression-value distributions for all samples studied. This could be justified for samples coming from identical or highly similar biological conditions, perhaps even for healthy and diseased samples of the same tissue. However, it is not yet clear how accurate this assumption is for cross-tissue comparisons, and how much it skews the results [33].
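As a rough illustration of the two normalization styles mentioned above, the following sketch contrasts linear scaling to a common mean with a basic quantile normalization. It is a minimal sketch, not the procedure of any cited study: the function names and toy values are hypothetical, and ties are broken by position instead of being averaged.

```python
def scale_to_common_mean(columns, target=1.0):
    """Linear normalization: rescale each sample so that its mean equals `target`."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        scaled.append([value * target / mean for value in col])
    return scaled


def quantile_normalize(columns):
    """Quantile normalization of equally long sample columns.

    Every sample is forced onto the same reference distribution, which is
    exactly the 'similar distributions' assumption discussed in the text.
    Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference value for rank k: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices in ascending order
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on the same three genes.
samples = [[2.0, 1.0, 3.0], [4.0, 8.0, 6.0]]
linear = scale_to_common_mean(samples)
quantile = quantile_normalize(samples)
```

After quantile normalization both samples contain the same set of values and differ only in which gene holds which rank. That is reasonable for near-identical conditions but may erase real differences between tissues.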

A third issue that was not fully addressed in previous studies of housekeeping genes is alternative splicing. It has been appreciated for more than a decade that most human genes have more than one isoform [34,35]. Thus, one could envision a situation in which one splice variant is constitutively expressed, making it a housekeeping transcript, whereas another transcript from the same gene exhibits a more complex expression profile (Figure 1A). Moreover, it is possible that a single gene expresses one transcript in one set of tissues and another transcript in other tissues, such that the gene, as such, is always expressed, but each transcript is specific to a subset of tissues. In principle, then, one would like to define the set of housekeeping transcripts. Early microarray technology did rather poorly in distinguishing between transcripts and, thus, some studies deliberately ‘zoomed out’ to the gene level.

Housekeeping genes in the deep-sequencing era

New horizons are opening as deep-sequencing technology takes over from microarrays as the method of choice for transcriptome profiling [36]. RNA-seq was found to be preferable to microarrays as a tool for expression measurement. Unlike microarrays, RNA-seq does not require prior knowledge of the genomic sequence (although it is helpful for analysis), and requires smaller amounts of RNA. It provides information at the single-base level, enabling better assessment of alternative splicing and even allelic variation. Background levels in RNA-seq are lower, due to the better specificity and improved control of in silico sequence alignment compared with probe hybridization. Consequently, a wider dynamic range is accessible. Importantly, RNA-seq is also more accurate in quantifying spike-in RNA controls of known concentration, and produces expression values that correlate better with quantitative PCR (qPCR) results [36] and protein levels [37]. This new and improved platform makes it possible to meet some challenges that have stood for many years, but it also opens up new questions.

In terms of normalization, read coverage generally provides a rather robust measure for comparing different genomic regions within the same sample. Exceptions to this are generally a result of alignment problems in repetitive or duplicative regions (Figure 1B). For the task of housekeeping gene identification, these can be partly avoided by limiting analysis to the nonrepetitive coding regions of the exons [33] and using long reads. Note, however, that highly expressed coding exons (e.g., GAPDH) are prone to having more duplications [38], resulting in alignment problems. Small-scale PCR biases are expected to be washed out when looking at the averaged expression level over whole exons. By contrast, the issue of cross-tissue normalization is still open. The popular reads per kilobase per million mapped reads (RPKM) measure takes care of normalizing for the two most obvious factors affecting the raw number of reads per gene, transcript, or exon: the total number of reads produced and the length of the gene, transcript, or exon [39]. The RPKM measure is simple and straightforward, but does not fully solve the between-sample normalization issue. More subtle biases have been detected, resulting from variations in transcript length distribution in the sample, coverage dependence on local sequence due to GC content, priming and other biases, and variability in mappability of different regions [40–45].

Figure 1. Examples of challenges in housekeeping gene detection. (A) Genes having several splice variants could have different expression levels [indicated by the number of reads (black bars)] for different parts of the gene. (B) Duplicative regions, due to pseudogenes and other duplications, complicate unique read alignments, thus biasing expression-level measurement. (C) Expression measurement has several biases, including the lower expression (on average) of the upstream exons due to imperfect reverse transcription resulting in partial cDNA molecules.


There is still no consensus as to the best way to account for all of these in a standard and consistent way.
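For concreteness, the RPKM measure described above can be written as a short function. This is a minimal sketch of the standard definition (read count divided by region length in kilobases and by library size in millions of mapped reads). The function name and the example numbers are hypothetical, and none of the subtler bias corrections mentioned in the text are attempted.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    kilobases = region_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


# Hypothetical exon: 500 reads on a 2,000-bp coding exon,
# in a library of 20 million mapped reads.
example = rpkm(500, 2_000, 20_000_000)
```

The same read count on an exon twice as long, or in a library twice as deep, gives half the RPKM. These are the only two factors the measure corrects for.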

In terms of housekeeping gene identification, RNA-seq data indeed show explicitly that basal (leaky) low expression levels can be found throughout the genome. Therefore, any definition of housekeeping genes should refer to the quantitative expression level. This can be done using a cutoff, or by adding the requirement of low variability in expression across tissues. Here, we promote the latter course of action. Setting a cutoff value as the main criterion for defining the housekeeping genes is undesirable for three reasons. First, there seems to be no natural cutoff value, thus forcing one to make an arbitrary choice. Second, due to the lack of a proper intergene normalization scheme, the same RPKM values for different genes could indicate different expression levels [4,46]. Third, using the expression level as a measure of importance for cell function is also questionable: cells are likely to require different gene products at different concentrations. There is no good reason to exclude genes that are constantly expressed at a mid rather than a high level. Thus, we feel that low variability should be used as the main criterion for selecting housekeeping genes.

Another advantage of RNA-seq data is that they measure the expression along the gene (similar to the older exon arrays) and can thereby provide expression at the exon level. Some software tools try to extract transcript expression levels from RNA-seq data (e.g., [47]). However, there is still much to be desired in terms of reliability within the limits of current technology [43]. This is expected to improve significantly, as read length increases. Note that recent findings [48] show significant variability in exon boundaries, making even the comparison of exon expression imperfect. An interim partial solution, which we adopt below, is to measure expression at the more basic exon level and aim to define a set of housekeeping exons.

Extracting a set of housekeeping genes from Human BodyMap data

Here, we demonstrate the power of the new technology for identifying housekeeping genes by analyzing expression data from the Human BodyMap (HBM) 2.0 Project. This includes publicly available RNA-seq data (GEO accession number GSE30611, HBM), generated on HiSeq 2000 instruments, providing expression profiling in 16 normal human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Two different read lengths were used for each tissue (2 × 50-bp paired-end and 1 × 75-bp single-read data), each of which was sequenced in a separate HiSeq 2000 lane.

We aligned the reads to the genome using the Bowtie2 aligner [49] and measured the read coverage of each of the coding exons of the (uniquely aligned) RefSeq sequences [50], in normalized RPKM units. For exons that were partly coding, only the coding part was considered. Short exons (<50 bp) are prone to alignment problems and were discarded. We compared the RPKM values obtained from the paired-end data and the single-read data to assess the technical reproducibility of the RPKM measure, and found that the typical fold-ratio between the two was 1.5 (Figure 2A). We observed a bias against the upstream exons of transcripts, which tended to have lower expression levels. This effect might result from imperfect reverse transcription resulting in cDNA missing the upstream part of the transcript (Figure 1C).
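The reproducibility comparison above can be illustrated with a small sketch: per-exon log2 ratios between the two data sets, and the conversion of a spread in log2 units into a fold ratio. The function names and input values are hypothetical, and skipping exons with a zero value is an assumption, not a documented step of the analysis. The width of about 0.55 log2 units is the value reported for Figure 2A.

```python
import math


def log2_ratios(rpkm_paired_end, rpkm_single_read):
    """Per-exon log2 ratio between two RPKM measurements of the same exons.

    Exons with a zero value in either data set are skipped (an assumption).
    """
    return [
        math.log2(a / b)
        for a, b in zip(rpkm_paired_end, rpkm_single_read)
        if a > 0 and b > 0
    ]


def fold_from_log2_width(width):
    """Convert a spread expressed in log2 units into a fold ratio."""
    return 2.0 ** width


# Hypothetical RPKM values for three exons in the two data sets.
ratios = log2_ratios([10.0, 20.0, 0.0], [20.0, 10.0, 5.0])

# A width of about 0.55 log2 units corresponds to roughly 1.46-fold,
# in line with the typical fold-ratio of 1.5 quoted in the text.
typical_fold = fold_from_log2_width(0.55)
```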

Figure 2. Characterization of the expression profile in Human BodyMap (HBM) data. (A) Reproducibility of the measured reads per kilobase per million mapped reads (RPKM) levels per exon, as assessed by comparing the 50-bp paired-end and the 75-bp single-read data [horizontal axis: log2(RPKM50_PE/RPKM75)]. The continuous line is the best fit for a Gaussian distribution, added to accentuate the fat tails of the actual distribution. The width of the distribution is approximately 0.55 (log2 units), leading to a typical variability of 1.5-fold. (B) Fraction of exons expressed above a cutoff value in all 16 tissues, for different cutoff values [axes: fraction of exons passing versus cutoff value (RPKM); key: minimum expression over tissues, geometric mean expression]. In total, 55% of all exons are expressed to a detectable level in the HBM data set. (C) Cumulative distribution of the exon expression variance [axes: fraction of exons below cutoff versus std[log2(RPKM)] cutoff]. Most of the exons being expressed in all tissues have standard-deviation [log2(RPKM)] values between 0.7 and 1.5.


Table 1. Genes proposed for calibration (a)

Gene symbol  RefSeq accession  Gene name                               Genomic coordinates (hg19) of exons passing the filters
C1orf43      NM_015449         Chromosome 1 open reading frame 43      chr1 154192817 154192883
                                                                       chr1 154186932 154187050
                                                                       chr1 154186368 154186422
                                                                       chr1 154184933 154185100
                                                                       chr1 154184795 154184854
CHMP2A       NM_014453         Charged multivesicular body protein 2A  chr19 59065411 59065579
                                                                       chr19 59063625 59063805
                                                                       chr19 59063421 59063552
EMC7         NM_020154         ER membrane protein complex subunit 7   chr15 34382517 34382656
                                                                       chr15 34380253 34380334
                                                                       chr15 34376537 34376687
GPI          NM_000175         Glucose-6-phosphate isomerase           chr19 34857687 34857756
                                                                       chr19 34859487 34859607
                                                                       chr19 34868639 34868786
                                                                       chr19 34869838 34869910
                                                                       chr19 34872370 34872424
                                                                       chr19 34884152 34884213
                                                                       chr19 34884818 34884971
                                                                       chr19 34887205 34887335
                                                                       chr19 34887485 34887562
                                                                       chr19 34890111 34890240
                                                                       chr19 34890460 34890536
                                                                       chr19 34890623 34890690
PSMB2        NM_002794         Proteasome subunit, beta type, 2        chr1 36101910 36102033
                                                                       chr1 36096874 36096945
                                                                       chr1 36070833 36070883
PSMB4        NM_002796         Proteasome subunit, beta type, 4        chr1 151372456 151372663
                                                                       chr1 151372917 151373064
                                                                       chr1 151373239 151373321
                                                                       chr1 151373714 151373831
RAB7A        NM_004637         Member RAS oncogene family              chr3 128525214 128525433
                                                                       chr3 128526385 128526514
                                                                       chr3 128532169 128532262
REEP5        NM_005669         Receptor accessory protein 5            chr5 112256859 112256953
                                                                       chr5 112238076 112238215
                                                                       chr5 112222711 112222880
SNRPD3       NM_004175         Small nuclear ribonucleoprotein D3      chr22 24953642 24953768
                                                                       chr22 24963951 24964144
VCP          NM_007126         Valosin containing protein              chr9 35067887 35068060
                                                                       chr9 35066671 35066814
                                                                       chr9 35064150 35064282
                                                                       chr9 35062213 35062347
                                                                       chr9 35061999 35062135
                                                                       chr9 35061573 35061686
                                                                       chr9 35061011 35061176
                                                                       chr9 35060797 35060920
                                                                       chr9 35060309 35060522
                                                                       chr9 35059489 35059798
                                                                       chr9 35059060 35059216
                                                                       chr9 35057372 35057527
                                                                       chr9 35057116 35057219
VPS29        NM_016226         Vacuolar protein sorting 29 homolog     chr12 110930800 110931036
                                                                       chr12 110929812 110929927
                                                                       chr12 110929812 110929927

(a) Genes chosen have most of their exons showing geometrical mean expression exceeding RPKM = 50, standard deviation of log2(RPKM) <0.5, and no single tissue showing an expression level different from the geometrical mean by twofold or more. Genes with pseudogenes were excluded.

Figure 2B presents the fraction of exons being expressed above a certain cutoff RPKM value in all tissues. Note that approximately 55% of all exons are expressed at a detectable level in all HBM tissues, demonstrating why the old definition of housekeeping genes is not useful. In addition, it is hard to detect a natural expression cutoff value. The variation in expression level is estimated by the standard deviation of log2(RPKM) over samples. Figure 2C shows the cumulative distribution of these standard deviation values for the different exons. To qualify as a housekeeping exon, an exon must be expressed in all tissues at any nonzero level, and must exhibit a uniform expression level across tissues. Thus, we adopted the following criteria: (i) expression observed in all tissues; (ii) low variance over tissues: standard-deviation [log2(RPKM)] <1; and (iii) no exceptional expression in any single tissue; that is, no log-expression value differed from the averaged log2(RPKM) by two (fourfold) or more. These criteria resulted in a list of 37 363 unique exons (20% of studied exons), belonging to 11 648 RefSeq transcripts and 6289 genes. These included most of the stable housekeeping genes reported based on microarray data [30].
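The three exon-level criteria can be sketched as a small filter. This is an illustrative reading of the text, not the authors' code: the function name and example values are hypothetical, and the use of the population standard deviation (instead of the sample standard deviation) is an assumption, because the text does not say which was used.

```python
import math
import statistics


def is_housekeeping_exon(rpkm_by_tissue, max_sd=1.0, max_log2_deviation=2.0):
    """Apply criteria (i)-(iii) to the RPKM values of one exon across tissues."""
    values = list(rpkm_by_tissue)
    # (i) expression observed in all tissues, at any nonzero level
    if not values or any(v <= 0 for v in values):
        return False
    logs = [math.log2(v) for v in values]
    mean_log = sum(logs) / len(logs)
    # (ii) low variance over tissues: SD of log2(RPKM) below the threshold
    # (population SD is an assumption)
    if statistics.pstdev(logs) >= max_sd:
        return False
    # (iii) no single tissue deviates from the mean log2(RPKM) by the limit or more
    return all(abs(x - mean_log) < max_log2_deviation for x in logs)


# Hypothetical exons measured in 16 tissues.
uniform = [10.0] * 16                 # passes all three criteria
silent_somewhere = [10.0] * 15 + [0]  # fails (i)
bimodal = [1.0, 16.0] * 8             # fails (ii): SD of log2 values is 2
one_outlier = [8.0] * 15 + [64.0]     # passes (ii) but fails (iii)
```

Criterion (iii) is not redundant with (ii): the last example has a standard deviation of about 0.73 log2 units, yet one tissue lies about 2.8 log2 units above the mean. Stricter thresholds, such as those in the footnote to Table 1, could be explored by passing `max_sd=0.5` and `max_log2_deviation=1.0`. The additional requirement on the geometric mean is not covered by this sketch.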

We define a housekeeping gene as a gene for which at least one RefSeq transcript has more than half of its exons meeting the previous criteria (thus being housekeeping exons). Altogether, we found 3804 such human housekeeping genes. The lists of housekeeping exons and housekeeping genes are available at http://www.tau.ac.il/~elieis/HKG/. In addition, we propose a short list of highly uniform and strongly expressed genes that may be used for calibration in future experimental settings (Table 1).
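The gene-level rule can be sketched in a few lines. The data layout (a mapping from transcript identifiers to per-exon flags) and the identifiers are hypothetical, and which exons count toward the total for a transcript (for example, only the studied coding exons) is an assumption left to the caller.

```python
def is_housekeeping_gene(exon_flags_by_transcript):
    """Return True if at least one transcript has more than half of its exons
    flagged as housekeeping exons.

    `exon_flags_by_transcript` maps a transcript identifier to a list of
    booleans, one per exon (True means the exon met the exon-level criteria).
    """
    return any(
        len(flags) > 0 and sum(flags) > len(flags) / 2
        for flags in exon_flags_by_transcript.values()
    )


# Hypothetical genes with made-up transcript identifiers.
gene_a = {"tx1": [True, True, False]}                       # 2 of 3 exons: qualifies
gene_b = {"tx1": [True, False]}                             # exactly half: does not qualify
gene_c = {"tx1": [False, False], "tx2": [True, True, True]}  # one qualifying transcript is enough
```

Because a single qualifying transcript suffices, a gene can be counted as housekeeping even when its other isoforms are tissue specific. This matches the interim exon-level approach described in the text.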

As expected, the housekeeping genes are enriched in gene ontology (GO) categories associated with basic cellular activity, such as gene expression and biogenesis of nucleotides and amino acids, catabolic processes, protein localization, and so on [51]. The overlap with previous lists is partial, due to the different definition of housekeeping genes used. In particular, GAPDH and actin beta (ACTB) do not appear in our new list, because these genes vary across tissues [3,28–30]. Nevertheless, some of the most pronounced features previously reported for housekeeping genes, such as the much shorter introns [8–11] and more duplications [52], also characterize the new set.

Concluding remarks

Current technology enables global measurement of expression levels with unprecedented accuracy. This advancement has revealed that large parts of the genome are normally expressed at a low level. Accordingly, we found that most human exons are expressed at some level in all the human tissues studied. This new technological era calls on the community to reevaluate the concept of a housekeeping gene. Here, we have presented our own perspective, suggesting the use of low expression variation as the main criterion for defining housekeeping genes. We also provide sets of exons and genes that are ubiquitously and uniformly expressed, as well as a short list of genes suitable for experimental calibration.

More high-quality deep-sequencing transcriptome profiling data are expected to emerge in the near future, enabling improvements of the analysis described here using better statistics for the tissues studied and adding more tissue types. Furthermore, including extreme pathological conditions relevant for various tissues could further purify the housekeeping genes list [53]. A significant advance should come from new experiments currently being done on single-cell transcriptome profiling [54]. This could improve the specificity in detecting housekeeping genes, narrowing the list to genes that are expressed in each and every single cell. In addition, accumulation of tissue-specific epigenetic data, such as histone marks and nucleotide methylations, could be used in the future to better distinguish regulated expression from low-level noise.

As discussed above, normalization (within a sample and across samples) is still an unresolved issue. Advancement in this direction could greatly improve housekeeping gene detection. In addition, usage of longer reads is expected to decrease alignment errors and reduce bias. Longer reads (and improved analysis tools) are expected to raise considerably the sensitivity of expression level measurement at the transcript level, enabling direct evaluation of the housekeeping splice-variants list.

In conclusion, the dramatic advancement of sequencing technologies calls for a reassessment of the notion of housekeeping genes, and allows the resolution to be improved both quantitatively and qualitatively. We thus provide updated lists of housekeeping exons and genes for public use, available at http://www.tau.ac.il/~elieis/HKG/. It is expected that emerging technologies could very soon help to meet the challenges that remain open, allowing for better and more accurate housekeeping gene profiling.

Acknowledgments
We thank Ami Haviv and Gilad Finkelstein for help with reads’ alignments, and Lily Bazak for help in gene lengths’ analysis. This work was supported by Israel Science Foundation 379/12 (EE), by the I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation (grant No 41/11) and by the Marie Curie Integration Grant 256593 (EYL).

References

1 Fraser, C.M. et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403

2 Koonin, E.V. (2000) How many genes can make a cell: the minimal-gene-set concept. Annu. Rev. Genomics Hum. Genet. 1, 99–116

3 Thellin, O. et al. (1999) Housekeeping genes as internal standards: use and limits. J. Biotechnol. 75, 291–295

4 Robinson, M.D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25

5 Dheda, K. et al. (2004) Validation of housekeeping genes for normalizing RNA expression in real-time PCR. Biotechniques 37, 112–114, 116, 118–119

6 Rubie, C. et al. (2005) Housekeeping gene variability in normal and cancerous colorectal, pancreatic, esophageal, gastric and hepatic tissues. Mol. Cell. Probes 19, 101–109

7 Vandesompele, J. et al. (2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 3, RESEARCH0034

8 Eisenberg, E. and Levanon, E.Y. (2003) Human housekeeping genes are compact. Trends Genet. 19, 362–365

9 Vinogradov, A.E. (2004) Compactness of human housekeeping genes: selection for economy or genomic design? Trends Genet. 20, 248–253

10 Carmel, L. and Koonin, E.V. (2009) A universal nonmonotonic relationship between gene compactness and expression levels in multicellular eukaryotes. Genome Biol. Evol. 1, 382–390

11 Castillo-Davis, C.I. et al. (2002) Selection for short introns in highly expressed genes. Nat. Genet. 31, 415–418

12 Eller, C.D. et al. (2007) Repetitive sequence environment distinguishes housekeeping genes. Gene 390, 153–165

13 Versteeg, R. et al. (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13, 1998–2004

14 Farre, D. et al. (2007) Housekeeping genes tend to show reduced upstream sequence conservation. Genome Biol. 8, R140

15 Lawson, M.J. and Zhang, L. (2008) Housekeeping and tissue-specific genes differ in simple sequence repeats in the 5′-UTR region. Gene 407, 54–62


16 Ganapathi, M. et al. (2005) Comparative analysis of chromatin landscape in regulatory regions of human housekeeping and tissue specific genes. BMC Bioinformatics 6, 126

17 Lehner, B. and Fraser, A.G. (2004) Protein domains enriched in mammalian tissue-specific or widely expressed genes. Trends Genet. 20, 468–472

18 Velculescu, V.E. et al. (1999) Analysis of human transcriptomes. Nat. Genet. 23, 387–388

19 Zhu, J. et al. (2008) How many human genes can be defined as housekeeping with current expression data? BMC Genomics 9, 172

20 Zhu, J. et al. (2008) On the nature of human housekeeping genes. Trends Genet. 24, 481–484

21 Chang, C-W. et al. (2011) Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PLoS ONE 6, e22859

22 Hsiao, L.L. et al. (2001) A compendium of gene expression in normal human tissues. Physiol. Genomics 7, 97–104

23 Lee, S. et al. (2007) Identification of novel universal housekeeping genes by statistical analysis of microarray data. J. Biochem. Mol. Biol. 40, 226–231

24 She, X. et al. (2009) Definition, conservation and epigenetics of housekeeping and tissue-enriched genes. BMC Genomics 10, 269

25 Warrington, J.A. et al. (2000) Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol. Genomics 2, 143–147

26 Butte, A.J. et al. (2001) Further defining housekeeping, or ‘maintenance’, genes Focus on ‘A compendium of gene expression in normal human tissues’. Physiol. Genomics 7, 95–96

27 Irizarry, R.A. et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, e15

28 Barber, R.D. et al. (2005) GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues. Physiol. Genomics 21, 389–395

29 Lee, P.D. et al. (2002) Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res. 12, 292–297

30 De Jonge, H.J.M. et al. (2007) Evidence based selection of housekeeping genes. PLoS ONE 2, e898

31 Kapranov, P. et al. (2007) Genome-wide transcription and the implications for genomic organization. Nat. Rev. Genet. 8, 413–423

32 Bolstad, B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193

33 Ramskold, D. et al. (2009) An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 5, e1000598

34 Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nat. Genet. 30, 13–19

35 Johnson, J.M. et al. (2003) Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144

36 Wang, Z. et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63

37 Fu, X. et al. (2009) Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics 10, 161

38 Zhang, Z. et al. (2003) Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558

39 Mortazavi, A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628

40 Wagner, G.P. et al. (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285

41 Dillies, M-A. et al. (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. http://dx.doi.org/10.1093/bib/bbs046

42 Dohm, J.C. et al. (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36, e105

43 Schwartz, S. et al. (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS ONE 6, e16685

44 Li, J. et al. (2010) Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 11, R50

45 Jones, D.C. et al. (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 40, e171

46 Roberts, A. et al. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22

47 Trapnell, C. et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515

48 Pelechano, V. et al. (2013) Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497, 127–131

49 Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359

50 Pruitt, K.D. et al. (2012) NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135

51 Huang, D.W. et al. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57

52 Zhang, Z. et al. (2004) Comparative analysis of processed pseudogenes in the mouse and human genomes. Trends Genet. 20, 62–67

53 Chen, M. et al. (2013) Identification of human HK genes and gene expression regulation study in cancer from transcriptomics data analysis. PLoS ONE 8, e54082

54 Tang, F. et al. (2009) mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382


Feature Review

Properties and rates of germline mutations in humans
Catarina D. Campbell1 and Evan E. Eichler1,2

1 Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
2 Howard Hughes Medical Institute, Seattle, WA 98195, USA

All genetic variation arises via new mutations; therefore, determining the rate and biases for different classes of mutation is essential for understanding the genetics of human disease and evolution. Decades of mutation rate analyses have focused on a relatively small number of loci because of technical limitations. However, advances in sequencing technology have allowed for empirical assessments of genome-wide rates of mutation. Recent studies have shown that 76% of new mutations originate in the paternal lineage and provide unequivocal evidence for an increase in mutation with paternal age. Although most analyses have focused on single nucleotide variants (SNVs), studies have begun to provide insight into the mutation rate for other classes of variation, including copy number variants (CNVs), microsatellites, and mobile element insertions (MEIs). Here, we review the genome-wide analyses for the mutation rate of several types of variants and suggest areas for future research.

The fundamental process in genetics
The replication of the genome before cell division is a remarkably precise process. Nevertheless, there are some errors during DNA replication that lead to new mutations. If these errors occur in the germ cell lineage (i.e., the sperm and egg), then these mutations can be transmitted to offspring. Some of these new genetic variants will be deleterious to the organism, and a select few will be advantageous and serve as substrates for selection. Therefore, knowledge about the rate at which new mutations appear and the properties of new mutations is critical in the study of human genetics from evolution to disease. The study of the mutation rate in humans dates back further than the discovery of the structure of DNA or the determination of DNA as the genetic material. In seminal work performed during the 1930s and 1940s, J.B.S. Haldane studied hemophilia with the assumption of a mutation–selection balance to estimate mutation rate at that locus and determined that most new mutations arose in the paternal germline [1,2]. Until recently, most mutation rate analyses were similar to this initial work in that they extrapolated rates and properties from a handful of loci (often linked to dominant genetic disorders; for example, see [3]). Over the past few years, it has become feasible to generate large amounts of sequence data (including the genomes of parents and their offspring), and it is now possible to calculate empirically a genome-wide mutation rate. In addition, much interest has focused on understanding the role of de novo mutations in human disease. Therefore, in this review, we synthesize the recent analyses of mutation rate for multiple forms of genetic variation and discuss their implications with respect to human disease and evolution.

SNV mutation rate
It is now feasible to perform whole-genome sequencing on all individuals from a nuclear family; from these data, one can identify de novo mutations that ‘disobey’ Mendelian inheritance (Box 1, Figure I). The first two papers to apply this approach were limited in scope to three families [4,5], thus restricting the total number of de novo SNVs observed. Even with this limitation, these two analyses reported similar overall mutation rates of approximately 1 × 10⁻⁸ SNV mutations per base pair per generation, although there was considerable variation in families [4,5]. A more recent study using whole-sequence data from 78 Icelandic parent–offspring trios suggested a higher rate of 1.2 × 10⁻⁸ SNVs per generation from de novo mutations [6]. Another study used autozygous segments (see Glossary) in the genomes of Hutterite trios, who were descended from a 13-generation pedigree with 64 founders, to calculate independently the same SNV mutation rate of


Glossary

Autozygosity: large regions of homozygous sequence inherited from a recent ancestor; also referred to as homozygosity by recent descent.

De novo mutation: a mutation observed in a child but not in his or her parents. Such mutations are assumed to have occurred in one of the parental germlines.

Haplotype phase: determination of which alleles segregate on the same physical chromosomes. For example, which alleles of nearby variants in a child occur on the chromosome inherited from his or her father.

Microsatellite: a locus comprising a simple repeat of DNA bases. The repeating unit usually comprises two, three, or four bases.

rDNA: the regions of the genome encoding ribosomal RNA. These comprise repeating units of either 2.2 kbp located on chromosome 1 or 43 kbp located on the acrocentric chromosomes.

Retrotransposon: a DNA sequence that copies itself through an mRNA intermediate and reinserts the copied sequence through reverse transcription into a new location in the genome.

Segmental duplication (SD): a segment (>1 kbp) of high sequence identity (>90%) that exists at two or more locations in a genome.

0168-9525/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.04.005

Corresponding author: Eichler, E.E. ([email protected]).
Keywords: germline mutation rate; de novo mutation; paternal bias; paternal age; genome wide.

Trends in Genetics, October 2013, Vol. 29, No. 10 575


1.2 × 10⁻⁸ [7]. A study of ten additional families of individuals affected with autism reported a rate of 1 × 10⁻⁸ [8].

In addition to the direct approaches in families, earlier studies used more indirect approaches to estimate mutation rate. Using fixed differences between the human and chimpanzee genomes (Box 1) yielded a mutation rate for SNVs of approximately 2.5 × 10⁻⁸ in pseudogenes, where selection is not a confounding factor [9,10]; this is over twofold higher than the rates estimated from direct approaches. However, more recent comparisons of the human, chimpanzee, and gorilla genomes bring the mutation rate estimates in line with what is observed in family-based analyses [11]. Another indirect approach estimated the mutation rate for SNVs to be 1.82 × 10⁻⁸ using inferred ancestry of nearby microsatellites [12] (Box 1, Figure I). The difference between this mutation rate and those calculated with family information may be due to differences in filtering applied for SNVs or in sequencing methodology.
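To make the divergence-based calculation concrete, the following sketch divides a per-site divergence by the number of generations separating the two species along both lineages. The inputs are illustrative assumptions rather than the values used in [9,10]: the 1.23% human–chimpanzee SNV divergence is the genome-wide figure quoted later in this review, a split 6 million years ago corresponds to the 12 million years of combined evolution mentioned there, and the 25-year generation time is assumed here. The sketch also ignores ancestral polymorphism.

```python
def rate_from_divergence(divergence, split_years, generation_years):
    """Per-base-pair, per-generation mutation rate implied by a pairwise
    divergence between two species (hypothetical helper).

    Mutations accumulate independently on both lineages, so the number of
    generations separating the two genomes is twice that of one lineage.
    """
    generations_per_lineage = split_years / generation_years
    return divergence / (2 * generations_per_lineage)


# Assumed inputs: 1.23% divergence, 6 million year split, 25-year generations.
mu = rate_from_divergence(0.0123, 6e6, 25)
print(f"{mu:.2e}")  # about 2.56e-08 per base pair per generation
```

With these particular inputs the result lands near the 2.5 × 10⁻⁸ figure cited above, but the estimate scales directly with the assumed generation time and inversely with the assumed divergence time, which is why uncertainty in those two quantities dominates this approach.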

Recent genome-wide studies of the SNV mutation rate in humans have started to converge (Table 1). Studies based on whole-genome sequencing and direct estimates of de novo mutations give an average SNV mutation rate of 1.16 × 10⁻⁸ mutations per base pair per generation [95% confidence interval (CI) of the mean: 1.11–1.22] in 96 total families [4–8] (Table 1). However, it is important to note that all of these studies involve substantial filtering of de novo variants to remove false positives and often exclude highly repetitive regions of the genome. Given the relevance of variants in protein-coding sequence to disease, it is also important to understand the mutation rate in exonic regions. Studies from targeted sequencing of exomes or other regions have reported higher mutation rates (1.31–2.17 × 10⁻⁸ mutations per base pair per generation) [13–16]; this apparent increase may be due to several factors, as discussed below.
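One way to see where an average of 1.16 × 10⁻⁸ over 96 families could come from is to weight each whole-genome estimate in Table 1 by its number of families. The sketch below does only that; whether the authors computed their average in exactly this way is not stated, so the weighting scheme is an assumption.

```python
# (number of families, rate in units of 1e-8) for the whole-genome rows of
# Table 1, refs [4], [5], [5], [6], [7] and [8] respectively.
whole_genome = [(1, 1.10), (1, 1.17), (1, 0.97), (78, 1.20), (5, 0.96), (10, 1.00)]

total_families = sum(n for n, _ in whole_genome)
weighted_mean = sum(n * rate for n, rate in whole_genome) / total_families

print(total_families, round(weighted_mean, 2))  # 96 1.16
```

The result is dominated by the 78 Icelandic trios, so the pooled value is close to that single study's estimate.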

CNV mutation rate
In addition to SNVs, there has been considerable effort in estimating the rates of formation of CNVs. Although CNVs are operationally defined as deletions and duplications of 50 bp or more [17], most studies have assessed de novo events only in the multi-kilobase pair range. As with SNVs, initial studies in this area focused on only a few loci. These analyses found that the locus mutation rate was higher for CNVs (2.5 × 10⁻⁶–1 × 10⁻⁴ mutations per locus per generation) compared with SNVs and that the rate varied by more than an order of magnitude between loci [18,19]; data from mice suggest that the differences in rates between loci are even larger [20]. A genome-wide analysis of large CNVs (>100 kbp) revealed a mutation rate of 1.2 × 10⁻² CNVs per generation based on approximately 400 parent–offspring trios [21]. A significantly higher mutation rate of 3.6 × 10⁻² mutations per generation was observed for

Box 1. Methods for discovering new mutations and estimating mutation rate

Most of the methods developed for estimating mutation rate were developed for SNV data, but can be applied more broadly to other forms of variation. The most common approach for estimating mutation rate is to use families to look for mutations carried by a child but not by either of his or her parents (Figure I). This approach has been carried out on selected loci up to whole genomes. However, it is important to note that this method can be confounded by false positives for which putative de novo variants are enriched [5]. In addition, somatic mutations in offspring of the sequenced families cannot be distinguished from germline de novo variants.
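The family-based screen described above reduces, at a single site, to asking whether the child carries an allele that neither parent carries. The sketch below shows only that core test; the function name and genotype encoding are hypothetical, and it deliberately omits the quality filtering that real studies need to control the false positives mentioned above.

```python
def candidate_de_novo_alleles(child, father, mother):
    """Return the alleles carried by the child but by neither parent.

    Genotypes are given as strings or tuples of alleles at one autosomal
    site, for example "AG" (hypothetical encoding for illustration).
    """
    return set(child) - set(father) - set(mother)


# Child is A/G while both parents are A/A: G is a candidate de novo allele.
print(candidate_de_novo_alleles("AG", "AA", "AA"))  # {'G'}
```

A non-empty result marks only a candidate: a sequencing error in the child, a missed heterozygous call in a parent, or a somatic mutation would all produce the same pattern.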

The other classical approach for estimating mutation rates is to look at fixed differences between species [9,10]. The mutation rate can then be calculated based on the estimated divergence time between the species (Figure I). Although this approach is not confounded by false positives or somatic mutations, there is uncertainty in the divergence time between humans and chimpanzees, the average generation time, and effective population sizes.

Recently, other approaches for determining mutation rate have been described. One group constructed a model of microsatellite evolution and applied this model to estimate the time to the most recent common ancestor (MRCA) for microsatellite alleles [12]. Because SNVs near the microsatellite have the same ancestry as the microsatellite, the mutation rate for SNVs could be calculated using the SNV differences between haplotypes and the time to the MRCA [12]. Another approach to estimating mutation rate involves the identification of heterozygous mutations in large regions of homozygosity by recent descent (autozygosity) [7,120] (Figure I). Such regions are particularly abundant among founder populations, providing a means for estimating mutation rate from a recent common ancestor in populations such as the Hutterites, the Amish, and the Icelandic population. Although different in many ways, these two approaches have some important similarities. Both are less susceptible to false positive and somatic mutations than are analyses of de novo mutations in trios. In addition, both approaches estimate the time to the MRCA for segments of the genome in different ways, but benefit by studying haplotypes with a more recent coalescent time than humans and chimpanzees.

[Figure I: panels (A), (B), and (C); labels in the figure: Human, Chimpanzee.]

Figure I. Methods of discovering new mutations and estimating mutation rate. (A) Sequence data from parent–offspring trios can be used to find mutations present in the child but not observed in either parent (red star). (B) Fixed differences between closely related species can be identified and counted; red or green stars represent mutations occurring in the lineage leading to humans and orange or yellow stars represent mutations in the lineage leading to chimpanzees. This value, in combination with the estimated number of generations between the species, can be used to calculate mutation rate. A modification of this approach can be used within species if the coalescent time of haplotypes can be estimated [12]. (C) Mutations in regions of autozygosity appear as heterozygous variants in long stretches of homozygous DNA [7,120]. With known pedigree information, the most recent common ancestor (MRCA) of the autozygous haplotype can be identified and the mutation rate calculated [7].


individuals with intellectual disability, probably because some of these de novo CNVs were influencing the development of the disorders observed in these individuals [22]. Using high-density microarrays and population genetic approaches, the rate of CNV formation was estimated to be 3 × 10⁻² for variants >500 bp [23]. However, this rate is likely a lower boundary because selection will remove deleterious mutations from the population and most large CNVs are estimated to be deleterious [21,23].

Notably, when considering the total number of mutated base pairs between SNVs and CNVs, CNVs account for the vast majority. New large CNVs (>100 kbp) are relatively rare compared with SNVs: one new large CNV per 42 births (95% Poisson CI: 23–97) [21] compared with an average 61 new SNVs per birth (95% CI of the mean: 58–64) [5–8] (Figure 1). The average number of base pairs affected by large CNVs is 8–25 kbp per gamete (16–50 kbp per birth) [21], which is larger than the average of 30.5 bp per gamete observed for SNVs (61 bp per birth; Figure 1). It is important to note that the estimates for CNVs are based on microarray data that could not be used reliably to detect smaller CNVs (<100 kbp); therefore, the mutational properties and rates of formation of these smaller variants remain unknown. Comparisons between the human and chimpanzee genomes also revealed that insertions and deletions account for close to three times the number of bases that are different compared with SNVs (3% versus 1.23%) [24]. Although caution must be exercised in the estimate of the de novo rate of CNVs, the data suggest a more than 100-fold differential between the number of base pairs affected (on average) per generation, yet only a threefold difference after 12 million years of evolution based on chimpanzee and human genome comparisons. This may reflect significant differences in the action of

[Figure 1: panel (A), y-axis "Number of mutations"; panel (B), y-axis "bp of mutations"; x-axis categories in both panels: SNVs, Indels, MEIs, Large CNVs, Aneuploidies.]

Figure 1. Comparison of the frequency and scale of different forms of genetic variation. There is an inverse relation between mutation size and frequency. Although single nucleotide variants (SNVs) occur more frequently, each mutation affects only a single base pair. By contrast, large mutations, such as copy number variants (CNVs) or chromosomal aneuploidy, are rare, yet affect thousands to millions of base pairs. In addition, although these mutations are rare, they affect more base pairs per birth on average than do SNVs. (A) Average number of mutations of each type of variant per birth. (B) Average number of mutated bases contributed by each type of variant per birth. Y-axis is log10 scaled in both (A) and (B). Abbreviation: MEI, mobile element insertion.

Table 1. Genome-wide estimates of SNV mutation rate

Type                               Number of families  μ (× 10⁻⁸)  95% CI     % Paternal  Refs
Whole genome                       1                   1.10        0.68–1.70              [4]
Whole genome                       1                   1.17        0.88–1.62  92%         [5]
Whole genome                       1                   0.97        0.67–1.34  36%         [5]
Whole genome                       78                  1.20                   76%         [6]
Whole genome                       5                   0.96        0.82–1.09  85%         [7]
Whole genome                       10a                 1.00                   74%         [8]
Targeted resequencing of 430 Mbp   570b                1.36        0.34–2.70              [13]
Whole exome                        209c                2.17                   81%         [15]
Whole exome                        238d                1.31                               [16]
Whole exome                        175c                1.50                               [14]
Indirect from microsatellites      23e                 1.82        1.40–2.28              [12]
512 Mbp of autozygosity            5                   1.20        0.89–1.43              [7]

a Families of monozygotic twins with autism.
b Half of these families have probands with autism or schizophrenia. Mutation rate is based on ‘neutral’ sites.
c Probands are affected with autism.
d Families comprise proband with autism, unaffected sibling, and parents. Mutation rate for unaffected siblings is reported here.
e Number of unrelated individuals.


selection or radical rate changes since divergence for these different classes of mutation [25].
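The per-birth comparison between SNVs and large CNVs can be checked with a few lines of arithmetic using only numbers quoted in this section. One assumption is made explicit in the code: the large-CNV rate of 1.2 × 10⁻² per generation is read as a per-gamete rate, so that a birth (two gametes) carries twice that expectation; this reading is what reproduces the quoted figure of one event per 42 births.

```python
snv_per_birth = 61                      # average new SNVs per birth
snv_bp_per_gamete = snv_per_birth / 2   # 30.5 bp per gamete

cnv_rate_per_gamete = 1.2e-2            # large CNVs per generation [21],
                                        # assumed here to be per gamete
births_per_large_cnv = 1 / (2 * cnv_rate_per_gamete)

cnv_bp_per_gamete = (8_000, 25_000)     # 8-25 kbp per gamete
fold_more_bp = tuple(bp / snv_bp_per_gamete for bp in cnv_bp_per_gamete)

# Accumulated human-chimpanzee difference: indels 3% versus SNVs 1.23%.
divergence_ratio = 3.0 / 1.23

print(round(births_per_large_cnv))           # 42
print([round(x) for x in fold_more_bp])      # [262, 820]
print(round(divergence_ratio, 2))            # 2.44
```

On these numbers, large CNVs affect several hundred times more base pairs per generation than SNVs, whereas the accumulated difference between the two genomes is only about 2.4-fold, which is the contrast the text attributes to selection or to changes in rate.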

Other classes of genetic variation
In addition to CNVs and SNVs, there are many other forms of genetic variation that arise by completely different mutational processes and, consequently, have distinct biases. The largest, of course, are aneuploidies (the duplication or deletion of an entire chromosome). Due to the severity of these mutations (the most well-studied aneuploidy is Down syndrome), most aneuploidies are lethal in utero. Studies of spontaneous abortions and embryos created with in vitro fertilization suggest that 30–60% of embryos and 0.3% of newborns have a chromosomal aneuploidy (reviewed in [26]; Figure 1). Interestingly, there are substantial differences between chromosomes in the incidence of aneuploidy; trisomies of chromosomes 16, 18, 21, and the sex chromosomes are most prevalent [27]. Chromosomal aneuploidies are thought to primarily arise during meiosis I through several mechanisms. Most simply, homologous chromosomes can fail to pair or stay paired in meiosis, potentially due to lack of recombination events [28]. However, trisomies can also arise if sister chromatids improperly segregate during meiosis I [29] (Figure 1), and it appears as though different chromosomes may be primarily affected by different mechanisms [26].

Other forms of genetic variation have been less well characterized, often due to methodological biases in their discovery leading to reduced sensitivity. The rate of small insertions and deletions or ‘indels’ has been reported as approximately 0.20 × 10⁻⁹ per site per generation for insertions and 0.53 × 10⁻⁹–0.58 × 10⁻⁹ per site per generation for deletions; this corresponds to approximately 6% of the SNV mutation rate [3,30] (Figure 1). Whole-genome sequence data from the 1000 Genomes Project suggested that each individual carries approximately one-tenth the number of indels compared with SNVs [31], but comparison of two Sanger-sequenced human genomes suggested a ratio closer to one-fifth [32]. The estimates from short-read sequencing must be considered conservative: short reads harboring indels are difficult to map, especially in the repetitive and low-complexity regions of the genome where this type of variation is enriched.
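The "approximately 6%" figure follows from adding the insertion and deletion rates and dividing by an SNV rate. In the sketch below the SNV rate is taken to be the 1.16 × 10⁻⁸ whole-genome average quoted earlier in this review; that choice of denominator is an assumption, because the text does not say which SNV rate was used for the comparison.

```python
insertion_rate = 0.20e-9             # per site per generation
deletion_rates = (0.53e-9, 0.58e-9)  # reported range, per site per generation
snv_rate = 1.16e-8                   # assumed denominator (whole-genome average)

indel_rates = tuple(insertion_rate + d for d in deletion_rates)
fraction_of_snv_rate = tuple(rate / snv_rate for rate in indel_rates)

print([f"{f:.1%}" for f in fraction_of_snv_rate])  # ['6.3%', '6.7%']
```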

In addition to indels, several recent studies have focused on the rate of MEIs. The MEI rate has been estimated to be approximately 2.5 × 10⁻² per genome per generation or 1 in 20 births (for the active retrotransposons: Alu, L1, and SVA) [33] (Figure 1). It should be noted that comparative analyses of great ape genomes have suggested that this rate has varied radically in different lineages over the past 15 million years of human–great ape evolution. Unlike that of SNVs, the rate of MEIs has been far less clocklike over the course of evolution [34]. Within the human lineage, the insertions of Alus constitute most MEI events with a rate of 2–4.6 × 10⁻² per genome per generation or approximately 1 in 20 births [33,35], whereas L1 and SVA insertions are rarer, occurring at 3–4 × 10⁻³ per genome per generation (1 per approximately 100–150 births) [33,36] and 6.5 × 10⁻⁴ per genome per generation (1 per 770 births) [33], respectively. However, these rates were primarily calculated indirectly using assumptions of the SNV mutation rate; therefore, additional studies based on direct estimates from families are warranted. Given the low frequency of such occurrences and biases in terms of their integration into AT-rich and repetitive DNA, such analyses will require very large sample sizes and deeply sequenced genomes preferably with long reads to provide a reliable estimate.
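The conversion between the quoted rates and the "1 in N births" figures can be sketched as follows. The assumption, labeled in the code, is that "per genome per generation" refers to a haploid genome (one gamete), so that each birth receives two chances; this reading reproduces the 1 in 20 and 1 per 770 figures, and gives 1 per 125–167 births for L1, of the same order as the approximately 100–150 quoted.

```python
def births_per_event(rate_per_haploid_genome):
    """Expected number of births per new insertion, assuming the quoted
    rate is per haploid genome so that a birth counts two genomes
    (an assumption of this sketch)."""
    return 1 / (2 * rate_per_haploid_genome)


rates = {
    "all active MEIs": 2.5e-2,
    "L1 (upper rate)": 4e-3,
    "L1 (lower rate)": 3e-3,
    "SVA": 6.5e-4,
}
for name, rate in rates.items():
    print(f"{name}: about 1 in {births_per_event(rate):.0f} births")
```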

Several loci in the genome are especially prone to mutation, including microsatellites [37], rDNA gene clusters [38], and segmental duplications (SDs) [39,40]. A recent genome-wide analysis of over 2000 known microsatellites in over 24 000 Icelandic trios revealed a mutation rate of 2.73 × 10⁻⁴ mutations per locus per generation for dinucleotide repeats and approximately 10 × 10⁻⁴ mutations per locus per generation for tetranucleotide repeats [12], which is similar to original projections based on population genotype data and Mendelian inconsistencies in families [37,41]. It is important to note that this rate is several orders of magnitude greater than the rate for SNVs (base for base), underscoring the fact that microsatellites are an extraordinary reservoir of new mutation. In addition, the mutation rate of individual microsatellites increases with average allele length and repeat uniformity, likely because it is easier for DNA polymerase to slip on longer, purer repeats [12,37,42,43] (reviewed in [44]; Figure 2). Interestingly, there are length constraints on di- and tetranucleotide repeats where very long alleles tend to mutate to short ones and vice versa [12]; in contrast, studies of loci associated with trinucleotide repeat disorders indicate a polarity toward increasing length, where mutability depends on the length and purity of the repeat tract (reviewed in [45]). This property, where the increasing repeat length increases the probability of new mutation, has been described as dynamic mutation in contrast to the bulk of static mutations in the human genome [46].

Although generated by a different mechanism involving nonallelic homologous recombination (NAHR; Figure 2), clusters of ribosomal RNA genes (rDNA), centromeric satellites, and SDs also show extraordinary rates of mutation. The mutation rate for rDNA is estimated to be 0.11 per gene cluster per generation, leading to an incredible diversity of rDNA alleles [38]. Centromeric satellites are also large regions of highly duplicated DNA where unequal crossover is rampant [47,48]. The mutability of these regions gives rise to large differences in chromosomal length among individuals [49]; however, the repetitive nature of these regions has made them historically difficult to study other than by Southern blot and pulsed-field gel electrophoresis [50]. There are emerging data that SDs similarly are highly dynamic regions of the genome and prone to recurrent mutation. Copy number polymorphisms (CNPs), for example, are significantly enriched in regions of SDs [51,52]; 90% of CNP genes map to SDs [53,54]. Similar to satellites and rDNA, this bias is due, in large part, to the propensity for these segments to undergo NAHR [55–57]. As a result, CNPs in SDs are less likely to be in linkage disequilibrium with nearby SNPs [58,59]. In addition, significant overlap between CNV loci in humans and nonhuman primates is likely due to recurrent mutation rather than ancestral polymorphism [60,61].

Page 27: Trends in genetics_-_october_2013

Nonrandom distribution of new mutations
Given the tendency for certain types of loci to mutate, it is not surprising that new SNV and CNV mutations are not randomly distributed. Several reported and predicted properties of new SNVs have been confirmed in recent genome-wide analyses. First, transitions outnumber transversions by twofold for de novo SNVs [4,5,30]. Second, the rate of mutation at CpG dinucleotides has been observed to be ten- to 18-fold the rate at non-CpG dinucleotides [3,6,7,30]. CpG dinucleotides are predicted to be more mutable because these are preferential sites of cytosine methylation, and spontaneous deamination of 5-methylcytosine yields thymine and, thus, creates a cytosine to thymine mutation (Figure 2). Considering that most estimates of de novo mutation rate have been based on sequencing technology that is biased against particularly GC-rich DNA [31,62], these current estimates probably represent a lower boundary.
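The two sequence properties described above can be made explicit with a small Python sketch that labels a single-base change as a transition or transversion and checks whether the affected base lies in a CpG dinucleotide. This is an illustrative helper under simple assumptions (forward-strand sequence, 0-based positions, unambiguous A/C/G/T bases); the function names and the toy sequence are hypothetical and are not taken from any of the cited studies.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("ref and alt must be two different bases from ACGT")
    # Transitions stay within purines or within pyrimidines.
    same_ring_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_ring_class else "transversion"


def is_cpg_site(sequence: str, position: int) -> bool:
    """Return True if the base at `position` (0-based) is the C or G of a CpG."""
    seq = sequence.upper()
    base = seq[position]
    if base == "C":
        return position + 1 < len(seq) and seq[position + 1] == "G"
    if base == "G":
        return position > 0 and seq[position - 1] == "C"
    return False


# Hypothetical example: a C>T change at the C of a CpG in a toy sequence.
toy_sequence = "TTACGGA"
print(substitution_class("C", "T"), is_cpg_site(toy_sequence, 3))
print(substitution_class("C", "A"), is_cpg_site(toy_sequence, 1))
```

Tallying these two labels over a set of de novo calls is one simple way to check the transition to transversion ratio and the CpG enrichment reported in the text.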

Several different properties besides GC content have been associated with variation in mutation rate, including nucleosome occupancy and DNaseI hypersensitivity, replication timing, recombination rate, transcription, and repeat content [8,63–68]. The higher mutation rates reported in or near protein-coding regions may be explained in part by the higher GC content of these regions [13,15,16] in combination with the effects of transcription-associated mutations [67]. Interestingly, a recent study of human RNA-seq data and human–macaque divergence found that a twofold increase in gene expression leads to a 15% increase in mutation due to transcription-associated mutagenesis (TAM) [67]. In addition, there is a strand asymmetry in mutations in transcribed regions of the genome where mutations induced by DNA damage (C to T, A to G, G to T, and A to T) are increased on the nontranscribed strand, likely due to exposure of single-stranded DNA during transcription [66,67,69]. The transcribed strand, by contrast, is subject to RNA polymerase stalling, leading to the recruitment of transcription-coupled repair (TCR) machinery, which corrects some mutations (reviewed in [70]). The opposing forces of TAM and TCR lead to a bias toward G and T bases on the coding strand [67,69].

Recent whole-genome sequencing studies have confirmed the nonrandomness of mutations, which has been reported as an enrichment for clustered de novo SNVs. It was recently reported that 2–3% of de novo SNVs are part of multinucleotide mutations, or mutations within 20 bp of another de novo SNV [71]. Similarly, a recent study reported an enrichment of SNVs (2% of de novo variants) within 10 kbp of each other that could not be fully explained by GC content or multinucleotide mutations [7]. Finally, other recent work [8] confirmed previous reports of large deviations in the distribution of de novo SNVs compared with what would be expected under a model of random mutation [66,72]. These studies suggest that a model of random SNV mutation is inaccurate at many different levels. With additional genome-wide mutation rate data, it should also be possible to assign local SNV mutation rates across the genome. Such biases are critical to assessing the significance of new mutations at a locus-specific level with respect to disease [73], especially as the community begins to explore the noncoding landscape.

[Figure 2 image. Panels: (A) Deamination, followed by mismatch repair or replication; (B) Slippage, followed by mismatch repair or replication; (C) Recombination between paralogs, yielding deletion and duplication; (D) Premature loss of cohesion, telophase I, meiosis II.]

Figure 2. Common mechanisms leading to biases in mutation. (A) CpG dinucleotides are the sites of cytosine methylation and frequent mutation. 5-methylcytosine can be deaminated to thymine (red). This mutation can either be repaired by mismatch repair pathways (reviewed in [121]) or be replicated to yield a cytosine to thymine mutation. (B) Indels can occur by polymerase slippage during replication if these events are not repaired by mismatch repair (reviewed in [121]), especially in regions of low complexity, such as microsatellites. Replication slippage is shown (red) on the newly synthesized strand leading to an insertion. (C) Regions flanked by highly identical segmental duplications (SDs; black boxes) are prone to nonallelic homologous recombination (NAHR). Recombination between homologous chromosomes (blue and magenta) occurs in paralogous regions, leading to duplication of genes ABC in one of the recombined chromosomes and deletion on the other. (D) Replicated homologous chromosomes are shown in black and gray. Premature loss of cohesion between sister chromatids can lead to separation of chromatids in meiosis I (black), leading to cells with only one chromatid or three chromatids. Trisomy results after meiosis II, when one gamete ends up with an extra chromatid (red).

Page 28: Trends in genetics_-_october_2013

Similar to SNVs, new CNVs are nonrandomly distributed. Long stretches of highly paralogous sequences (SDs or low copy repeats) in direct orientation predispose to NAHR, which leads to deletions and duplications of the intervening sequence [39,40] (Figure 2). NAHR accounts for a greater fraction of large CNVs and contributes little to the formation of smaller (<50 kbp) CNVs [23,74], which are thought to arise as a result of errors in replication or microhomology-mediated mutation [75–78]. Loci flanked by paralogous sequences have significantly higher rates of CNV mutation compared with loci outside of these regions [51,79], and many of the CNVs in these regions have been strongly associated with diseases, including developmental delay, autism, and epilepsy (reviewed in [80]). Within loci flanked by SDs, there are differences in the rates of CNV formation. These differences are largely due to the presence of directly oriented SDs and the size and level of sequence identity of the flanking duplications. Thus, larger and more identical duplications provide better substrates for NAHR, leading to higher rates of CNV formation [81,82] (Figure 2). Moreover, as the size of CNVs increases, so does the probability that the variants occurred de novo, reflecting the effect of strong selection against such large variants [82] (Figure 3). Interestingly, NAHR ‘hotspots’ often show structural variation in the flanking SDs that mediate the NAHR events. These structural variants lead to haplotypes that are either prone to, or protected from, recurrent deletion because of differences in the genomic architecture and content of the flanking SDs [79,83–86]. Notably, many of these ‘structural’ haplotypes occur at different frequencies among human populations, leading to differences in ethnic predilection to recurrent CNVs and disease [86,87].

Parental bias and paternal age effects
It has long been hypothesized and observed that more mutations arise in the paternal germline [2,88], and this difference is thought to be due to the larger number and continuous nature of cell divisions in spermatogenesis. Female eggs arise from a finite number of 22–33 cell divisions, whereas in males the number of cell divisions increases monotonically, with a further division every 15–16 days as a result of mitotic maintenance of the spermatogonial pool (reviewed in [89]). The dependence of SNV mutation on replication dictates an increase in mutations with advancing paternal age [88]. Whole-genome and whole-exome sequencing studies have confirmed the paternal bias for SNVs. The combined studies report that 76% (95% binomial CI = 73–80%) of new mutations arise in the paternal germline based on 497 new mutations where the parental origin has been ascertained [6–8,15]. Multiple studies have confirmed that the number of de novo mutations increases with the age of the father [6,8,15]. Yet, the data remain conflicted on the magnitude and model of this effect (Figure 4). In one study of the whole-genome sequences of two parent–offspring trios, for example, a paternal bias was observed in one trio and a maternal bias in the other [5]. If the increase in de novo mutations were solely due to the increased number of cell divisions in sperm production as a man ages, then a linear relation between paternal age and number of mutations would be expected. The data from these recent publications are not inconsistent with a linear model that estimates that the number of mutations increases by one to two mutations per year of the father’s life [6,8]. However, others have suggested that an exponential increase of approximately 3% per year may be a slightly better fit for these data [6]. Further studies with larger ranges of paternal ages (especially older fathers) are needed to resolve this issue.

An important consideration in paternal bias and age effects is the selective potential of de novo mutations on spermatogonial cells. Recent analysis has revealed that mutations in several genes [e.g., encoding fibroblast growth factor receptor 2 and 3 (FGFR2 and FGFR3), v-Ha-ras Harvey rat sarcoma viral oncogene homolog (HRAS), and tyrosine-protein phosphatase nonreceptor type 11 (PTPN11)] likely confer growth advantages to spermatogonial cells, leading to further proliferation of sperm carrying those mutations, even though mutations in these genes lead to autosomal dominant disorders at the organismal level, including Apert syndrome (FGFR2) and achondroplasia (FGFR3) [90,91]. A strong ‘paternal age effect’ has been observed for these disorders [92,93], with mutations in these genes arising at a rate exceeding linear expectation [94,95]. Mutations associated with these disorders are almost exclusively paternal (95–100%), gain-of-function missense mutations. These observations are consistent with a model of selfish spermatogonial selection, where mutations confer growth advantages to spermatogonial cells, leading to a clonal proliferation in the testis that, in turn, contributes disproportionately to the number of mutant sperm as a man ages [90,91]. These genes are likely the reason that previous studies focused on select autosomal dominant loci estimated a faster than linear increase of mutations with paternal age [94,95]. With the exception of a few loci such as these, the available data are consistent with a linear increase of mutations with advancing paternal age [6,8], primarily as a result of increased cell division and replication errors.

[Figure 3 image. Axes: minimum size of call (Mbp) on the x-axis; number of CNVs on the left-hand y-axis; de novo proportion on the right-hand y-axis.]

Figure 3. Larger copy number variants (CNVs) are more likely to be de novo. Size distributions of CNVs from over 15 000 children with developmental delay are plotted. Inherited CNVs are in black and de novo CNVs are in red, with the number of CNVs on the left-hand y-axis. The proportion of CNVs that are de novo is plotted in blue with the de novo proportion on the right-hand y-axis. Reproduced from [82].

Page 29: Trends in genetics_-_october_2013

In addition to SNVs, other forms of genetic variation have been assessed for parental origin and association with increased parental age. Similar to SNVs, a strong paternal bias has also been reported for mutations at microsatellites, with a paternal to maternal ratio of 3.3:1. Once again, the number of microsatellite mutations increases linearly with paternal age [12]. Parental origin has also been assessed for structural variation, albeit limited to children with developmental delay where parental data were available. A paternal bias has been observed for large chromosomal rearrangements visible by microscopy, including deletions, duplications, and translocations [96]. Similarly, CNVs (>150 kbp) also have been reported to have a paternal bias, with 90 out of 118 de novo CNVs arising on the paternal haplotype (76%; binomial 95% CI = 69–84%) [22]. This result is driven primarily by mechanisms other than NAHR; for NAHR-mediated events, no significant difference is found in the number of events between paternal and maternal origin. Similar to SNVs, the number of non-NAHR CNVs increased with paternal age [22]. So far, the only exception to the rule of paternal origin for new mutations and increase with paternal age is chromosomal aneuploidy, including Down syndrome (trisomy of chromosome 21), where most mutations originate in the maternal germline and the risk of aneuploidy increases exponentially with maternal age (reviewed extensively in [26,27]).
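The binomial interval quoted for the CNV counts can be approximated with a few lines of Python. The sketch below uses the normal-approximation (Wald) interval; this choice is an assumption, because the text does not say which interval method was used, and the function name is hypothetical.

```python
import math


def wald_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Counts quoted in the text: 90 of 118 de novo CNVs were paternal [22].
lower, upper = wald_interval(90, 118)
print(f"paternal fraction = {90 / 118:.0%}, 95% CI = {lower:.0%}-{upper:.0%}")
```

For 90 of 118 events, this approximation gives bounds that round to 69% and 84%. Other interval methods (for example, Wilson or exact intervals) would give slightly different bounds.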

New mutations, selection, and human disease
There has been much recent interest in identifying de novo mutations that play a role in the development of human disease; knowledge of the patterns of human mutation is critical to the interpretation of these studies. Some broad themes are beginning to emerge. First, it is clear that deleterious de novo mutations contribute significantly to human disease and probably have played a more important role in all diseases than previously anticipated as a result of the super-exponential increase in the human population over the past 5000 years [97–99]. Exome sequencing revealed an increase in the number of de novo loss-of-function SNVs in individuals with autism [15,16,100] and schizophrenia [101]. The story is similar for CNVs, where individuals with neurocognitive diseases show an increase in de novo CNVs [79,102–104]. Interestingly, individuals with autism in families with multiple affected individuals also show an increased number of de novo CNVs compared with their siblings, even though the multiplex nature of these families would suggest a primarily inherited model of disease [21].

Given the data that de novo SNVs contribute to disease in combination with an increase in mutation rate with paternal age, there has been considerable discussion regarding the effect of paternal age on disease [105]. However, it is important to consider the potential magnitude of this effect, which is likely to be modest. Even if there are two new mutations per year of paternal age or a doubling of mutations every 16.5 years [6], most of these new mutations will be neutral and not contribute to disease. This expectation is consistent with epidemiological data that suggest a modest, albeit significant, increase in prevalence of disease in children of older fathers: there is a twofold increase in the relative risk of a child developing autism when the father is over 55 years of age compared with a father less than 29 years of age [106]. The notable exceptions are diseases caused by mutations in spermatogonial selection genes, where the effect of paternal age is considerably stronger [91].
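The linear and exponential descriptions mentioned above can be compared numerically. The Python sketch below converts a doubling time into the annual growth it implies and tabulates both models across paternal ages. The reference age and the baseline mutation count are assumptions introduced purely for illustration (they do not come from the text), so only the shape of the comparison is meaningful, not the absolute numbers.

```python
DOUBLING_TIME_YEARS = 16.5     # exponential model quoted in the text [6]
LINEAR_SLOPE_PER_YEAR = 2.0    # linear model quoted in the text [6]

# ASSUMPTIONS for illustration only: a reference paternal age and the number
# of de novo SNVs at that age. Neither value comes from the text.
REFERENCE_AGE = 20.0
MUTATIONS_AT_REFERENCE = 25.0


def annual_growth_rate(doubling_time: float) -> float:
    """Fractional increase per year implied by a given doubling time."""
    return 2.0 ** (1.0 / doubling_time) - 1.0


def linear_model(age: float) -> float:
    """Mutation count that rises by a fixed number per year of paternal age."""
    return MUTATIONS_AT_REFERENCE + LINEAR_SLOPE_PER_YEAR * (age - REFERENCE_AGE)


def exponential_model(age: float) -> float:
    """Mutation count that doubles every DOUBLING_TIME_YEARS."""
    return MUTATIONS_AT_REFERENCE * 2.0 ** ((age - REFERENCE_AGE) / DOUBLING_TIME_YEARS)


print(f"implied growth: {annual_growth_rate(DOUBLING_TIME_YEARS):.1%} per year")
for age in (20, 30, 40, 50):
    print(age, round(linear_model(age), 1), round(exponential_model(age), 1))
```

With these assumed starting values the two curves stay close across typical paternal ages and separate mainly at the oldest ages, which illustrates why data from older fathers are the most informative for choosing between the models.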

Inferring dates of human evolution
The increasing number of direct analyses in human families has led to discussion aimed at reconciling these new rate estimates with our knowledge of important dates in human evolution. This stems from the fact that the mutation rates calculated directly in human families are approximately half of those calculated based on sequence divergence and the fossil record [107,108]. As a result of these updated mutation rates, generation times in the great ape lineages may be longer than previously thought [107]. Taken together, this pushes divergence times further back, and these dates are more in line with the fossil record in some cases but seem implausible in others (see [107,108] for a detailed discussion). However, if mutation rates calculated from whole-genome sequencing of human families represent a lower boundary as discussed above, then rates from direct and indirect approaches would be more concordant and the lengthening of divergence times would be overestimated. Moreover, there is also considerable uncertainty in terms of the effect of paternal age with respect to ancestral populations, and this may account for some of the difference between direct and indirect estimates of mutation rate. Adding to the complexity, there is good evidence that mutation rates have not remained constant over evolutionary time, with a slowdown in hominids, likely a consequence of generation time [9,109]. Outside of humans, there are few genome-wide data on the extent of this slowdown, even among closely related species.

[Figure 4 image. Axes: paternal age on the x-axis; number of mutations (genome) on the left-hand y-axis; number of mutations (exome) on the right-hand y-axis. Line labels: 2.01 mutations per y; 1.02 mutations per y; 0.04 exonic mutations per y.]

Figure 4. Relation between paternal age and de novo mutations. Current fitted models are shown of the increase in single nucleotide variant (SNV) mutations with paternal age from whole-exome and whole-genome sequencing of parent–offspring trios. There is some difference between the studies with regard to the magnitude of this effect, but sample sizes were relatively low and more studies, especially with older fathers, are needed to achieve a more precise estimate. The paternal age is on the x-axis, the left-hand y-axis shows the number of mutations per genome per birth and the right-hand y-axis shows the number of mutations per exome per birth. Exome data from 189 trios yielded an increase of 0.04 exonic mutations per year of paternal age (broken green line) [15]; the smaller number of mutations compared with the whole-genome studies is consistent with the smaller target (protein-coding exons). Whole-genome data from 78 trios yielded an increase of 2.01 mutations per year (blue) [6]. Whole-genome data from ten families yielded an increase of 1.02 mutations per year (red) [8].

Page 30: Trends in genetics_-_october_2013

Concluding remarks
Over the past few years, genomic technologies have made it possible to obtain direct knowledge concerning rates of human mutation. Recent studies are converging on similar SNV mutation rates, quantifying the male mutation bias and its relation with paternal age. The current rate estimate for SNVs likely represents a lower boundary because of biases in next-generation sequencing technology [31,62] and the stringent filtering required to remove false positive calls. In addition, we have gained new insight into the mutational properties of large CNVs, their regional biases within the genome, and their genomic impact. However, our understanding of the properties of human mutation is far from complete. Many studies have focused on identifying de novo mutations in individuals with disease, and this may introduce biases in our understanding of the natural processes of mutation. Large studies of individuals from relatively healthy families will provide valuable insight into the general patterns of mutation. It also remains unclear how mutation rate increases with paternal age and how many genes are subject to spermatogonial selection. Many of the recent de novo mutations associated with autism have been found in genes potentially important in cell growth and chromatin modification; it is possible that mutations in these genes also confer a growth advantage in the testis. One approach may be to sequence more families with many children or children born to particularly old fathers. It will also be important to sequence DNA from multigeneration families to understand what fraction of new mutations discovered specifically in the blood are transmitted to the next generation. In light of the importance of new mutations in understanding evolution, efforts to sequence genomes from nonhuman primate families should be a high priority to understand how the rate has changed in different lineages. Although they are discussed only briefly here, short indels and smaller CNVs, especially those mapping within SDs, still lack reliable estimates of mutation rate and complexity. One promising approach would be to use sequencing of large-insert clones to phase long haplotypes fully [110], which would allow parental origin to be determined for all de novo mutations and enable better interpretation of indels. Understanding the mutation rate of SDs and centromeric satellite sequences will likely require single-molecule sequencing with very long reads (>50 kbp) [111,112] and accurate de novo assembly.

Although we are beginning to understand the pattern of germline mutation, somatic mutation processes are largely unknown outside of cancer studies. Somatic mutations, however, have the potential to contribute to diseases other than cancer and may be subject to different mutational biases as a result of differences in repair and replication between meiotic and mitotic tissues (reviewed in [113,114]). Such mutations can be identified as genetic differences either between tissues from the same donor or between monozygotic twins. Given the proportion of the somatic mutation compared with the germline alleles in a population of cells or a tissue sample and with some assumptions, one can currently estimate approximately when in development the mutation occurred [114,115]. There is compelling evidence that somatic structural variants accumulate with age, likely as a result of an increasing number of replication copy errors [116]. The continued development of single-cell whole-genome sequencing technologies will revolutionize this area of research. It has already enabled analysis of somatic mutations in tumor samples [117] and embryos [118], as well as haplotype phasing of individual cells [119]. Its application to sperm and egg will enable the calculation of the true germline mutation rate and provide data on effects of positive and negative selection of mutations within germ cells. Such technologies coupled with advances in genome sequencing will ultimately allow scientists to generate ontogenic maps of mutation tracking the origin and fate of somatic mutations during the development of organisms.
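As a rough illustration of how an allele fraction might be turned into a developmental timing estimate, the sketch below uses a deliberately simple toy model. It assumes (these are assumptions, not the methods of [114,115]) that the sampled tissue derives from a single cell by symmetric divisions with no cell death or selection, that the site is diploid, and that the mutation is heterozygous in the cell where it arose. The function name is hypothetical.

```python
import math


def divisions_before_mutation(variant_allele_fraction: float) -> float:
    """Estimate how many rounds of symmetric division preceded the mutation.

    Toy model: after g rounds there are 2**g cells; if one of them acquires a
    heterozygous mutation, the expected allele fraction in the final tissue is
    0.5 / 2**g. Solving for g gives log2(0.5 / VAF).
    """
    if not 0.0 < variant_allele_fraction <= 0.5:
        raise ValueError("allele fraction must be in (0, 0.5] under this model")
    return math.log2(0.5 / variant_allele_fraction)


for vaf in (0.5, 0.25, 0.05, 0.01):
    print(f"VAF {vaf}: ~{divisions_before_mutation(vaf):.1f} divisions")
```

Under this model an allele fraction of 0.5 is indistinguishable from a germline variant, and lower fractions point to progressively later events. Real tissues violate the assumptions in several ways (unequal lineage contributions, selection, sampling noise), so such an estimate is approximate at best.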

Acknowledgments
We thank Santhosh Girirajan and Bradley Coe for sharing data and figures. We are grateful to Andrew Wilkie, Anne Goriely, and Peter Sudmant for helpful discussions and to Tonia Brown for assistance with manuscript preparation. We would like to thank Jacob Michaelson and Jonathan Sebat for sharing a prepublication version of their manuscript. C.D.C. was supported by a Ruth L. Kirschstein National Research Service Award (NRSA; F32HG006070). E.E.E. is an Investigator of the Howard Hughes Medical Institute.

References
1 Haldane, J.B.S. (1935) The rate of spontaneous mutation of a human gene. J. Genet. 31, 317–326

2 Haldane, J.B.S. (1947) The mutation rate of the gene for haemophilia, and its segregation ratios in males and females. Ann. Eugen. 13, 262–271

3 Kondrashov, A.S. (2003) Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat. 21, 12–27

4 Roach, J.C. et al. (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639

5 Conrad, D.F. et al. (2011) Variation in genome-wide mutation rates within and between human families. Nat. Genet. 43, 712–714

Page 31: Trends in genetics_-_october_2013

6 Kong, A. et al. (2012) Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488, 471–475

7 Campbell, C.D. et al. (2012) Estimating the human mutation rate using autozygosity in a founder population. Nat. Genet. 44, 1277–1281

8 Michaelson, J.J. et al. (2012) Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442

9 Li, W.H. and Tanimura, M. (1987) The molecular clock runs more slowly in man than in apes and monkeys. Nature 326, 93–96

10 Nachman, M.W. and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304

11 Scally, A. et al. (2012) Insights into hominid evolution from the gorilla genome sequence. Nature 483, 169–175

12 Sun, J.X. et al. (2012) A direct characterization of human mutation based on microsatellites. Nat. Genet. 44, 1161–1165

13 Awadalla, P. et al. (2010) Direct measure of the de novo mutation rate in autism and schizophrenia cohorts. Am. J. Hum. Genet. 87, 316–324

14 Neale, B.M. et al. (2012) Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245

15 O’Roak, B.J. et al. (2012) Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250

16 Sanders, S.J. et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241

17 Scherer, S.W. et al. (2007) Challenges and standards in integrating surveys of structural variation. Nat. Genet. 39, S7–S15

18 Lupski, J.R. (2007) Genomic rearrangements and sporadic disease. Nat. Genet. 39, S43–S47

19 Turner, D.J. et al. (2008) Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat. Genet. 40, 90–95

20 Egan, C.M. et al. (2007) Recurrent DNA copy number variation in the laboratory mouse. Nat. Genet. 39, 1384–1389

21 Itsara, A. et al. (2010) De novo rates and selection of large copy number variation. Genome Res. 20, 1469–1481

22 Hehir-Kwa, J.Y. et al. (2011) De novo copy number variants associated with intellectual disability have a paternal origin and age bias. J. Med. Genet. 48, 776–778

23 Conrad, D.F. et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712

24 Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87

25 Marques-Bonet, T. et al. (2009) A burst of segmental duplications in the genome of the African great ape ancestor. Nature 457, 877–881

26 Nagaoka, S.I. et al. (2012) Human aneuploidy: mechanisms and new insights into an age-old problem. Nat. Rev. Genet. 13, 493–504

27 Hassold, T. and Hunt, P. (2001) To err (meiotically) is human: the genesis of human aneuploidy. Nat. Rev. Genet. 2, 280–291

28 Henderson, S.A. and Edwards, R.G. (1968) Chiasma frequency and maternal age in mammals. Nature 218, 22–28

29 Angell, R.R. (1991) Predivision in human oocytes at meiosis I: a mechanism for trisomy formation in man. Hum. Genet. 86, 383–387

30 Lynch, M. (2010) Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. U.S.A. 107, 961–968

31 The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65

32 Chen, J.Q. et al. (2009) Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol. Biol. Evol. 26, 1523–1531

33 Stewart, C. et al. (2011) A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 7, e1002236

34 Locke, D.P. et al. (2011) Comparative and demographic analysis of orang-utan genomes. Nature 469, 529–533

35 Cordaux, R. et al. (2006) Estimating the retrotransposition rate of human Alu elements. Gene 373, 134–137

36 Ray, D.A. and Batzer, M.A. (2011) Reading TE leaves: new approaches to the identification of transposable element insertions. Genome Res. 21, 813–820

37 Weber, J.L. and Wong, C. (1993) Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123–1128

38 Stults, D.M. et al. (2008) Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Res. 18, 13–18

39 Lupski, J.R. (1998) Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet. 14, 417–422

40 Bailey, J.A. et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007

41 Whittaker, J.C. et al. (2003) Likelihood-based estimation of microsatellite mutation rates. Genetics 164, 781–787

42 Eichler, E.E. et al. (1994) Length of uninterrupted CGG repeats determines instability in the FMR1 gene. Nat. Genet. 8, 88–94

43 Ballantyne, K.N. et al. (2010) Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications. Am. J. Hum. Genet. 87, 341–353

44 Ellegren, H. (2004) Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5, 435–445

45 McMurray, C.T. (2010) Mechanisms of trinucleotide repeat instability during human development. Nat. Rev. Genet. 11, 786–799

46 Richards, R.I. and Sutherland, G.R. (1997) Dynamic mutation: possible mechanisms and significance in human disease. Trends Biochem. Sci. 22, 432–436

47 Waye, J.S. and Willard, H.F. (1986) Structure, organization, and sequence of alpha satellite DNA from human chromosome 17: evidence for evolution by unequal crossing-over and an ancestral pentamer repeat shared with the human X chromosome. Mol. Cell. Biol. 6, 3156–3165

48 Alkan, C. et al. (2004) The role of unequal crossover in alpha-satellite DNA evolution: a computational analysis. J. Comput. Biol. 11, 933–944

49 Mahtani, M.M. and Willard, H.F. (1990) Pulsed-field gel analysis of alpha-satellite DNA at the human X chromosome centromere: high-frequency polymorphisms and array size estimate. Genomics 7, 607–613

50 Warburton, P.E. and Willard, H.F. (1990) Genomic analysis of sequence variation in tandemly repeated DNA. Evidence for localized homogeneous sequence domains within arrays of alpha-satellite DNA. J. Mol. Biol. 216, 3–16

51 Sharp, A.J. et al. (2005) Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 77, 78–88

52 Redon, R. et al. (2006) Global variation in copy number in the human genome. Nature 444, 444–454

53 Bailey, J.A. and Eichler, E.E. (2006) Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet. 7, 552–564

54 Bailey, J.A. et al. (2008) Human copy number polymorphic genes. Cytogenet. Genome Res. 123, 234–243

55 Conrad, D.F. et al. (2010) Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat. Genet. 42, 385–391

56 Kidd, J.M. et al. (2010) A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847

57 Mills, R.E. et al. (2011) Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65

58 Locke, D.P. et al. (2006) Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet. 79, 275–290

59 Campbell, C.D. et al. (2011) Population-genetic properties of differentiated human copy-number polymorphisms. Am. J. Hum. Genet. 88, 317–332

60 Perry, G.H. et al. (2006) Hotspots for copy number variation in chimpanzees and humans. Proc. Natl. Acad. Sci. U.S.A. 103, 8006–8011

61 Lee, A.S. et al. (2008) Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum. Mol. Genet. 17, 1127–1136

62 Bentley, D.R. et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59

63 Stamatoyannopoulos, J.A. et al. (2009) Human mutation rate associated with DNA replication timing. Nat. Genet. 41, 393–395

64 Ying, H. et al. (2010) Evidence that localized variation in primate sequence divergence arises from an influence of nucleosome placement on DNA repair. Mol. Biol. Evol. 27, 637–649

65 Chen, C.L. et al. (2010) Impact of replication timing on non-CpG and CpG substitution rates in mammalian genomes. Genome Res. 20, 447–457

Page 32: Trends in genetics_-_october_2013

66 Hodgkinson, A. and Eyre-Walker, A. (2011) Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766

67 Park, C. et al. (2012) Genomic evidence for elevated mutation rates in highly expressed genes. EMBO Rep. 13, 1123–1129

68 Koren, A. et al. (2012) Differential relationship of DNA replication timing to different forms of human mutation and variation. Am. J. Hum. Genet. 91, 1033–1040

69 Green, P. et al. (2003) Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33, 514–517

70 Hanawalt, P.C. and Spivak, G. (2008) Transcription-coupled DNA repair: two decades of progress and surprises. Nat. Rev. Mol. Cell Biol. 9, 958–970

71 Schrider, D.R. et al. (2011) Pervasive multinucleotide mutational events in eukaryotes. Curr. Biol. 21, 1051–1054

72 Matassi, G. et al. (1999) Chromosomal location effects on gene sequence evolution in mammals. Curr. Biol. 9, 786–791

73 O’Roak, B.J. et al. (2012) Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science 338, 1619–1622

74 Kidd, J.M. et al. (2010) Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat. Methods 7, 365–371

75 Smith, C.E. et al. (2007) Template switching during break-induced replication. Nature 447, 102–105

76 Lee, J.A. et al. (2007) A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235–1247

77 Payen, C. et al. (2008) Segmental duplications arise from Pol32-dependent repair of broken forks through two alternative replication-based mechanisms. PLoS Genet. 4, e1000175

78 Hastings, P.J. et al. (2009) Mechanisms of change in gene copy number. Nat. Rev. Genet. 10, 551–564

79 Sharp, A.J. et al. (2006) Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042

80 Mefford, H.C. and Eichler, E.E. (2009) Duplication hotspots, rare genomic disorders, and common disease. Curr. Opin. Genet. Dev. 19, 196–204

81 Liu, P. et al. (2011) Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am. J. Hum. Genet. 89, 580–588

82 Cooper, G.M. et al. (2011) A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846

83 Osborne, L.R. et al. (2001) A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat. Genet. 29, 321–325

84 Koolen, D.A. et al. (2006) A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat. Genet. 38, 999–1001

85 Zody, M.C. et al. (2008) Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat. Genet. 40, 1076–1108

86 Antonacci, F. et al. (2010) A large, complex structural polymorphism at 16p12.1 underlies microdeletion disease risk. Nat. Genet. 42, 745–750

87 Steinberg, K.M. et al. (2012) Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat. Genet. 44, 872–880

88 Crow, J.F. (2000) The origins, patterns and implications of human spontaneous mutation. Nat. Rev. Genet. 1, 40–47

89 Hurst, L.D. and Ellegren, H. (1998) Sex biases in the mutation rate. Trends Genet. 14, 446–452

90 Goriely, A. et al. (2003) Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science 301, 643–646

91 Goriely, A. and Wilkie, A.O. (2012) Paternal age effect mutations and selfish spermatogonial selection: causes and consequences for human disease. Am. J. Hum. Genet. 90, 175–200

92 Cohen, M.M., Jr et al. (1992) Birth prevalence study of the Apert syndrome. Am. J. Med. Genet. 42, 655–659

93 Orioli, I.M. et al. (1995) Effect of paternal age in achondroplasia, thanatophoric dysplasia, and osteogenesis imperfecta. Am. J. Med. Genet. 59, 209–217

94 Risch, N. et al. (1987) Spontaneous mutation and parental age in humans. Am. J. Hum. Genet. 41, 218–248

95 Crow, J.F. (1997) The high spontaneous mutation rate: is it a health risk? Proc. Natl. Acad. Sci. U.S.A. 94, 8380–8386

96 Thomas, N.S. et al. (2006) Parental and chromosomal origin of unbalanced de novo structural chromosome abnormalities in man. Hum. Genet. 119, 444–450

97 Nelson, M.R. et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104

98 Keinan, A. and Clark, A.G. (2012) Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336, 740–743

99 Tennessen, J.A. et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69

100 Iossifov, I. et al. (2012) De novo gene disruptions in children on the autistic spectrum. Neuron 74, 285–299

101 Xu, B. et al. (2011) Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat. Genet. 43, 864–868

102 de Vries, B.B. et al. (2005) Diagnostic genome profiling in mental retardation. Am. J. Hum. Genet. 77, 606–616

103 Sebat, J. et al. (2007) Strong association of de novo copy number mutations with autism. Science 316, 445–449

104 Walsh, T. et al. (2008) Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543

105 Kondrashov, A. (2012) Genetics: The rate of human mutation. Nature 488, 467–468

106 Hultman, C.M. et al. (2011) Advancing paternal age and risk of autism: new evidence from a population-based study and a meta-analysis of epidemiological studies. Mol. Psychiatry 16, 1203–1212

107 Langergraber, K.E. et al. (2012) Generation times in wild chimpanzees and gorillas suggest earlier divergence times in great ape and human evolution. Proc. Natl. Acad. Sci. U.S.A. 109, 15716–15721

108 Scally, A. and Durbin, R. (2012) Revising the human mutation rate: implications for understanding human evolution. Nat. Rev. Genet. 13, 745–753

109 Elango, N. et al. (2006) Variable molecular clocks in hominoids. Proc. Natl. Acad. Sci. U.S.A. 103, 1370–1375

110 Kitzman, J.O. et al. (2011) Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29, 59–63

111 Branton, D. et al. (2008) The potential and challenges of nanopore sequencing. Nat. Biotechnol. 26, 1146–1153

112 Eid, J. et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138

113 Erickson, R.P. (2010) Somatic gene mutation and human disease other than cancer: an update. Mutat. Res. 705, 96–106

114 Frank, S.A. (2010) Evolution in health and medicine Sackler colloquium: Somatic evolutionary genomics: mutations during development cause highly variable genetic mosaicism with risk of cancer and neurodegeneration. Proc. Natl. Acad. Sci. U.S.A. 107 (Suppl. 1), 1725–1730

115 Abyzov, A. et al. (2012) Somatic copy number mosaicism in human skin revealed by induced pluripotent stem cells. Nature http://dx.doi.org/10.1038/nature11629

116 Forsberg, L.A. et al. (2012) Age-related somatic structural changes in the nuclear genome of human blood cells. Am. J. Hum. Genet. 90, 217–228

117 Navin, N. et al. (2011) Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94

118 Voet, T. et al. (2011) Breakage-fusion-bridge cycles leading to inv dup del occur in human cleavage stage embryos. Hum. Mutat. 32, 783–793

119 Fan, H.C. et al. (2011) Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29, 51–57

120 Alkuraya, F.S. (2010) Autozygome decoded. Genet. Med. 12, 765–771

121 Li, G.M. (2008) Mechanisms and functions of DNA mismatch repair. Cell Res. 18, 85–98


Many ways to die, one way to arrive: how selection acts through pregnancy

Elizabeth A. Brown1, Maryellen Ruvolo1, and Pardis C. Sabeti2,3,4

1 Department of Human Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA

2 Center for Systems Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA

3 Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA

4 Department of Immunology and Infectious Diseases, Harvard School of Public Health, Boston, MA 02115, USA

When considering selective forces shaping human evolution, the importance of pregnancy to fitness should not be underestimated. Although specific mortality factors may only impact upon a fraction of the population, birth is a funnel through which all individuals must pass. Human pregnancy places exceptional energetic, physical, and immunological demands on the mother to accommodate the needs of the fetus, making the woman more vulnerable during this time-period. Here, we examine how metabolic imbalances, infectious diseases, oxygen deficiency, and nutrient levels in pregnancy can exert selective pressures on women and their unborn offspring. Numerous candidate genes under selection are being revealed by next-generation sequencing, providing the opportunity to study further the relationship between selection and pregnancy. This relationship is important to consider to gain insight into recent human adaptations to unique diets and environments worldwide.

Selection and pregnancy

Some of the earliest records of mortality from London in John Graunt’s ‘Bills of Mortality’ for 1632 reveal several distinct causes of death in the population [1]. Any specific cause of death affects only a fraction of the population, lessening the importance of each particular factor for fitness (Figure 1). Managing to be born, however, is a universal requirement for fitness. Thus, factors that influence fecundity and pregnancy are likely to shape human evolution strongly.

The many physiological compromises of pregnancy make it a tremendous challenge for both mothers and infants, and a potential selective force. To provide for the growing fetus, mothers increase blood sugar [2], blood volume, and hemoglobin count [3]; remodel uterine arteries [4]; and decrease vascular resistance [5]. These changes put the mother at risk of diabetes, high blood pressure, strokes, hemorrhaging, and seizures [2,6–8]. Moreover, properties of the immune system are downregulated to prevent immune response to the ‘foreign’ fetus, potentially contributing to the greater susceptibility of pregnant women to infectious disease [9].

These difficulties for mothers also translate into problems for infants: pre-industrial data show that nearly a quarter of babies died during labor and infancy, whereas maternal mortality was nearly 1.5% per birth due to infectious diseases, diabetes, eclampsia, and jaundice [10]. Similarly, modern foraging populations and sub-Saharan African nations in 1970 also had infant mortality rates of 20–25%, in contrast to Norway, for example, at only ~1.6% [11,12]. Maternal mortality in sub-Saharan Africa was ~1.0% in the year 2000 (comparable to 16th and 17th century England) with hemorrhage, hypertension (preeclampsia/eclampsia), and infectious diseases as the major causes. By contrast, maternal mortality in Northern Europe was only 0.02% in the year 2000 [13]. These data from historic, foraging, and developing country populations only serve as rough proxies for the conditions facing humans during recent evolution, but they give some indication of the difficulty of pregnancy experienced by pre-modern foraging and Neolithic populations.

In addition to the challenges of pregnancy, the number of babies a woman births, compounded across generations, can have huge evolutionary impact. For example, landless Finnish women living 1760–1849 had an average of 4.27 babies, whereas landowning women had an average of 4.55 babies: a change in absolute fitness of this magnitude would cause a geometric rise in the number of descendants in a few generations [14] (Figure 2A). The nutritional benefits of the Industrial Revolution (ca 1880) boosted average Finnish fertility to 5.3 babies [15]. Any such increase in fertility from either environmental or genetic factors will dramatically increase the fitness of women (Figure 2B). An earlier revolution, the development of agriculture and pastoralism, may have conferred similar fertility benefits, especially to women with genetic mutations allowing them to exploit these new resources maximally – lactase persistence, described below, may be an example of this [16]. Furthermore, changes in female fertility could have played an important role during human population migrations. For example, a large study of Quebecois settlers indicated that women on the wavefront of territory expansion had a 15–20% fertility advantage, with a heritable component for fertility, suggesting that genes influencing fertility may be shaped by selection [17].
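The compounding described above can be made concrete with a minimal sketch. It is not the authors' calculation: it assumes that half of all babies are daughters and that every daughter survives to reproduce at her mother's average fertility, neither of which is stated in the text. The function name `female_descendants` and the `daughter_fraction` parameter are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the source): half of all
# babies are daughters, and every daughter survives and reproduces at the
# same average fertility as her mother.
def female_descendants(babies_per_woman, generations, daughter_fraction=0.5):
    """Expected female descendants of one woman in generations 0..generations."""
    daughters_per_woman = babies_per_woman * daughter_fraction
    return [daughters_per_woman ** g for g in range(generations + 1)]


# Average fertility figures quoted in the text for Finnish women.
groups = {
    "1760-1849 landless (4.27 babies)": 4.27,
    "1760-1849 landowning (4.55 babies)": 4.55,
    "1880 fertility boost (5.3 babies)": 5.3,
}

for label, fertility in groups.items():
    lineage = female_descendants(fertility, generations=10)
    print(f"{label}: {lineage[-1]:,.0f} female descendants in generation 10")
```

Because the per-generation multiplier is raised to the power of the generation count, a small difference in average fertility widens into a large difference in lineage size within ten generations, which is the geometric effect the paragraph describes.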

Review

0168-9525/$ – see front matter

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.03.001

Corresponding author: Sabeti, P.C. ([email protected], [email protected]).

Keywords: selection; pregnancy; human evolution; gestational diabetes; preeclampsia.

Trends in Genetics, October 2013, Vol. 29, No. 10 585


Considering the impact of female fertility alongside the challenges of pregnancy may be critical for understanding recent human adaptations. This review explores how selection may have acted through pressures on mothers and infants during pregnancy given the changing environment, diet, and behavior of the past 10 000 years. These factors are critical to bear in mind as opportunities for evolutionary geneticists to generate new adaptive hypotheses proliferate, fueled by next-generation sequencing data and new statistical tools for predicting adaptive variants in diverse populations.

Metabolic disorders and selection during pregnancy

Theories of human adaptation surrounding metabolic disorders, such as hypertension and type 2 diabetes, are constrained by the fact that these diseases typically strike at post-reproductive ages. The related disorders of gestational diabetes mellitus (GDM) and preeclampsia (hypertension in pregnancy), however, occur precisely during the critical reproductive period of pregnancy. GDM occurs as the maternal blood glucose level rises to nourish the fetus, increasing the risk of maternal diabetes [18]. Preeclampsia occurs as a mother increases blood volume and remodels vasculature for fetal ventilation, raising the risk of maternal hypertension [6]. Women predisposed for these conditions can be pushed into metabolic dysfunction.

GDM and preeclampsia are common diseases, with grave consequences in pregnancy, and thus may strongly impact upon reproductive fitness. GDM affects 4–20% of pregnancies in different populations worldwide [19]. It can cause macrosomia, in which the fetus grows too large to fit through the maternal pelvis [20–23]. Before the advent of caesarian sections (C-sections), GDM could lead to fetal morbidity and mortality, and maternal hemorrhage and tearing during delivery [7,20]. Preeclampsia is the leading cause of maternal mortality worldwide, accounting for 10–19% of deaths [24–26]. It can cause fetal hypoxia and oxidative stress, low birthweight, and maternal hemorrhage and seizures

[Figure 1: two pie charts. (A) Reported causes of death in London, 1632; categories: infant mortality, tuberculosis, fever, poxviruses, teeth, edema, diarrhea, other infections, violent deaths, aged over 60, convulsion, other, unclear, stillborn, childbed. (B) Ten leading causes of death in USA, 2009; categories: chronic respiratory diseases, flu and pneumonia, kidney disease, accidents, suicide, heart disease, cancer, Alzheimer’s, diabetes, stroke, other.]

Figure 1. Multiple varied causes of death in modern and historic populations. (A) Many different factors caused death for individuals who died in London in 1632 [1]. ‘Childbed’ referred to mothers who died during or after labor, often due to infections. Over a quarter of deaths occurred in infants and unborn fetuses. (B) By contrast, the leading causes of death in modern, developed countries, such as the USA in 2009, are very different, with heart disease and cancer accounting for fully half of the deaths [93].

[Figure 2: two line plots. (A) Fertility in Finnish women: number of daughters (thousands, 0–16) against generations (0–10), with curves for the 1880 fertility boost (5.3 babies), 1760–1849 landowning women (4.55 babies), and 1760–1849 landless women (4.27 babies). (B) Selection corresponding to differences in fertility: allele frequency (0–1) against generations (0–800), with curves for s = 0.126, s = 0.082, and s = 0.033.]

Figure 2. Rapid change in prevalence of fertility-enhancing traits. (A) The increase in number of female descendants (y axis in thousands), compounded across generations, for maternal lineages with an average of 5.3, 4.55, or 4.27 babies over a lifetime, based on pre-industrial data on differences in female fertility in Finland [14,15]. (B) The increase in frequency of new mutations conferring fertility advantages that correspond to the differences in fertility for the three groups of Finnish women (selection coefficient s = 0.126 for 5.3 vs 4.27 babies; s = 0.082 for 5.3 vs 4.55 babies; s = 0.033 for 4.55 vs 4.27 babies). This demonstrates how readily any mutation with a positive impact on female reproduction will sweep through a population over a very short time due to the compounding effect across generations.
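The kind of allele-frequency trajectory shown in Figure 2B can be sketched with a standard deterministic recursion. This is an illustration under stated assumptions, not a reconstruction of the authors' model: it assumes haploid (genic) selection in which carriers have relative fitness 1 + s, it ignores genetic drift, and the starting frequency `P0` is a hypothetical value that the source does not give. The function names are likewise hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the source): deterministic
# haploid selection with carrier fitness 1 + s, no drift, and a hypothetical
# starting frequency P0.
def allele_trajectory(s, p0, generations):
    """Allele frequency in each generation under p' = p(1 + s) / (1 + s*p)."""
    freqs = [p0]
    for _ in range(generations):
        p = freqs[-1]
        freqs.append(p * (1 + s) / (1 + s * p))
    return freqs


def generations_to_reach(s, p0, target):
    """First generation at which the allele frequency is at least `target`."""
    if not (s > 0 and 0 < p0 < 1 and 0 < target < 1):
        raise ValueError("requires s > 0 and frequencies strictly between 0 and 1")
    p, generation = p0, 0
    while p < target:
        p = p * (1 + s) / (1 + s * p)
        generation += 1
    return generation


P0 = 0.001  # hypothetical starting frequency; not stated in the source

# Selection coefficients quoted in the Figure 2 caption.
for s in (0.126, 0.082, 0.033):
    g = generations_to_reach(s, P0, target=0.5)
    print(f"s = {s}: frequency reaches 50% after {g} generations")
```

Under this recursion the odds p/(1 − p) are multiplied by (1 + s) every generation, so a larger selection coefficient shortens the time to high frequency roughly in proportion to 1/ln(1 + s).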


(eclampsia) if not treated by premature delivery [24] (see Box 1 for a discussion of high-altitude adaptation and the risks of preeclampsia).

The rates of GDM and preeclampsia vary significantly in different populations, even when controlling for environmental factors such as obesity [27,28]. This raises the possibility that selective pressures during pregnancy have fine-tuned metabolism to suit different environments and diets around the world, resulting in the current distribution of disease prevalence. By contrast, alternative explanations, discussed in Box 2, may also account for these patterns – distinguishing between these competing hypotheses is an important avenue for future research.

Intriguingly, the incidence of GDM in modern populations is inversely related to traditional consumption of dietary components known to increase risk for diabetes and GDM (Table 1). These include high glycemic carbohydrates, which produce large glucose responses in the blood, and dairy products, which produce large insulin responses due to the effect of whey proteins [29–33]. Europeans have the lowest prevalence of GDM in the world – 3.6% in a study of over a million births in New York City (NYC) [19] – but have the longest history of high glycemic diets. In the past 10 000 years European grain-based agriculture increased carbohydrate consumption to roughly 70% of diet, whereas hunter-gatherers consume only 3–50% [34]. In the past 8000 years Europeans also began consuming dairy products in large quantities [35]. By comparison, South Central Asians had a much higher incidence of GDM in the NYC cohort (14.3%), with Bangladeshis the highest at 21.2% [19]. Traditionally, Bangladeshis have had high consumption of fish, a low glycemic food; rice, of moderate glycemic index due to little processing; and no dairy [36,37]. Finally, among African-Americans, the incidence of GDM was intermediate at 4.3% [19]. This is consistent with their admixed ancestry and the mixed consumption of dairy across populations in West Africa, the origin of most US African-Americans.

Given the inverse correlation between traditional consumption of dietary components increasing GDM risk and current incidence of GDM, high glycemic foods and dairy may have acted as selective agents on metabolism during pregnancy. Because GDM is very likely to have a genetic basis – 67% of the risk of type 2 diabetes for adults younger than 60 is heritable [38], and women with GDM have a 7- to 12-fold elevated risk for type 2 diabetes [39,40] – natural selection can act on its underlying risk factors. Therefore, any population environmentally at risk for GDM without access to C-sections should experience selection against genetic risk factors for GDM. Conversely, any population without access to high glycemic food items should experience selection to make blood sugars more available to the fetus, perhaps through increasing insulin resistance by increasing the frequency of genetic risk factors for GDM. Supporting these predictions, evidence suggests Europeans may have a blunted glycemic response to food

Box 1. Oxygen and selection during pregnancy

Another environmental pressure detrimental to pregnant women is high-altitude hypoxia. When brought to high altitudes, people from sea-level populations increase hemoglobin levels to carry more oxygen to the tissues. With long-term exposure and old age, increased hemoglobin causes altitude sickness and even death. However, pregnant women experience a special danger: preeclampsia caused by oxygen-restriction for the fetus. As described in the main text, preeclampsia often results in premature labor, small birthweight babies, and hemorrhaging, seizures, and death for the mother [24].

Tibetans, Andeans, and the Ethiopian Amhara have each adapted to hypoxic high-altitude conditions possibly due to its impact on pregnancy. In these populations, strong signatures of selection surround genetic loci related to hypoxia and hemoglobin concentration, including EGLN1, EPAS1, PPARA, THRB, and ARNT2 [94–97]. However, Andeans are still at risk for altitude sickness in old age because they exhibit the same elevated hemoglobin levels of lowlanders at high altitudes, indicating that selection for post-reproductive survival was not the primary force in this population [98]. Even so, some studies find that Andeans and Tibetans giving birth at high altitudes have fewer instances of low fetal birthweights and preeclampsia than do lowlanders at high altitudes, possibly due to increased uterine capillary density [99–101]. Also, some genes under selection among the Amhara are involved in fetal hemoglobin levels (BCL11A) and angiogenesis (AIMP1 and VAV3), an important feature of pregnancy [94]. These pieces of evidence indicate that pressures during pregnancy may have been significant in adapting to high-altitude hypoxia for Tibetans, Andeans, and the Amhara.

Box 2. Alternative hypotheses and avenues of research

Although the evidence described in the main text supports the importance of pregnancy to recent selection in humans, alternative hypotheses could also explain some phenomena that we argue suggest selection in pregnancy. Take, for example, the differences in GDM prevalence across populations, and the inverse correlation with historical glycemic intake. When mothers born in energy-poor environments emigrate to energy-rich environments, fetal programming may contribute to the pattern because these women have heightened risks of GDM and type 2 diabetes [102]. Maternal epigenetic modifications could be the mechanism underlying this programming to suit the early life environment. Another contributor could be the differences in patterns of adipose storage across populations – Asian women tend to have more central adiposity than women in other populations, and this is thought to increase insulin resistance [103]. However, this proximate cause of increased GDM among Asians is not at odds with a history of natural selection acting on the trait.

Distinguishing among these competing explanations for the patterns we see could be a fruitful line of research. For example, first, one could conduct association studies in diverse ethnic populations to identify genetic loci linked to GDM risk. Second, these loci associated with GDM could be analyzed for signatures of recent selection to test whether selection has influenced GDM incidence across populations. Finally, one could test whether incidence of GDM among immigrants approaches that of the rest of the population across generations. GDM is reduced for South Asians born in the USA compared to first generation immigrants, but it is still elevated above the level of European-Americans [19], indicating fetal programming may explain a large fraction of differences in GDM risk, but is probably not the only factor.

Similar approaches could be used to test hypotheses of selection for resistance to preeclampsia, infectious disease, hypoxia and other reproductive factors. In a broad sense, this will require a better understanding of the axes of human variation – genetic and phenotypic. Next-generation sequencing data from diverse populations of humans will contribute to this understanding. However, the phenotypic data are equally critical. We need a clearer understanding of the susceptibility of pregnant women to infectious diseases and metabolic diseases across populations, and how this is mediated by nutritional status, UV irradiation, hypoxia, and other external factors. Testing these hypotheses will be important both for evolutionary genetics and for improving care for human health across diverse ethnicities.


compared to other populations, which could be a result of this selection on maternal metabolism to suit diet [41,42].

Similarly to GDM, preeclampsia has an incidence that varies across populations, and it appears to have an inverse relationship with the dietary risk factor of salt intake (Table 1) [43]. In a study of preeclampsia in NYC, preeclampsia rates were lower among immigrants from East Asia (1.4%), especially Japan (1.2%) and Taiwan (0.9%), and lowest in the world among Iranians (0.6%) [44], compared to an incidence of 3–5% of pregnancies in other developed countries [24]. Although these populations are less obese than Americans, Japanese and Iranians have historically high salt intakes due to consumption of coastal foods (Japan) and high soil salinity (Iran) [45–47].

High salt-consuming populations, such as Japanese and Iranians, may have experienced strong selection to protect them from the deadly threat of preeclampsia. The heritability of preeclampsia is 0.55 according to a study on a Swedish cohort [48], which provides variation for selection to act upon. Populations consuming large amounts of salt should experience strong selection against genetic risk factors for preeclampsia in the absence of modern medical support for premature deliveries. Supporting this, insensitivity to salt in the diet is common in Japanese: women consuming the most salt (20.6 g/day) have no more hypertension than those consuming the least (8 g/day) [49]. By comparison, the WHO recommends less than 5 g/day of salt consumption for adults [50].

Adaptation for consuming a high glycemic, high dairy diet may have been the result of selection in Europeans through the pressure of GDM, whereas adaptation for consuming a high salt diet may have evolved in Japanese and Iranians through the selective pressure of preeclampsia. By contrast, alternative hypotheses may also explain the trends described (see Box 2). In the past several thousand years, populations migrated to new environments and invented new methods of food extraction and processing, such as agriculture, pastoralism, and fishing. The hypotheses presented here focus on how selective pressures during pregnancy may cause strong selection in response to changing diets in recent human evolution.

Nutrients and selection during pregnancy

Access to nutrients has been critical in human evolution, contingent upon dietary resources and the physiological processes that determine the bioavailability of ingested nutrients. Two selective pressures in humans that changed the amount and bioavailability of nutrients in the diet were exposure to solar UV radiation and adult milk-drinking. The ways in which these impacted upon fecundity and pregnancy may explain why UV radiation and milk-drinking exerted such strong fitness effects.

Skin pigmentation closely correlates with UV radiation worldwide [51], perhaps partly because UV radiation exerted strong selection across populations during pregnancy in addition to other stages of life. Lighter or darker pigmentation impacts upon the absorption of UV radiation and thereby on folate and vitamin D3, critical micronutrients during pregnancy [51,52]. Folate – obtained from eating plants – is stored in cutaneous blood vessels and can be destroyed by UV radiation [53]. Folate deficiency causes failure of neural tubes to close during fetal development, resulting in anencephalus and spina bifida, defects lethal to the fetus [54]. Neural tube defects rarely occur in darkly pigmented people because their melanin protects their folate stores in equatorial areas [51]. Therefore, increased melanin production among equatorial populations of Africa, as well as of Asia, Australia, and the Pacific where populations migrated, was potentially selected to protect folate stores in the skin during pregnancy.

By contrast, melanin in the skin also blocks synthesis of vitamin D3 at higher latitudes [55]. Vitamin D3 enables absorption of calcium for skeletal formation in the fetus and maintenance in the mother [56]. Deficiencies cause malformation of the maternal pelvis, maternal osteoporosis, and rickets in fetuses and growing children [57,58]. In addition, vitamin D3 may assist development of the fetal innate immune system and critical organs [59,60]. Therefore, balancing the synthesis of vitamin D3 with protection of folate stores for pregnancy probably played a role in the strong selection for graded melanation with UV-radiation clines worldwide [51,52].

Signatures of strong selection have been found surrounding genes with variants associated with skin pigmentation

Table 1. Relationship between metabolic diseases of pregnancy and traditional diets

GDM incidence, glycemic index, and dairy consumption

| Population | GDM incidence | Diet | Dairy | Agriculture | Glycemic index | Refs |
|---|---|---|---|---|---|---|
| European-Americans | 3.6% (a) | 70% carbohydrate; grain-based | Yes | Yes | High | [19,34,35] |
| Hunter-gatherers | ? | 3–50% carbohydrate; game, tubers, vegetables, fruits, nuts, etc. | No | No | Moderate | [34] |
| Bangladeshis | 7–9% (b); 21.2% (a) | Rice, fish | No | Yes | Moderate | [19,36,37] |
| African-Americans | 4.3% (a) | Agriculture, pastoralism, or hunter-gatherer | Mixed | Mixed | Moderate | [19] |

Preeclampsia incidence and traditional salt consumption

| Population | Preeclampsia incidence | Salt consumption | Obesity | Refs |
|---|---|---|---|---|
| European-Americans | 2% (a) | ? | High | [44,45] |
| Sub-Saharans | 3.3–3.9% (a) | Low, especially in rainforests | Low | [44,45] |
| African-Americans | 4.6% (a) | Low, mixed ancestry | High | [44,45] |
| Iranians | 0.6% (a) | High, due to soil salinity | Medium | [44–46] |
| Japanese | 1.2% (a) | High, due to seafood | Medium | [44,45,47] |

(a) Incidence for populations living in New York City.

(b) Incidence for populations living in Bangladesh.


in diverse populations – notably SLC24A5, MATP, and TYR in Europeans, DCT, EGFR, and DRD2 in East Asians, and TYRP1, KITLG, ASIP, and OCA2 in both populations [61–64]. In addition, ancestral alleles of these genes that tend to be associated with darker pigmentation, and that occur at a higher frequency in Africans, also tend to be highly frequent in darkly pigmented Melanesian populations. This may indicate convergent selection on the same genetic variants in diverse populations [61], although many populations remain to be tested.

Alternatively, UV radiation may have selected for appropriate skin pigmentation at other life stages such as childhood. Some detrimental effects of UV radiation on skin, such as skin cancer, occur post-reproductively, mitigating their importance to fitness [52,65]. However, sunburn alone causes significant morbidity for lightly pigmented people living in high UV regions because it damages the skin, increasing infection and water loss, and decreasing thermoregulatory control. Furthermore, although vitamin D3 is critical for pregnancy, it is also important for bone density, immune function, and other effects in childhood and throughout life. To address this, one piece of evidence indicating that pregnancy, specifically, may have been important to selection on skin pigmentation is that women exhibit slightly lower levels of skin pigmentation on low-exposure patches of skin than do men, across world populations, indicating that the need for vitamin D3 may have been more critical for women than men [51]. Research clarifying the importance of vitamin D3 status to human health at different life-stages could shed more light on this hypothesis.

Likewise, the ability to drink milk among pastoralists who keep dairy animals may also have been driven by selection on reproductive fitness. These pastoralists experienced strong selection in the past 10 000 years to continue digesting the lactose found in milk into adulthood, instead of losing this ability shortly after birth as occurs in most mammals [66]. Strong selection has been detected for a number of different genetic polymorphisms in diverse pastoralist populations from Europe, Africa, the Middle East, and Central Asia, each associated with regulation of the expression of LCT, which encodes lactase, the enzyme responsible for cleaving lactose, the disaccharide in milk [35,67–70]. Researchers have been surprised by the strength of this selection and have struggled to develop plausible explanations for it. Milk from animals provides an extra source of sugar, protein, fat, calcium, and hydration, beneficial not only for survival but also for reproduction.

Several possible hypotheses could link milk to reproductive fitness. First, milk from animals provided a sterile source of hydration, especially for those living in hot, arid climates such as Africa and the Middle East [66]. Considering the sensitivity of pregnant women to contaminated food and drink [71], pregnant women able to drink sterile fresh milk may have experienced special fitness benefits. Second, the extra calcium in milk could be beneficial due to its role in skeletal development and maintenance and to female reproductive maturation because large pelvises are required for vaginal delivery [72]. Third, because fat is more calorie-dense than proteins and carbohydrates, fat from milk could help the mother nourish her infant during pregnancy and lactation. Fat stores and energy balance have also been linked to age of menarche and length of anovulatory period post-pregnancy [73,74].

A final hypothesis involves the fact that milk and other animal fats contain cholesterols used to synthesize reproductive hormones, critical for fecundity and early fetal development and growth [75]. The grain-based diets of Neolithic farmers were lower in cholesterol than the diets of hunter-gatherer ancestors who consumed more wild game [34]. Less cholesterol in the diet correlates with lower levels of reproductive steroids [76], reducing ovarian function and fecundity, suggesting that milk drinking could have provided a much-needed cholesterol and fertility boost for Neolithic Europeans. Therefore, the increase in fat, cholesterol, and calcium from drinking milk may have accelerated female skeletal maturation, increased caloric resources, and increased fecundity among women who could consume dairy, creating strong fitness benefits.

Infectious disease and selection during pregnancy

Infectious diseases have exerted some of the strongest forces of selection on humans, most notably since the increase in population densities following the transition to agriculture and pastoralism 10 000 years ago. For example, genetic variants conferring resistance to malaria, such as alleles in the regions of HBB, HBA, FY, CD36, G6PD, were strongly selected among African populations and others where malaria is endemic [77]. Though infectious diseases are threats to survival generally, their differential impact on infants and pregnant women makes them especially powerful selective agents.

During pregnancy the maternal immune system is suppressed so that the mother does not launch an adaptive immune response to the foreign cellular antigens of the fetus [9]. Although details are still being clarified, this suppression may make pregnant women less able to clear infections requiring strong inflammatory responses [9]. The outcome is that pregnant women experience spontaneous abortion and have higher morbidity and mortality in response to many infections than the general population [9].

Malaria, influenza, and cholera are three infectious diseases that pose severe risks for pregnancy. In particular, African Plasmodium falciparum can infect the placenta [9]. As a result, pregnant women with malaria die two- to threefold more often than the general infected population [78]. In sub-Saharan Africa malaria causes 20% of the cases of low infant birthweight, together with slow growth, spontaneous abortion, maternal anemia, and infant mortality [9,78,79]. Intriguingly, positive selection on a genetic variant of the gene FLT1, which reduces spontaneous abortions in cases of placental malaria, has been found for a malaria-endemic population in Tanzania [80]. This indicates that, in the case of malaria resistance, selection mediated by pregnant women and their fetuses alone is sufficient for adaptive change in allele frequency in a population. Based upon this evidence, although genetic variants conferring general resistance to malaria experienced positive selection that could have been mediated by a broader subset of the population, pregnant women likely comprised an important portion of this selection.


During the 1918 influenza pandemic ~50% of all infected pregnant women contracted pneumonia and ~50% of this subset died (~27% total mortality for infected pregnant women), far more than the ~1% mortality for all individuals of reproductive age with influenza [81,82]. Together with fetal abortion, this caused a 5–15% drop in birth rate the following spring [83]. This pattern is typical of other influenza pandemics [84]. Mortality by influenza is heritable [85], and therefore resistance to influenza may have been strongly selected for in recent human evolution, although this has been understudied.

Cholera causes diarrhea, vomiting, dehydration, and cramping, which can induce spontaneous abortion, preterm small-birthweight babies, and maternal death [86]. Similarly to influenza, smallpox, and dysentery, cholera decreases birthrates significantly during epidemic years [10,87], indicating it has strong potential as a selective agent in humans.

Many other infectious diseases are particularly dangerous for pregnant women. Among female Lassa fever patients of childbearing years admitted to a hospital in Sierra Leone, death was significantly higher for pregnant women (25%) than non-pregnant women (13%) [88]. Tellingly, symptoms improved with delivery [88]. The Ebola virus killed more pregnant patients (95.5%) than the population average (77%) during an outbreak in the Democratic Republic of the Congo [89]. Some infectious agents, for example the parasite Toxoplasma gondii, cause disease only in pregnant women, who are likely to experience abortion [9]. Evidence from mice suggests that another parasite, Leishmania, also exploits immunological changes in pregnant women [90]. Finally, Varicella zoster, the chickenpox virus, causes pregnant women to develop more skin lesions and pneumonia at higher rates than the average adult with chickenpox [91].

Pregnant women are clearly especially vulnerable to infectious disease. Although many of these diseases also cause significant morbidity in non-pregnant adults, the dramatic impact on pregnant women makes it likely that selective effects would have been strongly mediated by this population, though the adaptive benefit of genetic resistance to infectious disease is felt across all life-stages for both males and females. As researchers discover functional genetic variants in areas under selection in the human genome, we predict that many are likely to confer resistance to infectious diseases that severely impact upon pregnant women who lack resistance in addition to causing high infant mortality.

Concluding remarks

The field of human evolutionary genomics is in a period of transition. Currently, only a few examples of selection in response to environmental pressures felt by particular populations have been elucidated – such as malaria resistance and lactase persistence. These examples were already under study before the development of evolutionary genomics, and the signatures of selection surrounding the genetic variants under selection merely served to substantiate strong adaptive hypotheses already presented. However, next-generation sequencing data, collected from diverse populations, now provide the raw material to detect many more strong candidates for selection. Thus, the field of evolutionary genomics now has the potential to provide many new testable hypotheses of selection, which were not developed a priori. For example, a catalog of candidate variants for selection was recently published, and one of these variants was experimentally characterized [92].

At this turning point in the field we seek to underscore that many aspects of human evolution are best understood by investigating the life-history bottleneck of pregnancy and birth from the perspective of both the mother and the infant. During pregnancy, nutritional, energetic, physical, and immunological requirements are constrained in the mother to support the fetus, concentrating selective forces upon the mother at a sensitive life-stage. The pressures that have been most important in recent human evolution – infectious diseases from high population densities, adult dairy consumption from pastoralism, grain consumption from agriculture, and changes in UV radiation and oxygen levels from moving to extreme latitudes and altitudes – have left genetic signatures of their selective impact. Although these selective factors may be felt across the lifespan, nowhere are they more serious than during infancy and pregnancy. We should thus remain cognizant of these phases of life because next-generation sequencing now provides evolutionary genomicists with the data to generate many new testable hypotheses of why particular loci are under selection in humans.

Acknowledgments

We thank Katie Hinde for comments on the manuscript and helpful discussions. We also thank the Packard Foundation for their support.

References

1 Graunt, J. (1662) Natural and Political Observations Mentioned in a Following Index, and Made Upon the Bills of Mortality, Royal Society of London

2 Butte, N.F. (2000) Carbohydrate and lipid metabolism in pregnancy: normal compared with gestational diabetes mellitus. Am. J. Clin. Nutr. 71, 1256S–1261S

3 Pritchard, J. (1965) Changes in the blood volume during pregnancy and delivery. Anesthesiology 26, 393–399

4 Kaufmann, P. et al. (2004) Aspects of human fetoplacental vasculogenesis and angiogenesis. II. Changes during normal pregnancy. Placenta 25, 114–126

5 Sladek, S.M. et al. (1997) Nitric oxide and pregnancy. Am. J. Physiol. 272, R441–R463

6 Hermida, R.C. et al. (2000) Blood pressure patterns in normal pregnancy, gestational hypertension, and preeclampsia. Hypertension 36, 149–158

7 Jolly, M.C. et al. (2003) Risk factors for macrosomia and its clinical consequences: a study of 350,311 pregnancies. Eur. J. Obstet. Gynecol. Reprod. Biol. 111, 9–14

8 James, A.H. et al. (2005) Incidence and risk factors for stroke in pregnancy and the puerperium. Obstet. Gynecol. 106, 509–516

9 Robinson, D.P. and Klein, S.L. (2012) Pregnancy and pregnancy-associated hormones alter immune responses and disease pathogenesis. Horm. Behav. 62, 263–271

10 Woods, R. (2009) Death before Birth, Oxford University Press

11 Marlowe, F.W. (2005) Hunter-gatherers and human evolution. Evol. Anthropol. 14, 54–67

12 Rajaratnam, J.K. et al. (2010) Neonatal, postneonatal, childhood, and under-5 mortality for 187 countries, 1970–2010: a systematic analysis of progress towards Millennium Development Goal 4. Lancet 375, 1988–2008

13 Ronsmans, C. and Graham, W.J. (2006) Maternal mortality: who, when, where, and why. Lancet 368, 1189–1200


14 Courtiol, A. et al. (2012) Natural and sexual selection in a monogamous historical human population. Proc. Natl. Acad. Sci. U.S.A. 109, 8044–8049

15 Liu, J. et al. (2012) Maternal risk of breeding failure remained low throughout the demographic transitions in fertility and age at first reproduction in Finland. PLoS ONE 7, e34898

16 Laland, K.N. et al. (2010) How culture shaped the human genome: bringing genetics and the human sciences together. Nat. Rev. Genet. 11, 137–148

17 Moreau, C. et al. (2011) Deep human genealogies reveal a selective advantage to be on an expanding wave front. Science 334, 1148–1150

18 Barbour, L.A. et al. (2007) Cellular mechanisms for insulin resistance in normal pregnancy and gestational diabetes. Diabetes Care 30 (Suppl. 2), S112–S119

19 Savitz, D.A. et al. (2008) Ethnicity and gestational diabetes in New York City, 1995–2003. BJOG 115, 969–978

20 Langer, O. et al. (2005) Gestational diabetes: the consequences of not treating. Am. J. Obstet. Gynecol. 192, 989–997

21 Sermer, M. et al. (1998) The Toronto Tri-Hospital Gestational Diabetes Project. A preliminary review. Diabetes Care 21 (Suppl. 2), B33–B42

22 Rosenberg, K. and Trevathan, W. (2002) Birth, obstetrics and human evolution. BJOG 109, 1199–1206

23 Dunsworth, H.M. et al. (2012) Metabolic hypothesis for human altriciality. Proc. Natl. Acad. Sci. U.S.A. 109, 15212–15216

24 WHO (2005) World Health Report: Make Every Mother and Child Count, World Health Organization

25 Moodley, J. (2008) Maternal deaths due to hypertensive disorders in pregnancy. Best Pract. Res. Clin. Obstet. Gynaecol. 22, 559–567

26 Duley, L. (1992) Maternal mortality associated with hypertensive disorders of pregnancy in Africa, Asia, Latin America and the Caribbean. Br. J. Obstet. Gynaecol. 99, 547–553

27 Hunsberger, M. et al. (2010) Racial/ethnic disparities in gestational diabetes mellitus: findings from a population-based survey. Womens Health Issues 20, 323–328

28 Caughey, A.B. et al. (2010) Maternal and paternal race/ethnicity are both associated with gestational diabetes. Am. J. Obstet. Gynecol. 202, 616.e1–5

29 Holt, S. et al. (1997) An insulin index of foods: the insulin demand generated by 1000-kJ portions of common foods. Am. J. Clin. Nutr. 66, 1264–1276

30 Hoyt, G. et al. (2007) Dissociation of the glycaemic and insulinaemic responses to whole and skimmed milk. Br. J. Nutr. 93, 175

31 Zhang, C. et al. (2006) Dietary fiber intake, dietary glycemic load, and the risk for gestational diabetes mellitus. Diabetes Care 29, 2223–2230

32 Zhang, C. and Ning, Y. (2011) Effect of dietary and lifestyle factors on the risk of gestational diabetes: review of epidemiologic evidence. Am. J. Clin. Nutr. 94, 1975S–1979S

33 Hoppe, C. et al. (2005) High intakes of milk, but not meat, increase s-insulin and insulin resistance in 8-year-old boys. Eur. J. Clin. Nutr. 59, 393–398

34 Strohle, A. and Hahn, A. (2011) Diets of modern hunter-gatherers vary substantially in their carbohydrate content depending on ecoenvironments: results from an ethnographic analysis. Nutr. Res. 31, 429–435

35 Myles, S. et al. (2005) Genetic evidence in support of a shared Eurasian–North African dairying origin. Hum. Genet. 117, 34–42

36 Itan, Y. et al. (2010) A worldwide correlation of lactase persistence phenotype and genotypes. BMC Evol. Biol. 10, 36–47

37 Atkinson, F.S. et al. (2008) International tables of glycemic index and glycemic load values: 2008. Diabetes Care 31, 2281–2283

38 Almgren, P. et al. (2011) Heritability and familiality of type 2 diabetes and related quantitative traits in the Botnia Study. Diabetologia 54, 2811–2819

39 Metzger, B.E. et al. (2007) Summary and recommendations of the Fifth International Workshop–Conference on Gestational Diabetes Mellitus. Diabetes Care 30 (Suppl. 2), S251–S260

40 Bellamy, L. et al. (2009) Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. Lancet 373, 1773–1779

41 Dickinson, S. et al. (2002) Postprandial hyperglycemia and insulin sensitivity differ among lean young adults of different ethnicities. J. Nutr. 2574–2579

42 Henry, C.J.K. et al. (2008) Glycaemic index of common foods tested in the UK and India. Br. J. Nutr. 99, 840–845

43 Reyes, L. et al. (2012) Nutritional status among women with pre-eclampsia and healthy pregnant and non-pregnant women in a Latin American country. J. Obstet. Gynaecol. Res. 38, 498–504

44 Gong, J. et al. (2012) Maternal ethnicity and pre-eclampsia in New York City, 1995–2003. Paediatr. Perinat. Epidemiol. 26, 45–52

45 Intersalt Cooperative Research Group (1988) Intersalt: an international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion. BMJ 297, 319–328

46 FAO/IIASA/ISRIC/ISS-CAS/JRC (2012) Harmonized World Soil Database, Food and Agriculture Organization of the United Nations and International Institute for Applied Systems Analysis (Version 1.2)

47 Brown, I.J. et al. (2009) Salt intakes around the world: implications for public health. Int. J. Epidemiol. 38, 791–813

48 Cnattingius, S. et al. (2004) Maternal and fetal genetic factors account for most of familial aggregation of preeclampsia: a population-based Swedish cohort study. Am. J. Med. Genet. 130A, 365–371

49 Miura, K. et al. (2010) Dietary salt intake and blood pressure in a representative Japanese Population: baseline analyses of NIPPON DATA80. J. Epidemiol. 20, S524–S530

50 WHO (2010) Global Status Report on Non-Communicable Diseases 2010, World Health Organization

51 Jablonski, N.G. and Chaplin, G. (2000) The evolution of human skin coloration. J. Hum. Evol. 39, 57–106

52 Jablonski, N.G. and Chaplin, G. (2010) Colloquium paper: human skin pigmentation as an adaptation to UV radiation. Proc. Natl. Acad. Sci. U.S.A. 107 (Suppl. 2), 8962–8968

53 Steindal, A.H. et al. (2008) 5-Methyltetrahydrofolate is photosensitive in the presence of riboflavin. Photochem. Photobiol. Sci. 7, 814

54 Fleming, A. and Copp, A.J. (1998) Embryonic folate metabolism and mouse neural tube defects. Science 280, 2107–2109

55 Holick, M.F. (1987) Photosynthesis of vitamin D in the skin: effect of environmental and life-style variables. Fed. Proc. 46, 1876–1882

56 Brunvand, L. et al. (1996) Vitamin D deficiency and fetal growth. Early Hum. Dev. 45, 27–33

57 Fogelman, Y. et al. (1995) High prevalence of vitamin D deficiency among Ethiopian women immigrants to Israel: exacerbation during pregnancy and lactation. Isr. J. Med. Sci. 31, 221–224

58 Henderson, J.B. et al. (1987) The importance of limited exposure to ultraviolet radiation and dietary factors in the aetiology of Asian rickets: a risk-factor model. Q. J. Med. 63, 413–425

59 Norman, A.W. (2008) From vitamin D to hormone D: fundamentals of the vitamin D endocrine system essential for good health. Am. J. Clin. Nutr. 88, 491S–499S

60 Holick, M.F. (2004) Vitamin D: importance in the prevention of cancers, type 1 diabetes, heart disease, and osteoporosis. Am. J. Clin. Nutr. 79, 362–371

61 Lao, O. et al. (2007) Signatures of positive selection in genes associated with human skin pigmentation as revealed from analyses of single nucleotide polymorphisms. Ann. Hum. Genet. 71, 354–369

62 Norton, H.L. et al. (2007) Genetic evidence for the convergent evolution of light skin in Europeans and East Asians. Mol. Biol. Evol. 24, 710–722

63 Alonso, S. et al. (2008) Complex signatures of selection for the melanogenic loci TYR, TYRP1 and DCT in humans. BMC Evol. Biol. 8, 74

64 Quillen, E.E. et al. (2012) OPRM1 and EGFR contribute to skin pigmentation differences between Indigenous Americans and Europeans. Hum. Genet. 131, 1073–1080

65 Blum, H. (1961) Does the melanin pigment of human skin have adaptive value? An essay in human ecology and the evolution of race. Q. Rev. Biol. 36, 50–63

66 Ingram, C.J.E. et al. (2009) Lactose digestion and the evolutionary genetics of lactase persistence. Hum. Genet. 124, 579–591

67 Tishkoff, S.A. et al. (2007) Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39, 31–40

68 Enattah, N.S. et al. (2008) Independent introduction of two lactase-persistence alleles into human populations reflects different history of adaptation to milk culture. J. Hum. Genet. 82, 57–72

69 Peng, M-S. et al. (2012) Lactase persistence may have an independent origin in Tibetan populations from Tibet, China. J. Hum. Genet. 57, 394–397


70 Heyer, E. et al. (2011) Lactase persistence in central Asia: phenotype, genotype, and evolution. Hum. Biol. 83, 379–392

71 Pouillot, R. et al. (2012) Relative risk of listeriosis in Foodborne Diseases Active Surveillance Network (FoodNet) sites according to age, pregnancy, and ethnicity. Clin. Infect. Dis. 54 (Suppl. 5), S405–S410

72 Ellison, P.T. (1990) Human ovarian function and reproductive ecology: new hypotheses. Am. Anthropol. 92, 933–952

73 Frisch, R.E. (1984) Body fat, puberty and fertility. Biol. Rev. Camb. Philos. Soc. 59, 161–188

74 Panter-Brick, C. et al. (1993) Seasonality of reproductive function and weight loss in rural Nepali women. Hum. Reprod. 8, 684–690

75 Herrera, E. (2002) Lipid metabolism in pregnancy and its consequences in the fetus and newborn. Endocrine 19, 43–55

76 Goldin, B.R. et al. (1982) Estrogen excretion patterns and plasma levels in vegetarian and omnivorous women. N. Engl. J. Med. 307, 1542–1547

77 Campino, S. et al. (2006) Mendelian and complex genetics of susceptibility and resistance to parasitic infections. Semin. Immunol. 18, 411–422

78 Shulman, C. (2003) Importance and prevention of malaria in pregnancy. Trans. R. Soc. Trop. Med. Hyg. 97, 30–35

79 Steketee, R.W. et al. (2001) The burden of malaria in pregnancy in malaria-endemic areas. Am. J. Trop. Med. Hyg. 64, 28–35

80 Muehlenbachs, A. et al. (2008) Natural selection of FLT1 alleles and their association with malaria resistance in utero. Proc. Natl. Acad. Sci. U.S.A. 105, 14488–14491

81 Harris, J. (1919) Influenza occurring in pregnant women. A statistical study of thirteen hundred and fifty cases. J. Am. Med. Assoc. 72, 978–980

82 Taubenberger, J.K. and Morens, D.M. (2006) 1918 Influenza: the mother of all pandemics. Emerg. Infect. Dis. 12, 15–22

83 Bloom-Feshbach, K. et al. (2011) Natality decline and miscarriages associated with the 1918 influenza pandemic: the Scandinavian and United States experiences. J. Infect. Dis. 204, 1157–1164

84 Pazos, M. et al. (2012) The influence of pregnancy on systemic immunity. Immunol. Res. 54, 254–261

85 Horby, P. et al. (2012) The role of host genetics in susceptibility to influenza: a systematic review. PLoS ONE 7, e33180

86 Carrera, J. (ed.) (2007) Recommendations and Guidelines for Perinatal Medicine, Matres Mundi International

87 Hotelling, H. and Hotelling, F. (1931) Causes of birth rate fluctuations. J. Am. Stat. Assoc. 26, 135–149

88 Price, M.E. et al. (1988) A prospective study of maternal and fetal outcome in acute Lassa fever infection during pregnancy. BMJ 297, 584–587

89 Mupapa, K. et al. (1999) Ebola hemorrhagic fever and pregnancy. J. Infect. Dis. 179 (Suppl. 1), S11–S12

90 Roberts, C. et al. (2001) Sex-associated hormones and immunity to protozoan parasites. Clin. Microbiol. Rev. 14, 476–488

91 Harger, J.H. et al. (2002) Risk factors and outcome of varicella–zoster virus pneumonia in pregnant women. J. Infect. Dis. 185, 422–427

92 Grossman, S.R. et al. (2013) Identifying recent adaptations in large-scale genomic data. Cell 152, 703–713

93 Heron, M. (2012) Deaths: leading causes for 2009. Natl. Vital Stat. Rep. 61, 1–95

94 Scheinfeldt, L.B. et al. (2012) Genetic adaptation to high altitude in the Ethiopian highlands. Genome Biol. 13, R1

95 Bigham, A. et al. (2010) Identifying signatures of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 6, e1001116

96 Beall, C.M. et al. (2010) Natural selection on EPAS1 (HIF2a) associated with low hemoglobin concentration in Tibetan highlanders. Proc. Natl. Acad. Sci. U.S.A. 107, 11459–11464

97 Simonson, T.S. et al. (2010) Genetic evidence for high-altitude adaptation in Tibet. Science 329, 72–75

98 Mejía, O.M. et al. (2005) Genetic association analysis of chronic mountain sickness in an Andean high-altitude population. Haematologica 90, 13–19

99 Moore, L.G. et al. (2001) Oxygen transport in tibetan women during pregnancy at 3,658 m. Am. J. Phys. Anthropol. 114, 42–53

100 Wilson, M.J. et al. (2007) Greater uterine artery blood flow during pregnancy in multigenerational (Andean) than shorter-term (European) high-altitude residents. Am. J. Physiol. Regul. Integr. Comp. Physiol. 293, R1313–R1324

101 Beall, C.M. (2007) Two routes to functional adaptation: Tibetan and Andean high-altitude natives. Proc. Natl. Acad. Sci. U.S.A. 104 (Suppl. 1), 8655–8660

102 Hales, C.N. and Barker, D.J. (2001) The thrifty phenotype hypothesis. Br. Med. Bull. 60, 5–20

103 Raji, A. et al. (2001) Body fat distribution and insulin resistance in healthy Asian Indians and Caucasians. J. Clin. Endocrinol. Metab. 86, 5366–5371


Finding the lost treasures in exome sequencing data

David C. Samuels1*, Leng Han2*, Jiang Li3, Sheng Quanghu3, Travis A. Clark4, Yu Shyr3, and Yan Guo3

1 Center for Human Genetics Research, Vanderbilt University, Nashville, TN, 37232, USA
2 Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, TX, 77030, USA
3 Center for Quantitative Sciences, Vanderbilt University, Nashville, TN, 37232, USA
4 Vanderbilt Technology for Advanced Genomics, Vanderbilt University, Nashville, TN, 37232, USA

Exome sequencing is one of the most cost-efficient sequencing approaches for conducting genome research on coding regions. However, significant portions of the reads obtained in exome sequencing come from outside of the designed target regions. These additional reads are generally ignored, potentially wasting an important source of genomic data. There are three major types of unintentionally sequenced read that can be found in exome sequencing data: reads in introns and intergenic regions, reads in the mitochondrial genome, and reads originating in viral genomes. All of these can be used for reliable data mining, extending the utility of exome sequencing. Large-scale exome sequencing data repositories, such as The Cancer Genome Atlas (TCGA), the 1000 Genomes Project, the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project, and the Sequence Read Archive, provide researchers with excellent secondary data-mining opportunities to study genomic data beyond the intended target regions.

The rise of exome sequencing

Next-generation sequencing (see Glossary) has substantially decreased the cost of sequencing and has become the tool of choice for genomic studies. One of the most popular new sequencing approaches is exome sequencing (Figure 1), in which the coding regions of the full genome are targeted, captured, and sequenced. The exome represents approximately 1–1.5% of the human genome with approximately 50 million bp, but it accounts for over 85% of all mutations that have been identified in Mendelian disorders [1]. As a result, exome sequencing is currently an attractive and practical approach for the investigation of coding variations [2,3].

Targeted resequencing enables the enrichment of specific sequences from a whole-genomic library. Exome sequencing is an example of this approach, whereby the complete coding region of the genome is enriched for sequencing. However, many of the captured DNA fragments still derive from outside the targeted regions (Figure 2). As a result, intronic and intergenic regions may be sequenced, including promoters, conserved noncoding sequences, untranslated regions (UTR), miRNA target sites, and other potentially functional regions. In a typical exome-sequencing study, approximately 40–60% of the reads are off target [4–6] and all or most of these off-target reads are usually ignored. This practice does not utilize the full potential of exome-sequencing data, because it overlooks a large amount of potentially useful data. Recent studies [5,7–10] have shown that off-target reads can be of good quality and can provide useful insights.
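The triage of reads summarized in Figure 2 can be illustrated with a minimal sketch. It assumes that each read has already been aligned and reduced to a pair (contig name or None if unmapped, flag for overlap with a target region). The function names, category labels, and mitochondrial contig names below are hypothetical and chosen only for illustration; they are not taken from any of the tools named in the figure.

```python
from collections import Counter

# Assumed names for the mitochondrial contig; they depend on the reference build.
MITO_CONTIGS = {"chrM", "MT"}


def classify_read(contig, on_target):
    """Assign one read to a branch of the Figure 2 flow.

    contig    -- name of the contig the read aligned to, or None if unmapped
    on_target -- True if the alignment overlaps a capture target region
    """
    if contig is None:
        return "unmapped"  # candidate viral DNA or contamination
    if on_target:
        return "targeted"
    if contig in MITO_CONTIGS:
        return "mtDNA"
    return "intronic_or_intergenic"


def tally_reads(reads):
    """Count reads per category; `reads` is an iterable of (contig, on_target)."""
    return Counter(classify_read(contig, on_target) for contig, on_target in reads)


# Hypothetical toy input: two on-target reads and one read of each other kind.
example = [("chr1", True), ("chr1", False), ("chrM", False), (None, False), ("chr2", True)]
counts = tally_reads(example)
```

In practice the two inputs would come from an alignment file and the kit's target coordinates; the sketch only makes the order of the two decisions (mapped first, then targeted) explicit.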

Reads aligned outside the target regions

There are three major exome-sequencing capture kits currently in broad use: Illumina TruSeq, Agilent SureSelect, and NimbleGen SeqCap EZ. All three platforms start with whole-genomic libraries made from fragmented genomic DNA and use biotinylated oligonucleotide baits

Review

Glossary

Bait: the hybridization probe designed to capture effectively the coordinates of the target region to be sequenced. The bait design differs by manufacturer and method. Some methods use baits that tile the target region, whereas others use baits that do not overlap and differ in distance between the baits.

Exome sequencing: selectively capturing the exome (coding regions) and other content in a whole-genome library before sequencing. This enables deeper coverage of the genomic region that is enriched in disease-causing variants in a megabase-sized DNA library instead of sequencing a lower coverage gigabase-sized whole-genome library.

Next-generation sequencing: high-throughput DNA sequencing using massively parallel reactions generating millions of independent reads. The methodology employs a variety of technologies, including highly parallelized pyrosequencing, sequencing-by-synthesis, sequencing by ligation, and single molecule sequencing methods.

Off-target reads: the sequencing reads that do not align to the target region.

Oncogenic virus: a virus associated with cancer. The cause of this association is generally due to the insertion of the viral genome into the host genome in a location that disrupts a crucial host gene, leading to the expansion of that cell into a tumor.

Reads: the fragments of DNA sequences generated that represent data from a unique fragment of the sequencing library. A typical next-generation sequencing run generates millions of reads per sample.

Target regions: the region of interest defined for enrichment. The genomic coordinates of a target region are used to design the capture baits, probes, or primers for enrichment and vary by exome sequencing kit.

Unmappable reads: the reads that are not aligned to the human genome.

0168-9525/$ – see front matter

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.07.006

Corresponding author: Guo, Y. ([email protected]).

Keywords: mitochondria; exome capture; virus; virus integration; mtDNA copy number; unmapped read.

* These authors contributed equally to this article.

Trends in Genetics, October 2013, Vol. 29, No. 10 593


complementary to the design targets to enrich for exons and other vendor-specific content. The target regions for these three exome capture kits vary and range from 37.6 to 62.1 million bp. The capture kits can enrich just the exome, exons plus 3′ and 5′ UTRs, and other content. The kits also differ in their target regions, bait length, bait density, and the molecule used for capturing.

Other capture methods, including array-based, multiplex PCR, selector-probe (HaloPlex), and molecular-inversion probe (MIP) methods, are also available. The capture efficiency varies by capture method. For example, one group [11] (using the NimbleGen 2.1M array-based capture kit) reported having 64.5% of sequenced bases outside the target regions and 31.9% of the reads more than 500 bp away from the target regions; another group [1] (using Agilent 244K microarrays for target enrichment) reported over 50% of sequenced bases outside the target regions. The capture efficiency of the three major exome capture kits has been reported by multiple studies. For Agilent SureSelect, the capture efficiency is between 42% and 58% [4–6]; for Illumina TruSeq, it is between 45% and 46% [5]; and for NimbleGen SeqCap EZ it is between 50% and 53% [4,6]. Although a capture efficiency of less than 50% can be misinterpreted as failure of the sequencing method, the raw number of reads mapped to the target regions and the median depth of the target regions are more informative parameters to measure the success of the capture method. The unmapped fraction of reads can be anywhere from 5% to 19% [5] and it is related to many factors, such as the type of capture kit, DNA quality, aligner settings, and the completeness of the reference sequence used for the alignment. There is also variability introduced during library preparation and sequencing. Even repeat sequencing of a sample can generate different metrics of capture efficiencies [6].
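To make the three quantities discussed above concrete, the following sketch computes them for a toy data set. It assumes a single contig, half-open (start, end) coordinates, and a read-based definition of capture efficiency (the fraction of mapped reads overlapping any target); the cited studies may define efficiency per base or differently. All function names are hypothetical.

```python
from statistics import median


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(read, targets):
    """True if the read shares at least one base with any target."""
    start, end = read
    return any(start < t_end and t_start < end for t_start, t_end in targets)


def capture_metrics(reads, targets):
    """Return on-target read count, read-based efficiency, and median target depth."""
    targets = merge_intervals(targets)
    on_target = [r for r in reads if overlaps_any(r, targets)]
    # Per-base depth over all target positions (simple, not optimized).
    depths = [
        sum(1 for s, e in on_target if s <= pos < e)
        for t_start, t_end in targets
        for pos in range(t_start, t_end)
    ]
    return {
        "on_target_reads": len(on_target),
        "capture_efficiency": len(on_target) / len(reads) if reads else 0.0,
        "median_target_depth": median(depths) if depths else 0,
    }


# Toy example: one 10-bp target and four 10-bp reads, one of them far off target.
metrics = capture_metrics(
    reads=[(95, 105), (100, 110), (105, 115), (300, 310)],
    targets=[(100, 110)],
)
```

The example shows why the raw on-target count and the median depth are reported alongside the efficiency: the efficiency alone says nothing about whether the targets were covered deeply enough.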

[Figure 1: plot of cumulative PUBMED publications with ‘exome’ (y axis, 0 to 2000) against year (x axis, 2009 to 2014).]

Figure 1. Results of a PUBMED search for papers using the term ‘exome’, through 1 July, 2013, showing the rapid and recent spread of this sequencing method.

[Figure 2: flow diagram. Node labels: Exome-sequencing reads; Mapped? (Yes/No); Mapped reads; Unmapped DNA reads; Targeted? (Yes/No); Targeted DNA; Untargeted DNA; Intronic DNA; Intergenic DNA; mtDNA; Viral DNA; Contamination. Tool labels: Any SNP Caller; MitoSeek; PathSeq; VirusSeq; ViralFusionSeq.]

Figure 2. A flow diagram illustrating how off-target reads can be identified from exome-sequencing data. Currently available tools for the analysis of the different types of off-target read are given. Abbreviation: SNP, single nucleotide polymorphism.

SNPs outside the exonic regions

Many functional elements are located outside the exonic regions [12–15]. Although the role of introns was unclear for many years, several studies have now established some functional significance for introns [11,16–19]. For example, a study [20] identified two mutations within the core promoter of the telomerase reverse transcriptase (TERT) gene in 50 of the 70 melanomas examined. Intergenic regions comprise approximately 70% of the human genome. A previous study [5] showed that approximately 50% of the single nucleotide polymorphisms (SNPs) identified from exome sequencing were in the intended target regions, that 27% were in the flanking regions (within 200 bp) of the target regions, and that the remaining 24% were in regions >200 bp away from the target regions.
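The three-way split of SNPs into target, flanking, and more distant regions can be sketched as follows. This is a minimal illustration in Python: the function names, the inclusive single-chromosome coordinates, and the toy intervals are assumptions, and only the 200-bp flank comes from the text.

```python
def distance_to_targets(pos, targets):
    """Distance in bp from pos to the nearest target interval (0 if inside).

    targets: non-empty list of (start, end) tuples with inclusive coordinates,
    all on the same chromosome (a simplifying assumption).
    """
    if not targets:
        raise ValueError("targets must not be empty")
    best = None
    for start, end in targets:
        if start <= pos <= end:
            return 0
        d = start - pos if pos < start else pos - end
        if best is None or d < best:
            best = d
    return best

def classify_snp(pos, targets, flank=200):
    """Label a SNP position as 'target', 'flanking', or 'distant'."""
    d = distance_to_targets(pos, targets)
    if d == 0:
        return "target"
    if d <= flank:
        return "flanking"
    return "distant"

# Toy target regions and SNP positions
targets = [(1000, 1200), (5000, 5300)]
snps = [1100, 1350, 1500, 4900, 9000]
labels = [classify_snp(p, targets) for p in snps]
print(list(zip(snps, labels)))
```

A linear scan over the intervals is enough for a sketch; a real exome has hundreds of thousands of targets, where a sorted list with binary search or an interval tree would be the usual choice.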

Although exome sequencing is not designed to identify regulatory SNPs in intronic and intergenic regions, off-target reads from this type of experiment should not be discarded a priori. One of the best examples of the usefulness of these off-target data is a study of Tibetans at high altitude [11], which found a pair of intronic SNPs in endothelial PAS domain protein 1 (EPAS1) with the greatest Tibetan-Han frequency difference. The authors specifically noted that these SNPs were outside the intended target regions of the exome sequencing, drawing attention to the potential value of these reads.

These and other studies demonstrate that reliable SNPs can be identified through off-target reads captured by exome sequencing [5], suggesting that it is worth searching for such SNPs even though the experiment was not designed to find them. However, it has been observed that the SNP false positive rate increases as the reads align further away from the captured regions [5]. Thus, because of the higher error rate associated with off-target reads, more stringent filter criteria, such as depth and genotyping quality score, need to be applied to SNPs outside the captured regions to achieve the same quality as SNPs inside the captured regions. For example, the transition:transversion ratio is commonly used as a quality measurement for SNPs identified through exome sequencing [5,21,22]. To achieve the same transition:transversion ratio for SNPs outside the target regions as for SNPs inside the target regions, stronger filters, such as higher depth, are required [5]. Another artifact of exome sequencing is the pseudogene effect, where some intergenic regions are sequenced to abnormally high depth (>1000). This anomaly seems to be consistent regardless of the type of capture kit used [5]. It has been speculated that such phenomena are caused by homologies of pseudogenes. The most commonly used SNP detection framework, the Genome Analysis Tool Kit (GATK) [22], developed by the Broad Institute, suggests that SNPs in such regions should be ignored.
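For readers unfamiliar with the transition:transversion ratio used as a quality measure above, a minimal Python sketch of the calculation follows. The function names and the toy SNP list are illustrative assumptions; a transition is a purine-to-purine or pyrimidine-to-pyrimidine change, and every other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps):
    """Transition:transversion ratio for a list of (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# Toy call set: three transitions (A>G, C>T, G>A), two transversions (A>C, T>G)
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = titv_ratio(calls)
print(ratio)
```

In practice the ratio would be computed separately for on-target and off-target call sets, and the off-target filters tightened until the two values are comparable.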

The mitochondrial genome in exome sequencing

Mitochondria have an important role in cellular energy metabolism, free radical generation, and apoptosis [23,24]. mtDNA is a maternally inherited 16 569-bp closed-circle genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides. Mitochondrial dysfunction is an important cause of many neurological diseases [25] and drug toxicities [26,27], and may contribute to carcinogenesis and tumor progression [28,29]. Furthermore, the mitochondrial genome is a fundamental tool for human population genetics and has had a critical role in mapping the migration of humanity across the globe [30–33].

Because the mitochondrial genome is almost all coding sequence, it fits every reasonable definition of the exome. However, mtDNA is not targeted in any of the currently used exome-sequencing methods. Instead, mtDNA sequence can be extracted from exome-sequencing data [2,10]. The average coverage of the mitochondrial genome from exome sequencing is approximately 100×, easily surpassing the average coverage of even the targeted genomic regions [10]. The relatively high coverage of mtDNA is due to the high copy number of mtDNA per cell, on the order of hundreds to several hundred thousand copies per cell, depending on the tissue type [34]. This should be contrasted with techniques that specifically target the mitochondrial sequence, which can produce an average depth of tens of thousands of reads across the mitochondrial genome [35–38]. Given that cells typically contain a very large number of copies of mtDNA, the proportion of mutant mtDNA in a mixture of wild type and mutant mtDNA (heteroplasmy) can range almost continuously from 0 to 100%. Pathogenic mtDNA mutations are typically heteroplasmic in an individual, with asymptomatic carriers of the mutations having a low heteroplasmy level of the pathogenic mutation [39]. An average read depth of only approximately 100× means that, although polymorphisms can be accurately determined, the identification of heteroplasmic mtDNA variations is limited to those present in >10% of the mtDNA molecules in the sample. However, these are likely to be the most clinically relevant cases, again pointing to the potential utility of analyzing these sequences. Researchers have started to infer mitochondrial mutation information from exome-sequencing data. The best example is The Cancer Genome Atlas (TCGA) project, in which all mtDNA somatic mutations were inferred from exome-sequencing data. For example, the current somatic mutation results for breast cancer in TCGA [40] contain exome-sequencing data from 776 tumors and report 325 mtDNA somatic mutations derived from off-target reads from the exome-sequencing data.
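Why a depth of roughly 100× restricts heteroplasmy detection to fairly common variants can be illustrated with a simple binomial sampling model. The sketch below is in Python and rests on stated assumptions: reads are sampled independently, sequencing error is ignored, and the requirement of at least five supporting reads is a hypothetical calling threshold, not one taken from the cited studies.

```python
from math import comb

def prob_detect(depth, fraction, min_alt_reads):
    """P(at least min_alt_reads variant reads) under a binomial sampling model.

    depth:         number of reads covering the site
    fraction:      true heteroplasmy level (0..1)
    min_alt_reads: hypothetical minimum number of supporting reads to call a variant
    """
    p_below = sum(
        comb(depth, k) * fraction**k * (1 - fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below

# Chance of seeing >= 5 variant reads at 100x depth for two heteroplasmy levels
p_10 = prob_detect(100, 0.10, 5)  # variant in 10% of mtDNA molecules
p_1 = prob_detect(100, 0.01, 5)   # variant in 1% of mtDNA molecules
print(f"10% heteroplasmy: {p_10:.3f}, 1% heteroplasmy: {p_1:.4f}")
```

Under these assumptions a 10% variant is almost always sampled often enough to be called, whereas a 1% variant rarely is; a real threshold would also have to exceed the per-base sequencing error rate.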

An important complication in aligning DNA reads to the mitochondrial genome is the presence of nuclear copies of mitochondrial sequence (nuMTs) [41,42]. nuMTs can cause ambiguity about whether a read should map to the nuclear or the mitochondrial genome. The simplest way to obtain the mitochondrial genome is to align the raw reads against the mitochondrial reference genome directly and then filter out the nonaligned reads, thus ignoring the nuMTs. The disadvantage of this approach is that the reads that do derive from the nuMTs may introduce false heteroplasmic variability in the mtDNA sequence. A middle approach is to align the reads against both the nuclear and mitochondrial genomes simultaneously. When a read has multiple locations to which it may be mapped, aligners such as BWA [43] will randomly choose among the possible locations. This has the disadvantage of treating the nuMTs and the mitochondrial genome equally, ignoring the very large copy number difference. As a result, many of the reads coming from the mtDNA will be falsely aligned to the nuMTs, causing an artificially high coverage of the nuMTs and an artificially low coverage of the mtDNA. A third choice gives precedence to the nuMTs by first aligning reads against the nuclear genome and then aligning only the nonaligned reads to the mitochondrial genome. This approach will have the most extreme misalignment of true mtDNA reads to the nuclear DNA (potentially leading to false SNP calls in the nuclear DNA), which will lower the coverage of the mitochondrial genome and decrease the chance of detecting true variants. The third approach is also the most conservative and time consuming, involving two alignment processes and leaving no chance of misaligning any nuMT reads to the mitochondrial genome. The second approach offers the best balance between time consumption and misalignment rate and has been implemented in MitoSeek [44], which can be used to extract mitochondrial mutation and heteroplasmy information from exome-sequencing data.

mtDNA copy number is highly variable and has been suggested to be associated with many diseases, including cancer [45–48]. Thus, it is an important mitochondrial statistic that can be derived from exome-sequencing data. Traditional methods for evaluating mtDNA copy number involve quantitative (q)PCR [49]. A more recent method relies on a sequencing-based assay of mtDNA copy number that draws on the unbiased nature of next-generation sequencing and incorporates techniques developed for RNA expression profiling [50]. Although the authors claim that this assay reports absolute mitochondrial copy number, we argue that the amount of library constructed will affect the copy number count. For example, it has been shown that the fraction of captured mitochondrial sequences in exome-sequencing data is proportional to the relative abundance of the corresponding mitochondrial genome in the original total DNA extract [10]. Based on this observation, we conclude that relative, but not absolute, mtDNA copy number is detectable through exome-sequencing data. The relative mtDNA copy number extracted from exome-sequencing data can be useful when studying tumor samples, for example in association tests with phenotypes such as tumor stage and metastasis stage. The recently developed software MitoSeek [44] also computes relative mtDNA copy number from exome-sequencing data.
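One simple way to express a relative (not absolute) mtDNA copy number is to compare the mitochondrial read fraction between two samples. The Python sketch below is an illustrative assumption of how such an index could be formed; it is not the formula used by MitoSeek, and the read counts are made up.

```python
def mt_read_fraction(mt_reads, total_mapped_reads):
    """Fraction of mapped reads that align to the mitochondrial genome."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return mt_reads / total_mapped_reads

def relative_mt_copy(sample, reference):
    """Ratio of mitochondrial read fractions between two samples.

    sample, reference: (mt_reads, total_mapped_reads) tuples.
    The result is only meaningful as a comparison between samples prepared
    with the same capture kit and protocol (an assumption of this sketch).
    """
    return mt_read_fraction(*sample) / mt_read_fraction(*reference)

# Made-up read counts for a tumor and its matched normal sample
tumor = (120_000, 60_000_000)
normal = (300_000, 75_000_000)
ratio = relative_mt_copy(tumor, normal)
print(f"tumor/normal relative mtDNA content: {ratio:.2f}")
```

Normalizing by the total number of mapped reads removes the dependence on sequencing depth, which is why only the ratio between samples, and not the raw count, carries information.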

Pathogen DNA and integration sites

Finally, it is important to consider the portion of reads from exome sequencing that does not map to the reference genome. Some of these reads may represent viral DNA, either as free viral DNA or as viral genomes that have been incorporated into the genome of a host. Detecting viral DNA is of particular importance because integration of viral DNA into the host genome has an important role in initiating cancer. Many viruses integrate into the genome of their host cells to replicate and, therefore, mutagenesis caused by viral infection may be quite common. Typically, viruses trigger tumor development by altering host genes or by suppressing the immune system of the host, causing inflammation over a long period of time. Most viruses lack clearly identifiable oncogenes capable of cellular transformation and instead mediate oncogenic transformation through a process termed insertional mutagenesis (IM). The molecular mechanisms of viral IM can vary, but most involve viral insertion within tumor suppressor genes or upregulation of cellular oncogenes in close proximity to the site of viral integration via cis and trans effects of promoter and enhancer sequences within the viral long terminal repeats (LTRs). Known oncogenic viruses [such as the hepatitis B virus (HBV) for liver cancer and the human papillomavirus (HPV) for head and neck cancer and ovarian cancer] are estimated to cause 15–20% of all cancers in humans [51,52]. Understanding the viral integration pattern of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.

Viral genomes have been detected using high throughput-sequencing technology [53–57]. The idea of using off-target reads to detect viruses was introduced a few years ago. In general, viruses can be detected through exome sequencing either by detecting viral genome sequences that have been integrated into the host DNA or by inadvertently capturing the viral sequence itself. The presence of HPV [8] and HBV [58,59] has been detected through analysis of exome-sequencing data. Tools for detecting virus sequence through exome-sequencing data have also been developed. For example, PathSeq was developed to identify viruses through sequencing data of human samples [7]. VirusSeq was developed to identify viral sequences using exome-sequencing or RNAseq data [8]. Most recently, ViralFusionSeq was developed [60] to discover viral integration events and to reconstruct fusion transcripts at single-base resolution. Theoretically, bacteria can also be detected in exome-sequencing data provided they are present. For example, PathSeq [7] is designed to capture both bacterial and viral sequences.

One of the challenges associated with identifying viral sequences through exome-sequencing data is the rapid mutation rate of some viruses. DNA viruses have a mutation rate of between 10⁻⁶ and 10⁻⁸ mutations per base per generation, and RNA viruses have an even faster mutation rate of 10⁻³–10⁻⁵ per base per generation [61]. There are two possible solutions for identifying viral sequences with a high mutation frequency. First, the number of allowed mismatches per read can be increased. The typical read length of exome sequencing is from 75 to 100 bp. The default number of mismatches allowed per read for most popular aligners, such as BWA [43] and Bowtie [62], is usually two. Allowing more mismatches during the viral genome alignment can alleviate the problem caused by the fast viral mutation rate. Second, a virus reference panel can be created that includes all known variations of a targeted virus. Although this method can increase the alignment time, it is more likely to be accurate than simply allowing more mismatches. However, it does have the disadvantage of potentially failing to detect viral strains that have evolved significantly from the strains in the reference panel. Another challenge associated with virus detection in exome-sequencing data is the potential homology between the reference human genome and viral genomes, similar to the problem of nuclear genome copies of the mitochondrial genome described in the previous section. One conservative approach to solving this problem is to use only reads unmapped to the human genome for the viral genome alignment.
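The effect of the mismatch allowance can be seen in a toy example. The following Python sketch performs a naive ungapped scan of a read against a reference and reports every position within a given mismatch budget. It is an illustration only and does not reflect how BWA or Bowtie are implemented; the sequences and function names are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_hits(read, reference, max_mismatches=2):
    """Naive ungapped search: (position, mismatches) for every acceptable placement."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[i:i + len(read)])
        if mm <= max_mismatches:
            hits.append((i, mm))
    return hits

# Toy viral reference and a read from a diverged strain:
# the read differs from reference[3:15] at three positions.
reference = "TTGACCGATAGGCTAACGTTCA"
read = "GCCGACAGGCGA"

strict = find_hits(read, reference, max_mismatches=2)
relaxed = find_hits(read, reference, max_mismatches=3)
print("2 mismatches allowed:", strict)
print("3 mismatches allowed:", relaxed)
```

The true placement at position 3 is only recovered once three mismatches are tolerated. The trade-off is that a looser threshold also admits more spurious placements, which is one reason a reference panel of known strains can be the more accurate option.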

The location of virus integration into the host genome may have a role in disease etiology [63–66]. However, identifying the sites of virus integration using exome-sequencing data is challenging. For paired-end read data, a single DNA fragment will have sequence reads on both ends. During alignment, discordant pairs can be detected in which one read is aligned to the viral genome whereas its mate is aligned to the human genome, a good indicator of a possible intervening integration site. To find the exact integration site, read-through reads (in which the breakpoint lies within a read) need to be examined. Existing structural variant detection tools, such as BreakDancer [67], can be used to detect integration sites if the viral genome reference is added to the human genome reference before alignment. VirusSeq [8] detects integration sites by first identifying discordant read pairs and then clustering the discordant read pairs that support the same integration event. By contrast, ViralFusionSeq [60] uses a more sophisticated model to detect breakpoints that support viral fusion. Many viral fusion sites have been identified through exome sequencing. For example, in a study of liver cancer, HBV integration was observed in 70 out of 81 liver cancer samples [58]. Furthermore, HBV integration sites have also been identified through exome sequencing in a separate liver study [59].
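The general idea of finding discordant pairs and clustering them can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (alignments given as plain tuples, fixed-width bins on the human coordinate, a hypothetical minimum support of two pairs); it is not the algorithm implemented by VirusSeq, ViralFusionSeq, or BreakDancer.

```python
from collections import defaultdict

def find_integration_candidates(pairs, viral_contigs, window=500, min_support=2):
    """Cluster human/virus discordant read pairs into candidate integration regions.

    pairs:         iterable of ((contig1, pos1), (contig2, pos2)) mate alignments
    viral_contigs: set of contig names that belong to viral references
    window:        bin width in bp on the human coordinate (hypothetical choice)
    min_support:   minimum number of discordant pairs per bin (hypothetical choice)
    """
    bins = defaultdict(int)
    for (c1, p1), (c2, p2) in pairs:
        v1, v2 = c1 in viral_contigs, c2 in viral_contigs
        if v1 == v2:
            continue  # both mates human, or both viral: not a human/virus junction
        human_contig, human_pos = (c2, p2) if v1 else (c1, p1)
        viral_contig = c1 if v1 else c2
        bins[(human_contig, human_pos // window, viral_contig)] += 1
    return sorted(
        (contig, b * window, virus, n)
        for (contig, b, virus), n in bins.items()
        if n >= min_support
    )

# Toy alignments: two pairs support the same chr5/HBV junction
pairs = [
    (("chr5", 1295100), ("HBV", 1800)),
    (("HBV", 1750), ("chr5", 1295300)),
    (("chr5", 1295250), ("chr5", 1295500)),  # ordinary human pair
    (("chr2", 40000), ("HBV", 300)),         # single unsupported pair
]
candidates = find_integration_candidates(pairs, viral_contigs={"HBV"})
print(candidates)
```

Fixed bins can split one event across a bin boundary, so a real implementation would cluster by distance instead and would then inspect read-through reads to place the breakpoint at base resolution.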

Virus detection through exome sequencing has several limitations. First is the obvious limitation that this method can only detect DNA viruses or RNA viruses that are reverse transcribed and have a DNA phase. To detect an RNA virus, RNAseq technology needs to be used [68–72]. Second, it is highly dependent on the number of reads sequenced. If the depth of exome sequencing is low, the chance of detecting any virus also decreases. Finally, it is impractical for exome sequencing to detect any novel virus, or a virus with variants that have not been previously described. Nevertheless, there are successful examples of detection of viral genomes from exome sequencing, providing another example of the value of reconsidering off-target reads.

Concluding remarks

Exome-sequencing data are now becoming widely available for secondary uses through efforts to encourage data sharing, such as TCGA (currently 15 000 exomes) and the NHLBI Exome Sequencing Project (6500 exomes). As early as 2003, it was widely predicted that the price of whole-genome sequencing for the human genome would drop to under US$1000 [73–75]. However, with currently available technologies, achieving an average of 30× coverage in whole-genome sequencing still costs over US$5000 a sample, whereas exome sequencing at 30× coverage costs under US$500. There is always a possibility that an advance in technology will reduce the cost of whole-genome sequencing to a price comparable to that of exome sequencing. However, the extra cost associated with the analysis of whole-genome sequencing data is likely to remain significantly higher. The storage and processing time of whole-genome sequencing data can be 10 to 20 times more than that of exome sequencing data. Until these limitations of whole-genome sequencing cost and data storage are overcome, the growing amount of exome data available can be usefully mined for additional research purposes.

Another future development that could affect the types of secondary analysis we have outlined here is improvement in exome capture technology to eliminate or significantly reduce off-target reads. Exome capture technology has been continuously improving since it was introduced. However, the capture efficiency has increased only slightly over the years. Furthermore, the major reason for the increased capture efficiency has been the increased size of capture regions rather than improvement of the capture technology itself. For example, the Agilent SureSelect v1 kit captured 37 Mb of the human genome, whereas the latest SureSelect v5 kit captures 50 Mb. Additionally, the output of sequencing instruments has also increased over the years. The original Illumina GA II platform could output 20 million–25 million reads per lane. The newest Illumina HiSeq 2500 can produce 150 million–200 million reads per lane. Even after multiplexing three to four samples on a HiSeq lane, the number of reads sequenced per sample is still much higher than that achieved using the GA II machine. Thus, even though the percentage of reads not mapped to target regions might decrease, the raw number of reads not mapped to the target regions might increase because of the increase in machine throughput, suggesting that exome-sequencing data will continue to be good candidates for additional data mining despite technological improvements.
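The throughput argument can be checked with simple arithmetic. The Python sketch below uses the per-lane read counts quoted above; the off-target fractions (50% for the older setup, 40% for the newer one) are hypothetical values chosen only to illustrate the point.

```python
def off_target_reads(reads_per_lane, samples_per_lane, off_target_fraction):
    """Expected number of off-target reads per sample."""
    reads_per_sample = reads_per_lane / samples_per_lane
    return reads_per_sample * off_target_fraction

# GA II: upper end of the quoted range, one sample per lane,
# hypothetical 50% of reads off target
old = off_target_reads(25_000_000, 1, 0.50)

# HiSeq 2500: lower end of the quoted range, four samples per lane,
# hypothetical 40% of reads off target
new = off_target_reads(150_000_000, 4, 0.40)

print(f"GA II:      {old:,.0f} off-target reads per sample")
print(f"HiSeq 2500: {new:,.0f} off-target reads per sample")
print("More off-target reads on the newer platform:", new > old)
```

Even with the least favorable ends of both quoted ranges and a lower assumed off-target percentage, the newer instrument yields more off-target reads per sample in this example.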

Several tools are now available to mine these data for the ‘lost treasure’ buried in off-target reads. We have summarized here the possibilities and challenges in studying variants outside of the targeted exonic regions. These include mitochondrial variants, as well as viral genomes and virus–host integration sites. However, we note that another possibility for some of the unmapped reads is that they may still belong to the human genome, but may come from genome regions not covered by the current human genome reference, GRCh37. With GRCh38 (scheduled to be released during late 2013), it is likely that some of the previously unmapped reads will be mapped to the new human reference. There are also possibilities that have yet to be discovered, making studying the unmapped reads a potentially fruitful opportunity.

Although we are encouraging researchers to conduct additional data mining using existing data, we would also like to promote good study design. If the goal of the study is to survey all SNPs, then a whole-genome study should be used. If the goal of the study is to examine the mtDNA sequence, then mitochondria-targeted sequencing should be used, and if the goal is to detect the presence of viruses, then a virus-specific method should be used. Exome sequencing is a powerful tool, but it is not designed specifically for the additional targets described in this review. However, to get the fullest use of this low-cost sequencing technology, and of the massive amount of exome sequences currently publicly available, we should not ignore the unexpected DNA reads, which can comprise as much as half of the data produced by exome sequencing methods. The off-target reads must be subject to stringent quality control and, thus, we recommend an additional validation phase for all important findings observed through off-target reads whenever possible, including the use of targeted resequencing.

References

1 Ng, S.B. et al. (2010) Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–35


2 Durbin, R.M. et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073

3 Fu, W. et al. (2013) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220

4 Sulonen, A.M. et al. (2011) Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 12, R94

5 Guo, Y. et al. (2012) Exome sequencing generates high quality data in non-target regions. BMC Genomics 13, 194

6 Asan et al. (2011) Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 12, R95

7 Kostic, A.D. et al. (2011) PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396

8 Chen, Y. et al. (2013) VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics 29, 266–267

9 Larman, T.C. et al. (2012) Spectrum of somatic mitochondrial mutations in five cancers. Proc. Natl. Acad. Sci. U.S.A. 109, 14087–14091

10 Picardi, E. and Pesole, G. (2012) Mitochondrial genomes gleaned from human whole-exome sequencing. Nat. Methods 9, 523–524

11 Yi, X. et al. (2010) Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78

12 Djebali, S. et al. (2012) Landscape of transcription in human cells. Nature 489, 101–108

13 Dunham, I. et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74

14 Harrow, J. et al. (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774

15 Pei, B. et al. (2012) The GENCODE pseudogene resource. Genome Biol. 13, R51

16 Alberobello, A.T. et al. (2011) An intronic SNP in the thyroid hormone receptor beta gene is associated with pituitary cell-specific over-expression of a mutant thyroid hormone receptor beta2 (R338W) in the index case of pituitary-selective resistance to thyroid hormone. J. Transl. Med. 9, 144

17 Kawase, T. et al. (2007) Alternative splicing due to an intronic SNP in HMSD generates a novel minor histocompatibility antigen. Blood 110, 1055–1063

18 Moyer, R.A. et al. (2011) Intronic polymorphisms affecting alternative splicing of human dopamine D2 receptor are associated with cocaine abuse. Neuropsychopharmacology 36, 753–762

19 Rearick, D. et al. (2011) Critical association of ncRNA with introns. Nucleic Acids Res. 39, 2357–2366

20 Huang, F.W. et al. (2013) Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959

21 Guo, Y. et al. (2012) The effect of strand bias in Illumina short-read sequencing data. BMC Genomics 13, 666

22 DePristo, M.A. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498

23 Andrews, R.M. et al. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 23, 147

24 Verma, M. and Kumar, D. (2007) Application of mitochondrial genome information in cancer epidemiology. Clin. Chim. Acta 383, 41–50

25 Fernandez-Vizarra, E. et al. (2007) Impaired complex III assembly associated with BCS1L gene mutations in isolated mitochondrial encephalopathy. Hum. Mol. Genet. 16, 1241–1252

26 Lemasters, J.J. et al. (1999) Mitochondrial dysfunction in the pathogenesis of necrotic and apoptotic cell death. J. Bioenerg. Biomembr. 31, 305–319

27 Wallace, K.B. and Starkov, A.A. (2000) Mitochondrial targets of drug toxicity. Annu. Rev. Pharmacol. Toxicol. 40, 353–388

28 Modica-Napolitano, J.S. and Singh, K.K. (2004) Mitochondrial dysfunction in cancer. Mitochondrion 4, 755–762

29 Chen, E.I. (2012) Mitochondrial dysfunction and cancer metastasis. J. Bioenerg. Biomembr. 44, 619–622

30 Soares, P. et al. (2012) The Expansion of mtDNA Haplogroup L3 within and out of Africa. Mol. Biol. Evol. 29, 915–927

31 Yao, Y.G. et al. (2002) Phylogeographic differentiation of mitochondrial DNA in Han Chinese. Am. J. Hum. Genet. 70, 635–651

32 Bandelt, H.J. et al. (2003) Identification of Native American founder mtDNAs through the analysis of complete mtDNA sequences: some caveats. Ann. Hum. Genet. 67, 512–524

33 Kong, Q.P. et al. (2003) Phylogeny of east Asian mitochondrial DNA lineages inferred from complete sequences. Am. J. Hum. Genet. 73, 671–676

34 Bogenhagen, D. and Clayton, D.A. (1974) The number of mitochondrial deoxyribonucleic acid genomes in mouse L and human HeLa cells. Quantitative isolation of mitochondrial deoxyribonucleic acid. J. Biol. Chem. 249, 7991–7995

35 Guo, Y. et al. (2012) The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation. Mutat. Res. 744, 154–160

36 Tang, S. and Huang, T. (2010) Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system. Biotechniques 48, 287–296

37 He, Y. et al. (2010) Heteroplasmic mitochondrial DNA mutations in normal and tumour cells. Nature 464, 610–614

38 Ameur, A. et al. (2011) Ultra-deep sequencing of mouse mitochondrial DNA: mutational patterns and their origins. PLoS Genet. 7, e1002028

39 Falk, M.J. and Sondheimer, N. (2010) Mitochondrial genetic diseases. Curr. Opin. Pediatr. 22, 711–716

40 Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70

41 Hazkani-Covo, E. et al. (2010) Molecular poltergeists: mitochondrial DNA copies (numts) in sequenced nuclear genomes. PLoS Genet. 6, e1000834

42 Li, M. et al. (2012) Fidelity of capture-enrichment for mtDNA genome sequencing: influence of NUMTs. Nucleic Acids Res. 40, e137

43 Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760

44 Guo, Y. et al. (2013) MitoSeek: extracting mitochondria information and performing high throughput mitochondria sequencing analysis. Bioinformatics 29, 1210–1211

45 Shen, J. et al. (2010) Mitochondrial copy number and risk of breast cancer: a pilot study. Mitochondrion 10, 62–68

46 Yu, M. et al. (2007) Reduced mitochondrial DNA copy number is correlated with tumor progression and prognosis in Chinese breast cancer patients. IUBMB Life 59, 450–457

47 Tseng, L.M. et al. (2006) Mitochondrial DNA mutations and mitochondrial DNA depletion in breast cancer. Genes Chromosomes Cancer 45, 629–638

48 Bai, R.K. et al. (2011) Mitochondrial DNA content varies with pathological characteristics of breast cancer. J. Oncol. 2011, 496189

49 Bhat, H.K. and Epelboym, I. (2004) Quantitative analysis of total mitochondrial DNA: competitive polymerase chain reaction versus real-time polymerase chain reaction. J. Biochem. Mol. Toxicol. 18, 180–186

50 Castle, J.C. et al. (2010) DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing. BMC Genomics 11, 244

51 Parkin, D.M. (2006) The global health burden of infection-associated cancers in the year 2002. Int. J. Cancer 118, 3030–3044

52 Morissette, G. and Flamand, L. (2010) Herpesviruses and chromosomal integration. J. Virol. 84, 12100–12109

53 Barzon, L. et al. (2011) Applications of next-generation sequencing technologies to diagnostic virology. Int. J. Mol. Sci. 12, 7861–7884

54 Radford, A.D. et al. (2012) Application of next-generation sequencing technologies in virology. J. Gen. Virol. 93, 1853–1868

55 Chevaliez, S. et al. (2012) New virologic tools for management of chronic hepatitis B and C. Gastroenterology 142, 1303–1313

56 Li, L. and Delwart, E. (2011) From orphan virus to pathogen: the path to the clinical lab. Curr. Opin. Virol. 1, 282–288

57 Capobianchi, M.R. et al. (2013) Next-generation sequencing technology in clinical virology. Clin. Microbiol. Infect. 19, 15–22

58 Sung, W.K. et al. (2012) Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat. Genet. 44, 765–769

59 Jiang, Z. et al. (2012) The effects of hepatitis B virus integration into the genomes of hepatocellular carcinoma patients. Genome Res. 22, 593–601

60 Li, J.W. et al. (2013) ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution. Bioinformatics 29, 649–651

61 Drake, J.W. et al. (1998) Rates of spontaneous mutation. Genetics 148, 1667–1686


62 Langmead, B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25

63 Gozuacik, D. et al. (2001) Identification of human cancer-related genes by naturally occurring Hepatitis B Virus DNA tagging. Oncogene 20, 6233–6240

64 Mason, W.S. et al. (2010) Clonal expansion of normal-appearing human hepatocytes during chronic hepatitis B virus infection. J. Virol. 84, 8308–8315

65 Murakami, Y. et al. (2005) Large scaled analysis of hepatitis B virus (HBV) DNA integration in HBV related hepatocellular carcinomas. Gut 54, 1162–1168

66 Saigo, K. et al. (2008) Integration of hepatitis B virus DNA into the myeloid/lymphoid or mixed-lineage leukemia (MLL4) gene and rearrangements of MLL4 in human hepatocellular carcinoma. Hum. Mutat. 29, 703–708

67 Chen, K. et al. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681

68 Palacios, G. et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N. Engl. J. Med. 358, 991–998

69 Nakamura, S. et al. (2009) Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS ONE 4, e4219

70 Quan, P.L. et al. (2010) Astrovirus encephalitis in boy with X-linked agammaglobulinemia. Emerg. Infect. Dis. 16, 918–925

71 Briese, T. et al. (2009) Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog. 5, e1000455

72 Isakov, O. et al. (2011) Pathogen detection using short-RNA deep sequencing subtraction and assembly. Bioinformatics 27, 2027–2030

73 Robertson, J.A. (2003) The $1000 genome: ethical and legal issues in whole genome sequencing of individuals. Am. J. Bioeth. 3, 35–42

74 Mardis, E.R. (2006) Anticipating the 1,000 dollar genome. Genome Biol. 7, 112

75 Bennett, S.T. et al. (2005) Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382


The role of AUTS2 in neurodevelopment and human evolution

Nir Oksenberg and Nadav Ahituv

Department of Bioengineering and Therapeutic Sciences, and Institute for Human Genetics, University of California, San Francisco (UCSF), 1550 4th Street, San Francisco, CA 94158, USA

The autism susceptibility candidate 2 (AUTS2) gene is associated with multiple neurological diseases, including autism, and has been implicated as an important gene in human-specific evolution. Recent functional analysis of this gene has revealed a potential role in neuronal development. Here, we review the literature regarding AUTS2, including its discovery, expression, association with autism and other neurological and non-neurological traits, implication in human evolution, function, regulation, and genetic pathways. Through progress in clinical genomic analysis, the medical importance of this gene is becoming more apparent, as highlighted in this review, but more work needs to be done to discover the precise function and the genetic pathways associated with AUTS2.

Neurodevelopmental disorders

Neurodevelopmental disorders are characterized by motor, speech, cognitive, and behavioral dysfunctions caused by impairment in growth and development of the central nervous system (CNS). Neurodevelopmental disorders encompass, but are not limited to, intellectual disability (ID), developmental delay (DD), and autism spectrum disorders (ASDs) [1]. ASDs are known as pervasive developmental disorders that are common (1/88 in the USA) [2] and highly heritable [3]. ASDs are characterized by variable deficits in social communication, language, and restrictive and repetitive behaviors, and present as a wide spectrum of phenotypes [4]. Other neurological abnormalities, including ID, DD, epilepsy, sensory and motor abnormalities, gastrointestinal phenotypes, developmental regression, sleep disturbance, mood disorders, conduct disorders, aggression, and attention deficit hyperactivity disorder (ADHD), are also frequently associated with ASD [4]. Despite the heritability of these disorders, no single gene has been identified as causative for ASD alone. Rather, several different genes have been implicated in these disorders, containing either common variants with small effects or rare variants with larger consequences [5]. Over the years, studies examining individual patients, together with advances in sequencing technologies that have allowed the examination of a large number of individuals, have produced a myriad of new ASD, ID, and DD candidate genes, including AUTS2.

The discovery of AUTS2

AUTS2 was first identified in 2002 when it was found to be disrupted as a result of a balanced translocation in a pair of monozygotic (MZ) twins with ASD [6]. AUTS2 was mapped to 7q11, spans 1.2 Mb, and is approximately 340 kb upstream from the Williams–Beuren syndrome (WBS) critical region, a region that – when deleted – causes a neurodevelopmental disorder characterized by a distinctive ‘elfin’ facial appearance, a cheerful demeanor, developmental delay, strong language skills, and cardiovascular problems [7]. The AUTS2 protein sequence is highly conserved, with 62% amino acid conservation between humans and zebrafish [8]. It contains regions of homology to other proteins, such as the dwarfin family consensus sequence, human topoisomerase, and fibrosin (FBRS), a fibroblast growth factor [6]. In addition, the Drosophila gene tay has limited similarity to AUTS2. tay mutants have reduced walking speed and activity, thought to be associated with structural defects in the protocerebral bridge [9]. Sequence analysis of AUTS2 identified no membrane-spanning domains, but identified two proline-rich domains and a predicted PY (ProTyr) motif (PPPY) at amino acids 515–519 (Figure 1) [6]. The PY motif is a potential WW-domain-binding region that is involved in protein–protein interactions and is present in the activation domain of various transcription factors, suggesting that AUTS2 may be involved in transcriptional regulation [8]. Other predicted protein motifs include several cAMP and cGMP-dependent protein kinase phosphorylation sites, and putative N-glycosylation sites [6]. In addition, AUTS2 has eight CAC (His) repeats (Figure 1) [6], which have been shown to be associated with localization at nuclear speckles [10] – subnuclear structures where components of the RNA splicing machinery are stored and assembled [11]. Evidence of nuclear localization sequences as well as several predicted protein–protein interaction domains (SH2 and SH3) were also observed for this protein

Review

0168-9525/$ – see front matter

© 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2013.08.001

Corresponding author: Ahituv, N. ([email protected]).
Keywords: AUTS2; autism; neurodevelopment; human evolution.


(Figure 1). No evidence was found for any signal peptide in AUTS2, indicating that it is not secreted or exposed to the cellular membrane [12]. No DNA-binding domains have been identified. Taken together, sequence analysis has revealed limited insight into the function of this gene.
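Short motifs of the kind described in this section (a PY motif, a run of histidines) can be flagged with a simple pattern scan. The following is a minimal Python sketch, not the method used in [6] or [12]. The toy peptide is invented and is not the AUTS2 sequence, the PPxY pattern is the commonly cited WW-domain-binding consensus, and the function name and the eight-residue histidine threshold are illustrative assumptions.

```python
import re

# Hypothetical toy peptide, NOT the real AUTS2 sequence; it only exercises the scan.
TOY_SEQUENCE = "MSAPPPYLKR" + "H" * 8 + "GSTPPAYQ"

# Illustrative patterns (assumptions): PPxY as the WW-domain-binding consensus,
# and a run of at least eight consecutive histidines.
PATTERNS = {
    "PY motif (PPxY)": re.compile(r"PP.Y"),
    "His repeat (>=8 H)": re.compile(r"H{8,}"),
}


def scan_motifs(sequence):
    """Return (name, start, end, match) tuples using 1-based inclusive positions."""
    hits = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(sequence):
            # m.start() is 0-based; m.end() is exclusive, so it already equals
            # the 1-based inclusive end position.
            hits.append((name, m.start() + 1, m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    for name, start, end, match in scan_motifs(TOY_SEQUENCE):
        print(f"{name}: {start}-{end} {match}")
```

A pattern hit of this kind only nominates a candidate site. As the text notes, such predictions gave limited insight into function on their own.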

AUTS2 is a nuclear protein that is expressed in the CNS

Multiple reports have characterized the expression of AUTS2 in different organisms, concluding that it is primarily expressed in the brain. Northern blot shows strong AUTS2 expression in human fetal brain in the frontal, parietal, and temporal regions, but not in the occipital lobe. Expression was also identified in the skeletal muscle and kidney, with lower expression in the placenta, lung, and leukocytes [6]. In human post-mortem fetal brain, AUTS2 mRNA expression was found in the telencephalon (uniformly), ganglionic eminence, cerebellum anlagen, and, more weakly, in the medulla oblongata at 8 weeks. AUTS2 was also found to be strongly expressed in the cortical plate and ventricular zone. Fetal (23 weeks) human brains showed AUTS2 expression in the dentate gyrus, CA1 and CA3 pyramidal cell subregions, the ganglionic eminence, caudate nucleus, and putamen nuclei [13]. AUTS2 was also shown to be expressed in the neocortex and prefrontal cortex up to the late mid-fetal stage [14]. Gene expression profiles from 10 human ocular tissues found AUTS2 to be the 20th highest expressed gene in the sclera [15]. Sequencing of total RNA from human brain and liver found a large fraction of reads (up to 40%) to be within introns [16]. The authors identified enrichment of intronic RNA in brain tissues, particularly for genes involved in axonal growth and synaptic transmission. AUTS2 was among the 10 genes with the highest intronic RNA score in fetal brain. Three of the top 10 genes – neurexin 1 (NRXN1), protocadherin 9 (PCDH9), and methionine sulfoxide reductase A (MSRA) – have also been implicated in autism. In addition, for long introns, including the first half of AUTS2, there is a 5′ to 3′ slope in read coverage, with significantly higher levels of RNA at the 5′ end. The authors reason that, in the fetal brain, intronic RNAs are subjected to brain-specific regulatory pathways that regulate alternative splicing programs to control neuronal development [16].

A detailed analysis of Auts2 mRNA and protein expression in the developing mouse brain was published in 2010 [12]. The authors found that Auts2 is expressed in the developing cerebral cortex and cerebellum, and is located in the nuclei of neurons and some neuronal progenitors (Table 1). Auts2 expression was identified in numerous neuronal cell types, including glutamatergic neurons

[Figure 1: schematic of the AUTS2 protein. Annotated features, with amino acid positions:]
Dwarfin homology region (326–453)
Fibrosin homology region (645–798)
Proline-rich domain (288–471, 545–646)
Serine-rich domain (383–410)
Trinucleotide (H) repeat (1126–1133)
PY motif (515–519)
Human topoisomerase homology region (880–920)
Nuclear localization sequence (11–27, 70–79, 120–141)
Predicted cAMP and cGMP-dependent protein kinase phosphorylation site (13–16, 77–80, 116–119, 832–835, 849–852, 975–978, 1235–1238)
Y: Predicted SH2 interaction domain (Y971)
N: Putative N-glycosylation site (395–398, 785–788, 955–958, 1009–1012)
P: Predicted SH3 interaction domain (P67, P72, P73, P266, P332, P361, P364, P467, P468, P471, P638, P806, P1234)

Figure 1. Schematic of the AUTS2 protein. AUTS2 (1259 amino acids) is shown as a gray bar (individual amino acids in single-letter code). The locations of predicted domains, motifs, regions of homology, and other characterized sequences are shown below and within the protein. Numbers in parentheses represent the amino acid location. The figure is based on predicted features in [6,12].

Table 1. Auts2 expression in the developing mouse brain (a)

Timepoint (b) | Auts2 expression
E11 | mRNA barely detectable.
E12–13 | Colocalization with Tbr1 in the cortical preplate. Tbr1 is a transcription factor specific for postmitotic projection neurons.
E12–14 | High expression in the developing cortex, thalamus, and cerebellum. There is continued expression in these regions throughout development, but levels fluctuate and are found in gradients. Different markers show Auts2 expression in multiple neuronal subtypes in the developing cortex.
E14 | Expression in the hippocampal primordium. Transient expression in the locus ceruleus and vestibular nuclei.
E16 | Expression in the cerebral cortex is now a gradient of high rostral to low caudal expression.
E19 | Highest expression in inferior and superior colliculi and the pretectum.
P0 | Auts2 expression becomes progressively more superficial in the frontal cortex. Coexpression with Tbr1 becomes rare as Tbr1 becomes more selective to layer 6.
E16–P21 | Auts2 is expressed mostly in the frontal cortex, hippocampus, and the cerebellum. In addition, high expression levels were detected in the developing dorsal thalamus, olfactory bulb, inferior colliculus and the substantia nigra.
P21 | Expression in developing thalamic areas, including the anterior thalamic nuclei and in ventrolateral/ventromedial nuclei. Auts2 is restricted to superficial layers in frontal cortex. Auts2 is expressed throughout the subgranular zone and the granule cell layer of the hippocampus.

(a) Summary based on [12].
(b) E, embryonic day; P, postnatal day.


(cortex, olfactory bulb, hippocampus), GABAergic neurons (Purkinje cells), and tyrosine hydroxylase (TH)-positive dopaminergic neurons (substantia nigra and ventral tegmental area). Colocalization of Auts2 with only a subset of eomesodermin (Tbr2) and paired box 6 (Pax6)-positive cells was demonstrated in the ventricular and subventricular zones, suggesting that Auts2 might be expressed in the transition between radial glial and intermediate progenitors [12]. It was also suggested that Auts2 and T-box brain 1 (Tbr1) are coexpressed mostly in glutamatergic neuron populations in the forebrain, and other transcription factors likely influence expression of Auts2 in other regions. The report also notes that Auts2 could be expressed in a transient phase of neuronal maturation or differentiation in the cortex [12]. In zebrafish, using wholemount in situ hybridization, auts2 was shown to be expressed in the brain at 24, 48, 72 and 120 hours post-fertilization (hpf). At 48 hpf, auts2 is also expressed in the pectoral fin. From 24–130 hpf, auts2 is also weakly expressed in the eye [17]. In summary, AUTS2 has been shown to be a nuclear protein that is primarily expressed in the brain in various cell types as well as in regions implicated in ASD, such as the neocortex.

AUTS2 and ASD, ID, and DD

AUTS2 has been repeatedly implicated as an ASD candidate gene in recent years. Following the initial finding of an AUTS2 translocation in twins with autism [6], over 50 unrelated individuals with ASD, ID, or DD were identified with distinct structural variants disrupting the AUTS2 region in numerous different reports (Figure 2) [8,18–30]. Some of the structural variants are exclusively non-coding, suggesting that improper regulation and subsequent expression of AUTS2 could be involved in the progression of the disorder [17]. In addition to ASD, ID, and DD, many of these individuals also have other phenotypes, including epilepsy, brain malformations, or dysmorphic features. One group described an ‘AUTS2 syndrome’ in individuals with varying severity of growth and feeding problems, neurodevelopmental features, neurological disorders, dysmorphic features, skeletal abnormalities, and congenital malformations [26]. The spectrum of phenotypes observed in individuals with AUTS2 mutations is consistent with the wide range of ASD phenotypes. This suggests that AUTS2 is not associated with a specific subtype of ASD. It has also been noted that dysmorphic features were more pronounced in individuals with 3′ AUTS2 deletions, where most of the coding region resides [26]. However, copy-number variations (CNVs) at the AUTS2 locus have also been observed in unaffected individuals, indicating that structural rearrangements are tolerated in some cases [19,31]. This suggests that disruptions in AUTS2 may lead to neurodevelopmental disorders by being one of multiple genomic ‘hits’. The large number of independent publications implicating AUTS2 in ASD, ID, or DD provides strong evidence for its involvement in these disorders. It is worth noting, however, that no publication has shown single base-pair variants in the AUTS2 locus affiliated with ASD, despite numerous ASD-related exome sequencing studies [32–35].

The observation that AUTS2 variants are mostly CNVs may be due to the susceptibility of this region to

[Figure 2: schematic of the AUTS2 genomic region (scale bar, 250 kb). Recoverable labels, with reference numbers: human accelerated sequences HAR31 [66], HACNS174 and HACNS369 [67], and the human–Neanderthal sweep [68]; a SNP associated with alcohol consumption; structural variants grouped as ‘ASD, ID, and/or DD’ and ‘Other neurological phenotypes’. The other phenotypes labeled are ADHD [49]; dyslexia [23]; epilepsy [48]; and, from [24], LD with motor delay, failure to thrive with macrocephaly, SD with MCA, behavior problems, microcephaly with DF, and ataxia.]

Figure 2. Schematic of the AUTS2 genomic region. Numbers to the left of the lines correspond to reference numbers. Human accelerated sequences are shown as blue lines above the gene [66–68]. Structural variants [6,8,18–26,28–30,48,49] are represented as colored lines (red, deletion; orange, inversion; green, duplication; purple, translocation). Single-nucleotide polymorphisms (SNPs) are shown as magenta stars. rs6943555 is associated with alcohol consumption [42]. SNPs in [46,47] are associated with bipolar disorder. SNPs in [46] are reported to be in strong linkage disequilibrium with each other. Arrows in bars signify that the structural variant extends past the gene in that direction. Exons are depicted as light-blue rectangles, as defined by the RefSeq genes track in the University of California, Santa Cruz (UCSC) Genome Browser. DD, developmental delay; DF, dysmorphic features; HACNS, human accelerated conserved non-coding sequence; HAR, human accelerated region; ID, intellectual disability; LD, language disability; MCA, multiple congenital anomalies; SD, seizure disorder. Figure adapted from [17].


chromosomal breakpoints. A 2011 report showed that the offspring of older male mice have an increased risk of de novo CNVs in specific locations, including the Auts2 locus [36]. Another report found that hydroxyurea, a ribonucleotide reductase inhibitor, as well as aphidicolin, a DNA polymerase inhibitor, induce a high frequency of de novo CNVs in cultured human cells, and found a clustering of CNVs in AUTS2 [37]. Aphidicolin also induced CNV formation in the Auts2 locus in non-homologous end-joining deficient mouse embryonic stem cells [38]. Because the AUTS2 locus is a hotspot for CNVs, and individuals with ASD generally carry more CNVs than their unaffected siblings [39], examining if these high numbers of ASD-associated CNVs around AUTS2 are consequential, and not merely a result of their susceptibility to CNVs, warrants investigation. There is also the possibility that these CNVs affect regulatory regions of other genes, including the nearby WBS critical region.

In 2013, a genome-wide analysis of DNA methylation was published on ASD discordant and concordant monozygotic twins. A region in the AUTS2 promoter (chr7: 68701907; hg18) was the 42nd most differentially methylated CpG site in the genome, suggesting that not only sequence variation but also epigenetic changes to the AUTS2 locus could be involved in the development of ASD-related traits [40]. Significant DNA methylation differences were often observed near other genes that have been previously implicated in ASD, including methyl-CpG binding domain protein 4 (MBD4) and microtubule-associated protein 2 (MAP2). The authors cautioned, however, that it is difficult to draw conclusions about the causality of the differentially methylated sites due to small sample size, lack of corresponding RNA expression data, the use of whole blood rather than brain tissue, and potential epigenetic effects due to medicine [40].

Combined, the evidence for a causative role of AUTS2 in DD and ID is convincing. However, for ASD the evidence presented so far suggests that disruptions in AUTS2 can play a causative role, but to demonstrate causality more research needs to be done on cohorts of well-defined ASD patients and on the functional consequence of these disruptions.

AUTS2 and other neurological conditions

In addition to ASD, ID, and DD, AUTS2 has been implicated in other neurological disorders. Some of these disorders, such as epilepsy, have been shown to be linked to ASD. However, other AUTS2-associated phenotypes are ASD-independent. AUTS2 expression was found to have significant association with nicotine-dependence, cannabis-dependence, and antisocial personality disorder, although this study had a small number of cases and would need to be repeated with larger cohorts [41]. The study also suggested, although it did not reach significance, that AUTS2 expression is implicated in alcohol dependence [41]. In 2011 a genome-wide association meta-analysis found an AUTS2 non-coding single-nucleotide polymorphism (SNP), rs6943555, to be significantly associated with alcohol consumption [42]. The authors also reported increased AUTS2 expression in carriers of the minor A allele of rs6943555 compared with the T allele in 96 human prefrontal cortex samples. In addition, they identified significant differences in expression of Auts2 in whole-brain extracts of mice with differences in voluntary alcohol consumption. The authors also showed that downregulation of tay, which has sequence similarity to AUTS2, caused reduction in alcohol sensitivity in Drosophila [42]. Also implicating AUTS2 in drug dependence was a 2011 study showing that AUTS2 has a 3.01-fold change (downregulation) between 19 male heroin-dependent individuals and 20 controls in lymphoblastoid cell lines [43]. A follow-up study compared AUTS2 transcript levels of lymphoblastoid cell lines between 124 heroin-dependent and 116 control males using quantitative PCR – and found that average transcript levels of AUTS2 in the heroin-dependent group were significantly lower than in controls. They also found that AA homozygotes for rs6943555 were significantly over-represented in the heroin-dependent subjects [44]. Taken together, these reports show strong evidence for AUTS2 involvement in addiction and dependence.

In addition, the AUTS2 locus has been shown to be implicated or altered in individuals with schizoaffective disorder [45], bipolar disorder [46,47], epilepsy [48], ADHD [49], differential processing speed [50], suicidal tendencies under the influence of alcohol [51], and dyslexia [23], either through CNV or genome-wide association studies. A 2012 article sequenced balanced chromosomal abnormalities in patients with neurodevelopmental disorders, and found the AUTS2 locus to be perturbed in individuals with microcephaly, macrocephaly, ataxia, visual impairment, language disability, seizure disorder, dysmorphic features, behavioral problems, motor delay, or Rubinstein–Taybi syndrome [24]. It could be that the observation that most cases of AUTS2 structural variants are associated with ASD is attributed to more individuals with ASD being tested in this locus than patients with other neurological disorders – thereby leading to an underestimate in the link

Box 1. AUTS2 and non-neurological disorders and traits

A few reports have implicated AUTS2 in non-neurological disorders and traits. In 2004, 18 cases of childhood hyperdiploid acute lymphoblastic leukemia (ALL) were examined to identify the relationship between extra copies of chromosomes and increased gene expression. The authors identified multiple regions with increased expression that correlated poorly or not at all with the presence of extra copies of chromosomes, including 7q11.2. AUTS2 showed consistently higher expression levels in the cDNA samples of patients than in normal mononuclear cells, possibly implicating the gene in ALL [69]. In 2008 it was reported that paired box 5 (PAX5) can be rearranged with a variety of partners, including AUTS2 (one case) in pediatric ALL [70]. Two years later a second case of PAX5–AUTS2 fusion was identified in pediatric ALL [71]. In 2012, the third case of PAX5–AUTS2 fusion was identified in a patient with pediatric ALL, providing additional evidence that PAX5–AUTS2 is a recurring gene fusion in ALL [72]. Two of the three PAX5–AUTS2 cases had CNS diseases either at the time of diagnosis or relapse [72]. Individual reports, some of which identify single patients, have also implicated the AUTS2 locus in the aging of human skin [73], lung adenocarcinoma [74], lethal prostate cancer [75], the number of corpora lutea in pigs [76], early-onset androgenetic alopecia [77], and metastatic non-seminomatous testicular cancer [78]. Despite several reports suggesting a role for AUTS2 in non-neurological disorders and traits, disruption of AUTS2 is most often reported to be associated with neurological phenotypes.


between AUTS2 and other neurological phenotypes. Taken together, these observations suggest that AUTS2 dysfunction is not restricted to ASD, DD, or ID, but instead AUTS2 dysfunction is involved in a wide range of neurological disorders. In addition, a few studies implicate AUTS2 in non-neurological disorders and traits (Box 1).

The function and regulation of AUTS2

Despite the many articles linking AUTS2 to human disease and other traits, few papers have been published describing the function of the gene. In 2013, morpholino knockdowns of auts2 were performed in zebrafish by two different groups [17,26]. The observed phenotypes are summarized in Figure 3 and Table 2. Using HuC (Hu antigen C), a neuronal marker, both groups observed a decrease in neuronal cells in the brain (Figure 3B). Increased apoptosis and cell proliferation in the brain were reported, and it was noted that this observation could be a result of morphant cells failing to differentiate into mature neurons, which matches the HuC results [17]. Although increased cell proliferation was observed in one study [17], another study described decreased cell proliferation [26]. The differences in this phenotype could be due to differences in the stains used (proliferating cell nuclear antigen, PCNA, which marks cells in early G1- and S-phase versus phosphohistone-H3, a marker of cells in G2 and M phase). Both reports, however, found that auts2 knockdown cells show more replicating DNA, but fewer cells dividing into daughter cells. The craniofacial phenotype of the morphant fish was also characterized in one of the studies, finding that they have micrognathia (undersized jaw) and retrognathia (receded jaw) (Figure 3C) [26]. Given that migrating neural crest cells play an important role in craniofacial development [52], it is possible that this phenotype is a result of defects in neuronal cell development. In addition, less movement was reported in morphant fish, and this could be caused by fewer motor neuron cell bodies in the spinal cord, together with improperly angled and weaker projections, and/or fewer sensory neurons, both of which were observed in morphant fish [17]. Although one group observed overall stunted development [17] (Figure 3A), the other reported a phenotype restricted to the brain and jaw [26]. A potential cause for the difference in this phenotype, alongside the differences in cell proliferation phenotypes, could be the use of different morpholinos for these assays: an auts2 translational morpholino [17] versus splicing morpholinos [26]. Both groups were able to rescue the morphant phenotype by injecting full-length human AUTS2 mRNA together with the morpholino [17,26]. The morphant phenotype was also rescued by injecting the shorter C-terminal isoform of AUTS2, suggesting that the final nine exons of AUTS2 contain the crucial region of the gene, at least for the dysmorphic phenotype observed in knockdown fish. This is in line with the observation that dysmorphic features were more pronounced in individuals with 3′ AUTS2 deletions [26]. The zebrafish knockdown phenotypes appear to be an overall neurodevelopment defect, making it difficult to truly parse out the function

[Figure 3: zebrafish images comparing control and auts2 morpholino-injected fish. Panels: (A) Wholemount, 48 hpf; (B) HuC–GFP, 48 hpf; (C) Alcian blue, 120 hpf. Image labels: ot, ce, ret, ch, Mk.]

Figure 3. auts2 zebrafish knockdown phenotype. (A) At 48 hours post-fertilization (hpf), fish injected with a 5 bp mismatch auts2 morpholino (MO) control have a similar morphology to wild type fish, whereas fish injected with a corresponding translational MO display a stunted developmental phenotype that includes a smaller head, eyes, body, and fins. (B) At 48 hpf, HuC–GFP fish injected with a 5 bp mismatch auts2 control MO display normal levels of developing neurons in the brain, whereas translational MO injected fish display fewer developing neurons in the cerebellum (ce), optic tectum (ot), and retina (ret). (C) At 120 hpf, fish injected with an auts2 splicing MO and stained with Alcian blue show a significant reduction in the distance between the Meckel (Mk) and ceratohyal cartilages (ch) (shown as a red line) compared to controls, indicating a reduced lower-jaw size. Panels (A, B) adapted from [17], (C) adapted from [26].


of this gene. To understand AUTS2 function better, a conditional knockout mouse should be developed.

Given the observation that non-coding regions within AUTS2 have been implicated in human evolution (Box 2) and disease, the regulatory landscape around AUTS2 was investigated [17]. Twenty-three enhancers were identified in zebrafish, 10 of which are active in the brain. Three mouse brain enhancers were found to overlap a purely non-coding ASD-associated deletion, and four different mouse enhancers (two of which were positive in the brain) were found to reside in regions implicated in human evolution, supporting the idea that this gene is tightly regulated, and that enhancers for this gene are important for health and evolution [17]. The enhancers described are potentially only a subset of the AUTS2 regulatory landscape – and it is possible that some of these enhancers regulate other genes, including those in the WBS critical region. Although the precise function of AUTS2 remains to be elucidated, current reports show it to be a crucial and tightly regulated gene involved in neurodevelopment.
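Asking whether an enhancer falls within a deletion reduces to an interval-overlap test on genomic coordinates. The following is a minimal Python sketch of that test and is not the analysis performed in [17]. The element names and coordinates are invented, and the half-open, 0-based coordinate convention is an assumption chosen for the example.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def elements_overlapping(elements: List[Interval], variant: Interval) -> List[str]:
    """Names of the elements that overlap the variant, in input order."""
    return [e.name for e in elements if overlaps(e, variant)]


# Hypothetical coordinates for illustration only; not real AUTS2 annotations.
ENHANCERS = [
    Interval("enhancer_A", 1_000, 1_500),
    Interval("enhancer_B", 4_000, 4_800),
    Interval("enhancer_C", 9_000, 9_400),
]
DELETION = Interval("noncoding_deletion", 1_400, 4_000)

if __name__ == "__main__":
    print(elements_overlapping(ENHANCERS, DELETION))
```

With half-open intervals, an element that merely abuts the deletion (enhancer_B here) is not counted as overlapping. Real annotation formats differ in their coordinate conventions, so the convention should be checked before comparing intervals.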

AUTS2 gene pathways

A 2010 study used radiation hybrid genotyping data to test for interaction of 99% of all possible gene pairs across the mammalian genome [53]. AUTS2 was the known gene with the greatest number of edges, or connectivity [53]. Despite that finding, little is known about the genetic pathways in which AUTS2 is involved. However, a few articles have provided evidence linking AUTS2 to other proteins and pathways.

One potential pathway was revealed by examining genes that can oscillate expression during somitogenesis. Two papers found that the expression of AUTS2 oscillates in phase with other notch pathway genes, suggesting that it is a component of the notch signaling pathway [54,55]. Notch signaling has been shown to be involved in neuronal migration through its interaction with Reelin, a gene implicated in ASD and a target of Tbr1 [56,57].

Although not reaching significance, a group found that Auts2 has a 1.33-fold change in cerebellar gene expression in methyl CpG binding protein 2 (Mecp2)-null mice. Loss of MECP2 function can cause neurodevelopmental disorders including Rett syndrome and autism [58]. The authors also compared their data with data generated from other gene expression studies. They found that Auts2 is consistently altered in both their datasets, as well as in post-mortem Rett syndrome patient brain, and is mutated in fibroblasts and lymphocytes [58].

Starting at mouse embryonic (E) day 12, Auts2 mRNA is expressed in the cortical preplate, where it colocalizes with Tbr1, a transcription factor that exerts positive and negative control of regional and laminar identity in postmitotic neurons [12,59]. Using Tbr1 antibodies for chromatin immunoprecipitation (ChIP) of E14.5 cortex, it was shown that the Auts2 promoter is a direct transcriptional target of Tbr1 in the developing neocortex and is involved in frontal identity [59].

SATB homeobox 2 (Satb2) is one of four genes (including Tbr1) that regulate projection identity within the layers of the mammalian cortex. In 2012 a report showed that, in mice, Tbr1 expression is dually regulated by Satb2 and B cell CLL/lymphoma 11B (Ctip2) in cortical layers 2–5. The authors also demonstrated that Satb2 regulates Auts2. They showed that, similarly to Tbr1, Auts2 is expressed in the deep and upper layers of the cortex. They investigated whether the loss of Tbr1 expression in the upper layer neurons in Satb2 mutants coincides with changes in Auts2 expression. They observed that there was a significant loss of Auts2 expression in the upper layers of Satb2 mutants, similar to the loss of Tbr1 in Satb2 mutants. The authors did not observe any changes in Auts2 expression in layers 5 or 6. Their results suggest that Satb2 regulates the expression of Tbr1, which in turn regulates Auts2 expression in callosal projection neurons [60].

GTF2I repeat domain containing 1 (GTF2IRD1) is one of 26 genes deleted in WBS, and encodes a putative transcription factor expressed throughout the brain during development. Gtf2ird1 knockout mice display reduced innate fear and increased sociability, phenotypes consistent with WBS [61]. Microarray screens were used to find transcriptional targets of Gtf2ird1 in brain tissue from Gtf2ird1 knockout mice at two timepoints – E15.5 and birth [postnatal (P) day 0] – versus wild type littermates. Auts2 was one of only two genes identified in both (E15.5 and P0) microarray experiments to be altered compared to controls. In P0 mouse brains of knockout mice, Auts2 was increased by 1.3-fold, whereas in E15.5 embryos it was decreased by 1.5-fold [62]. It is unclear if Auts2 is a target of

Table 2. auts2 morpholino knockdown phenotypes

Assay following morpholino injection (a) | Developmental phenotype | Refs
Wholemount | Overall stunted development, including smaller head and eyes (Figure 3A). Less movement when prodded. | [17]
Wholemount | Microcephaly with no overall developmental delay. | [26]
Alcian blue staining | Micrognathia (undersized jaw) and retrognathia (receded jaw) (Figure 3C). | [26]
HuC–GFP zebrafish line | Fewer developing neurons in the dorsal region of the midbrain, including the optic tectum, the midbrain-hindbrain boundary (including the cerebellum), the hindbrain and the retina (Figure 3B). | [17]
HuC/D staining | Reduction in HuC/D-positive postmitotic neurons as well as a loss of bilateral symmetry. | [26]
TUNEL staining | Increased apoptosis in the midbrain. | [17]
PCNA staining | Increased cell proliferation in the forebrain, midbrain and hindbrain. | [17]
Phosphohistone H3 | Decreased cell proliferation in the brain. | [26]
Tg(mnx1:GFP) zebrafish line | Fewer motor neuron cell bodies in the spinal cord and weaker, improperly angled projections. | [17]
HNK-1 staining | Fewer sensory neurons in the spinal cord. | [17]

(a) HNK-1, neural cell adhesion molecule 1/Ncam1 (CD57); HuC/D, Hu antigen C/D [ELAV (embryonic lethal, abnormal vision, Drosophila)-like 3/4]; mnx1, motor neuron and pancreas homeobox 1; PCNA, proliferating cell nuclear antigen; Tg, transgenic; TUNEL, terminal deoxynucleotidyl transferase dUTP nick end-labeling.


Gtf2ird1 or if this observation reflects the proximity of the two genes.

Zinc finger matrin-type 3 (Zmat3, also known as Wig1), a transcription factor regulated by p53, plays an important role in RNA protection and stabilization and, as part of the p53 pathway, is a causal factor in neurodegenerative diseases. Wig1 downregulation by antisense oligonucleotide treatment led to a significant reduction in Auts2 mRNA levels in the brains of BACHD (bacterial artificial chromosome – HD) mice, a mouse model for Huntington’s disease (HD). The authors also reported a trend toward reduced Auts2 mRNA levels in the livers of BALB/c mice but no reduction in Auts2 levels in FVB (background strain of BACHD) mouse brains [63]. These results suggest a role for Wig1 in the regulation of Auts2 expression and further link Auts2 with pathways involved in the CNS.

Polycomb repressive complex 1 (PRC1) is a polycomb group (PcG) complex that acts as a developmental regulator through transcriptional repression. It is crucial for many biological processes in mammals, including differentiation. There are six major groups of PRC1 complexes, each containing a distinct polycomb group ring finger (PCGF) subunit (PCGF1–6), a RING1 A/B ubiquitin ligase, and unique associated polypeptides. AUTS2 was recovered by tandem affinity purification of PCGF3 and PCGF5, implying a role for AUTS2 in transcriptional repression during development [64].

In 2013, the regulatory pathway for SEMA5A (semaphorin 5A), an autism candidate gene, was mapped in silico using expression quantitative trait locus (eQTL) mapping. The authors found that the SEMA5A regulatory network significantly overlaps with rare CNVs around ASD-associated genes, including AUTS2. Given the extensive trans-regulatory network associated with SEMA5A, the authors also investigated the possibility that there are several upstream master regulators that control this network. Performing eQTL mapping for expression levels of the eQTL-associated genes within the network (eQTLs of the eQTLs of SEMA5A), the authors identified 12 regions associated with the expression of 10 or more primary SEMA5A eQTL genes, including AUTS2. This study suggests that AUTS2 is involved in, and may be a master regulator of, ASD-related pathways [65].
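The final filtering step of such a second-level eQTL screen can be illustrated with a minimal sketch. It assumes the association testing has already been done and only shows how regions linked to 10 or more primary eQTL genes might be selected. The function name, the region and gene labels, and the toy data are all hypothetical and are not taken from the study in [65].

```python
from collections import defaultdict

def candidate_master_regions(associations, primary_genes, min_genes=10):
    """Group second-level eQTL hits by region and keep regions linked to
    at least `min_genes` distinct primary eQTL genes."""
    genes_by_region = defaultdict(set)
    for region, gene in associations:
        if gene in primary_genes:  # ignore genes outside the primary network
            genes_by_region[region].add(gene)
    return {
        region: sorted(genes)
        for region, genes in genes_by_region.items()
        if len(genes) >= min_genes
    }

# Toy, made-up data: 15 hypothetical primary eQTL genes.
primary_genes = {f"gene{i:02d}" for i in range(1, 16)}
associations = (
    [("regionA", f"gene{i:02d}") for i in range(1, 13)]    # 12 primary genes
    + [("regionB", f"gene{i:02d}") for i in range(1, 4)]   # 3 primary genes
    + [("regionC", f"gene{i:02d}") for i in range(6, 16)]  # 10 primary genes
    + [("regionC", "unrelated_gene")]                      # not in the network
)

hits = candidate_master_regions(associations, primary_genes)
for region, genes in sorted(hits.items()):
    print(region, len(genes))
```

In this toy input, only the two regions that reach the threshold of 10 genes are retained. A real analysis would also need significance testing and multiple-testing correction before this step.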

Concluding remarks
As we identify the genes involved in ASD, DD, and ID, our ability to genetically diagnose these disorders improves, and future screens should assess AUTS2 for potential causative CNVs. However, before we are able to use AUTS2 as a diagnostic tool we must determine what makes a CNV in or around AUTS2 causative or benign and for what disorders (e.g., ID, DD, ASD, ASD with ID/DD). This includes a deeper investigation of the regulatory network of this gene. Although not in immediate sight, a major step in developing future treatments for ASD and ASD-related phenotypes relies on a solid understanding of the pathways involved and how they interact. Multiple reports have implicated AUTS2 in addiction and other neurological phenotypes, but the mechanism and certainty of these involvements remain unclear, highlighting the need for deeper investigations into the function of this gene and its role in development and disease. Future work using an Auts2 mouse knockout should reveal greater detail of the function of this gene. In addition, genomic studies such as RNA-seq following the knockdown of this gene and chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) could identify the various gene pathways and regions of the genome with which this gene interacts. Obtaining a better understanding of the pathways associated with AUTS2 will allow us to comprehend better the biological systems that can be perturbed when the function of this gene is disrupted, as well as how nucleotide changes within the gene might have led to human-specific traits.

In summary, we can presume that this gene is involved in neurodevelopment, and may play a role in ASD and ASD-related phenotypes. There are also significant data suggesting that AUTS2 has human-specific variants that could possibly contribute to human cognition. It is important to differentiate the evolution and phenotypic data surrounding this gene. The data suggest that genes involved in human-specific cognition may also play a role in human-specific disorders of the brain.

Acknowledgments
We would like to thank Christelle Golzio, Nicholas Katsanis, and Erik A. Sistermans for sharing their work on auts2 including their morpholino results used in Figure 3C. We would also like to thank members of the Ahituv lab for helpful comments. N.A. and N.O. received support for this research from the Simons Foundation (SFARI grant 256769 to N.A.), National Human Genome Research Institute (NHGRI) grant number

Box 2. AUTS2 and human evolution

In 2006 a comparative genomics approach was used to search the human genome for regions that have significantly changed in humans in the past 5 million years, since the divergence from chimpanzees, but are highly conserved in other species [66,79]. The authors identified 202 such regions, which they termed human accelerated regions (HARs). These HARs are strong candidates for sequences responsible for the evolution of human-specific traits. An intronic region in AUTS2 (Figure 2) ranked as the 31st most accelerated region in their study. Similarly, in 2006 a different group combed the genome for conserved non-coding sequences that displayed accelerated evolution in the human lineage [67]. The authors identified 902 human accelerated conserved non-coding sequences (HACNSs). HACNSs 174 and 369 both lie within introns of AUTS2 (Figure 2). With the publication of the draft sequence of the Neanderthal genome in 2010, it was found that the first half of AUTS2 displayed the strongest statistical signal in a genomic screen differentiating modern humans from Neanderthals (Figure 2) [68]. This region contains 293 consecutive SNPs where only ancestral alleles were observed in the Neanderthals, only two of which are coding variants [a G to C non-synonymous substitution at chr7:68,702,743 (hg18) only in the Han Chinese and a C to T synonymous change at chr7:68,702,866 (hg18) within the Yoruba and Melanesian populations]. Other regions that were found to have the most significant human-Neanderthal changes also include genes that are involved in cognition and social interaction, including dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A (DYRK1A), neuregulin 3 (NRG3) and Ca2+-dependent secretion activator 2 (CADPS2) [68]. The authors conclude that multiple genes involved in cognitive development were positively selected during the evolution of modern humans [68]. Taken together, these studies suggest that significant changes in AUTS2 occurred specifically in modern humans and it is conceivable, based on the neurological role that this gene plays, that these changes could lead to cognitive traits specific to humans.


R01HG005058, National Institute of Child Health and Human Development (NICHD) grant number R01HD059862, and National Institute of Neurological Disorders and Stroke (NINDS) grant number R01NS079231. N.O. is also supported in part by a Dennis Weatherstone pre-doctoral fellowship from Autism Speaks.

References
1 Fleischhacker, W.W. and Brooks, D.J. (2006) Neurodevelopmental Disorders, Springer
2 Baio, J. et al. (2012) Prevalence of autism spectrum disorders – autism and developmental disabilities monitoring network, 14 sites, United States, 2008. MMWR Surveill. Summ. 61, 1–19
3 Risch, N. et al. (1999) A genomic screen of autism: evidence for a multilocus etiology. Am. J. Hum. Genet. 65, 493–507
4 Geschwind, D.H. (2009) Advances in autism. Annu. Rev. Med. 60, 367–380
5 Abrahams, B.S. and Geschwind, D.H. (2008) Advances in autism genetics: on the threshold of a new neurobiology. Nat. Rev. Genet. 9, 341–355
6 Sultana, R. et al. (2002) Identification of a novel gene on chromosome 7q11.2 interrupted by a translocation breakpoint in a pair of autistic twins. Genomics 80, 129–134
7 Martens, M.A. et al. (2008) Research review: Williams syndrome: a critical review of the cognitive, behavioral, and neuroanatomical phenotype. J. Child Psychol. Psychiatry 49, 576–608
8 Kalscheuer, V.M. et al. (2007) Mutations in autism susceptibility candidate 2 (AUTS2) in patients with mental retardation. Hum. Genet. 121, 501–509
9 Poeck, B. et al. (2008) Locomotor control by the central complex in Drosophila – an analysis of the tay bridge mutant. Dev. Neurobiol. 68, 1046–1058
10 Salichs, E. et al. (2009) Genome-wide analysis of histidine repeats reveals their role in the localization of human proteins to the nuclear speckles compartment. PLoS Genet. 5, e1000397
11 Lamond, A.I. and Spector, D.L. (2003) Nuclear speckles: a model for nuclear organelles. Nat. Rev. Mol. Cell Biol. 4, 605–612
12 Bedogni, F. et al. (2010) Autism susceptibility candidate 2 (Auts2) encodes a nuclear protein expressed in developing brain regions implicated in autism neuropathology. Gene Expr. Patterns 10, 9–15
13 Lepagnol-Bestel, A-M. et al. (2008) SLC25A12 expression is associated with neurite outgrowth and is upregulated in the prefrontal cortex of autistic subjects. Mol. Psychiatry 13, 385–397
14 Zhang, Y.E. et al. (2011) Accelerated recruitment of new brain development genes into the human genome. PLoS Biol. 9, e1001179
15 Wagner, A.H. et al. (2013) Exon-level expression profiling of ocular tissues. Exp. Eye Res. 111, 105–111
16 Ameur, A. et al. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nat. Struct. Mol. Biol. 18, 1435–1440
17 Oksenberg, N. et al. (2013) Function and regulation of AUTS2, a gene implicated in autism and human evolution. PLoS Genet. 9, e1003221
18 Pinto, D. et al. (2010) Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466, 368–372
19 Bakkaloglu, B. et al. (2008) Molecular cytogenetic analysis and resequencing of contactin associated protein-like 2 in autism spectrum disorders. Am. J. Hum. Genet. 82, 165–173
20 Huang, X-L. et al. (2010) A de novo balanced translocation breakpoint truncating the autism susceptibility candidate 2 (AUTS2) gene in a patient with autism. Am. J. Med. Genet. A 152A, 2112–2114
21 Glessner, J.T. et al. (2009) Autism genome-wide copy number variation reveals ubiquitin and neuronal genes. Nature 459, 569–573
22 Ben-David, E. et al. (2011) Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression. Hum. Mol. Genet. 20, 3632–3641
23 Girirajan, S. et al. (2011) Relative burden of large CNVs on a range of neurodevelopmental phenotypes. PLoS Genet. 7, e1002334
24 Talkowski, M.E. et al. (2012) Sequencing chromosomal abnormalities reveals neurodevelopmental loci that confer risk across diagnostic boundaries. Cell 149, 525–537
25 Nagamani, S.C.S. et al. (2013) Detection of copy-number variation in AUTS2 gene by targeted exonic array CGH in patients with developmental delay and autistic spectrum disorders. Eur. J. Hum. Genet. 21, 1–4
26 Beunders, G. et al. (2013) Exonic deletions in AUTS2 cause a syndromic form of intellectual disability and suggest a critical role for the C Terminus. Am. J. Hum. Genet. 92, 210–220
27 Girirajan, S. et al. (2013) Global increases in both common and rare copy number load associated with autism. Hum. Mol. Genet. 22, 2870–2880
28 Cusco, I. et al. (2009) Autism-specific copy number variants further implicate the phosphatidylinositol signaling pathway and the glutamatergic synapse in the etiology of the disorder. Hum. Mol. Genet. 18, 1795–1804
29 Tropeano, M. et al. (2013) Male-biased autosomal effect of 16p13.11 copy number variation in neurodevelopmental disorders. PLoS ONE 8, e61365
30 Jolley, A. et al. (2013) De novo intragenic deletion of the autism susceptibility candidate 2 (AUTS2) gene in a patient with developmental delay: a case report and literature review. Am. J. Med. Genet. A 161, 1508–1512
31 Redon, R. et al. (2006) Global variation in copy number in the human genome. Nature 444, 444–454
32 O’Roak, B.J. et al. (2012) Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485, 246–250
33 Sanders, S.J. et al. (2012) De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237–241
34 O’Roak, B.J. et al. (2011) Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43, 585–589
35 Chahrour, M.H. et al. (2012) Whole-exome sequencing and homozygosity analysis implicate depolarization-regulated neuronal genes in autism. PLoS Genet. 8, e1002635
36 Flatscher-Bader, T. et al. (2011) Increased de novo copy number variants in the offspring of older males. Transl. Psychiatry 1, e34
37 Arlt, M. and Ozdemir, A. (2011) Hydroxyurea induces de novo copy number variants in human cells. Proc. Natl. Acad. Sci. U.S.A. 108, 17360–17365
38 Arlt, M.F. et al. (2012) De novo CNV formation in mouse embryonic stem cells occurs in the absence of Xrcc4-dependent nonhomologous end joining. PLoS Genet. 8, e1002981
39 Sebat, J. et al. (2007) Strong association of de novo copy number mutations with autism. Science 316, 445–449
40 Wong, C. et al. (2013) Methylomic analysis of monozygotic twins discordant for autism spectrum disorder and related behavioural traits. Mol. Psychiatry http://dx.doi.org/10.1038/mp.2013.41
41 Philibert, R.A. et al. (2007) Transcriptional profiling of subjects from the Iowa adoption studies. Am. J. Med. Genet. B: Neuropsychiatr. Genet. 144B, 683–690
42 Schumann, G. et al. (2011) Genome-wide association and genetic functional studies identify autism susceptibility candidate 2 gene (AUTS2) in the regulation of alcohol consumption. Proc. Natl. Acad. Sci. U.S.A. 108, 7119–7124
43 Liao, D. et al. (2011) Comparative gene expression profiling analysis of lymphoblastoid cells reveals neuron-specific enolase gene (ENO2) as a susceptibility gene of heroin dependence. Addict. Biol. http://dx.doi.org/10.1111/j.1369-1600.2011.00390.x
44 Chen, Y-H. et al. (2013) Genetic analysis of AUTS2 as a susceptibility gene of heroin dependence. Drug Alcohol Depend. 128, 238–242
45 Hamshere, M.L. et al. (2009) Genetic utility of broadly defined bipolar schizoaffective disorder as a diagnostic concept. Br. J. Psychiatry 195, 23–29
46 Hattori, E. et al. (2009) Preliminary genome-wide association study of bipolar disorder in the Japanese population. Am. J. Med. Genet. B: Neuropsychiatr. Genet. 150B, 1110–1117
47 Lee, H. et al. (2012) A genome-wide association study of seasonal pattern mania identifies NF1A as a possible susceptibility gene for bipolar disorder. J. Affect. Disord. 145, 200–207
48 Mefford, H.C. et al. (2010) Genome-wide copy number variation in epilepsy: novel susceptibility loci in idiopathic generalized and focal epilepsies. PLoS Genet. 6, e1000962
49 Elia, J. et al. (2010) Rare structural variants found in attention-deficit hyperactivity disorder are preferentially associated with neurodevelopmental genes. Mol. Psychiatry 15, 637–646
50 Luciano, M. et al. (2011) Whole genome association scan for genetic polymorphisms influencing information processing speed. Biol. Psychol. 86, 193–202


51 Chojnicka, I. et al. (2013) Possible association between suicide committed under influence of ethanol and a variant in the AUTS2 gene. PLoS ONE 8, e57199
52 Gilbert, S.F. (2000) Developmental Biology (6th edn), Sinauer Associates
53 Lin, A. et al. (2010) A genome-wide map of human genetic interactions inferred from radiation hybrid genotypes. Genome Res. 20, 1122–1132
54 William, D. et al. (2007) Identification of oscillatory genes in somitogenesis from functional genomic analysis of a human mesenchymal stem cell model. Dev. Biol. 305, 172–186
55 Dequeant, M-L. et al. (2006) A complex oscillating network of signaling genes underlies the mouse segmentation clock. Science 314, 1595–1598
56 Hashimoto-Torii, K. et al. (2008) Interaction between Reelin and Notch signaling regulates neuronal migration in the cerebral cortex. Neuron 60, 273–284
57 Wang, G-S. et al. (2004) Transcriptional modification by a CASK-interacting nucleosome assembly protein. Neuron 42, 113–128
58 Ben-Shachar, S. et al. (2009) Mouse models of MeCP2 disorders share gene expression changes in the cerebellum and hypothalamus. Hum. Mol. Genet. 18, 2431–2442
59 Bedogni, F. et al. (2010) Tbr1 regulates regional and laminar identity of postmitotic neurons in developing neocortex. Proc. Natl. Acad. Sci. U.S.A. 107, 13129–13134
60 Srinivasan, K. et al. (2012) A network of genetic repression and derepression specifies projection fates in the developing neocortex. Proc. Natl. Acad. Sci. U.S.A. 109, 19071–19078
61 Young, E.J. et al. (2008) Reduced fear and aggression and altered serotonin metabolism in Gtf2ird1-targeted mice. Genes Brain Behav. 7, 224–234
62 O’Leary, J. and Osborne, L.R. (2011) Global analysis of gene expression in the developing brain of Gtf2ird1 knockout mice. PLoS ONE 6, e23868
63 Sedaghat, Y. et al. (2012) Genomic analysis of wig-1 pathways. PLoS ONE 7, e29429
64 Gao, Z. et al. (2012) PCGF homologs, CBX proteins, and RYBP define functionally distinct PRC1 family complexes. Mol. Cell 45, 344–356
65 Cheng, Y. et al. (2013) An eQTL mapping approach reveals that rare variants in the SEMA5A regulatory network impact autism risk. Hum. Mol. Genet. 22, 2960–2972
66 Pollard, K.S. et al. (2006) Forces shaping the fastest evolving regions in the human genome. PLoS Genet. 2, e168
67 Prabhakar, S. et al. (2006) Accelerated evolution of conserved noncoding sequences in humans. Science 314, 786
68 Green, R.E. et al. (2010) A draft sequence of the Neandertal genome. Science 328, 710–722
69 Gruszka-Westwood, A.M. et al. (2004) Comparative expressed sequence hybridization studies of high-hyperdiploid childhood acute lymphoblastic leukemia. Genes Chromosomes Cancer 41, 191–202
70 Kawamata, N. et al. (2008) Cloning of genes involved in chromosomal translocations by high-resolution single nucleotide polymorphism genomic microarray. Proc. Natl. Acad. Sci. U.S.A. 105, 11921–11926
71 Coyaud, E. et al. (2010) PAX5–AUTS2 fusion resulting from t(7;9)(q11.2;p13.2) can now be classified as recurrent in B cell acute lymphoblastic leukemia. Leuk. Res. 34, e323–e325
72 Denk, D. et al. (2012) PAX5-AUTS2: a recurrent fusion gene in childhood B-cell precursor acute lymphoblastic leukemia. Leuk. Res. 36, e178–e181
73 Lener, T. et al. (2006) Expression profiling of aging in the human skin. Exp. Gerontol. 41, 387–397
74 Weir, B. et al. (2007) Characterizing the cancer genome in lung adenocarcinoma. Nature 450, 893–898
75 Penney, K.L. et al. (2010) Genome-wide association study of prostate cancer mortality. Cancer Epidemiol. Biomarkers Prev. 19, 2869–2876
76 Sato, S. et al. (2011) Characterization of porcine autism susceptibility candidate 2 as a candidate gene for the number of corpora lutea in pigs. Anim. Reprod. Sci. 126, 211–220
77 Li, R. et al. (2012) Six novel susceptibility loci for early-onset androgenetic alopecia and their unexpected association with common diseases. PLoS Genet. 8, e1002746
78 Stadler, Z.K. et al. (2012) Rare de novo germline copy-number variation in testicular cancer. Am. J. Hum. Genet. 91, 379–383
79 Pollard, K.S. et al. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167–172
