

Infection, Genetics and Evolution 24 (2014) 92–98

Contents lists available at ScienceDirect

Infection, Genetics and Evolution

journal homepage: www.elsevier.com/locate/meegid

Frequency and distribution of simple and compound microsatellites in forty-eight Human papillomavirus (HPV) genomes

http://dx.doi.org/10.1016/j.meegid.2014.03.010
1567-1348/© 2014 Elsevier B.V. All rights reserved.

Abbreviations: HPV, Human papillomaviruses; ICTV, International Committee on the Taxonomy of Viruses; ORF, open reading frames; URR, upstream regulatory region; SSR, simple sequence repeats; NCBI, National Center for Biotechnology Information; IMEx, imperfect microsatellite extraction; RA, relative abundance; RD, relative density.
⇑ Corresponding author. Tel.: +91 11 22623503; fax: +91 11 22623504.

E-mail addresses: [email protected], [email protected] (S. Ali).
1 Both authors contributed equally.

Avadhesh Kumar Singh a,1, Chaudhary Mashhood Alam b,1, Choudhary Sharfuddin b, Safdar Ali a,⇑
a Department of Biomedical Sciences, SRCASW, University of Delhi, Vasundhara Enclave, New Delhi 110096, India
b Department of Botany, Patna University, Bihar 800005, India

Article info

Article history: Received 18 January 2014; Received in revised form 2 March 2014; Accepted 12 March 2014; Available online 21 March 2014

Keywords: Microsatellites; Human papillomavirus; Simple sequence repeats; Compound microsatellites

Abstract

Simple sequence repeats (SSRs) are tandem-repeated sequences that are ubiquitously present but differentially distributed across genomes. The present study is a systematic analysis of the incidence, composition and complexity of different microsatellites in 48 representative Human papillomavirus (HPV) genomes. The analysis revealed a total of 1868 SSRs and 120 cSSRs. Four genomes (HPV-60, HPV-92, HPV-112 and HPV-136) lacked any cSSR content, while HPV-31 accounted for a maximum of 10 cSSRs. An overall increase in cSSR% with higher dMAX was observed. The SSRs and cSSRs were prevalent in coding regions. Poly(A/T) repeats were significantly more abundant than poly(G/C) repeats, possibly due to the high (A/T) content of the HPV genomes. Further, the higher prevalence of di-nucleotide repeats over tri-nucleotide repeats may be attributed to the instability of the former because of their higher slippage rate. An in-depth study of the satellite sequences would provide an insight into the imperfections and evolution of microsatellites.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Human papillomaviruses (HPVs) are small non-enveloped viruses whose genome is a single circular molecule of double-stranded DNA of approximately 8 kb (Baker et al., 1991; Sapp et al., 1995). The International Committee on the Taxonomy of Viruses (ICTV) has classified HPVs into a distinct taxonomic family, the Papillomaviridae, distributed into five genera: Alphapapillomavirus, Betapapillomavirus, Gammapapillomavirus, Mupapillomavirus and Nupapillomavirus. HPVs are the long-sought, sexually transmitted causative agents of cervical cancer (Zur Hausen, 2009), one of the most prevalent cancers in women, with an annual estimate of 530,000 new cases and over 270,000 deaths globally; more than 85% of these deaths occur in low- and middle-income countries (WHO, 2013). Of the different HPV types, 75% are cutaneous types that cause warts on the skin, while the other 25% are mucosal types that affect mucous membranes; at least 13 of the mucosal HPVs are cancer-causing and are designated high-risk or oncogenic. HPV-16 and HPV-18 cause 70% of cervical cancers and pre-cancerous cervical lesions (WHO, 2013; Li et al., 2009).

To date, around 170 HPV types have been completely sequenced (Chouhy et al., 2013). All HPVs have the same general genome organization, which is functionally divided into three regions and typically contains seven or eight open reading frames (ORFs) (Baker et al., 1991; Sapp et al., 1995). The first, the non-coding upstream regulatory region (URR), consists of the core promoter along with enhancer and silencer sequences that regulate transcription of the ORFs (Apt et al., 1996). The second is the early (E) region, encoding the non-structural regulatory proteins E1, E2, E4, E5, E6 and E7, which are involved in viral genome replication, transcription, transformation and oncogenesis. The third is the late (L) region, encoding the structural proteins L1 (major capsid) and L2 (minor capsid) (Baker et al., 1991; Sapp et al., 1995). The URR is located between the early and the late regions (Longworth and Laimins, 2004) and contains the highest degree of genomic diversity (Apt et al., 1996). Different genotypes of HPVs are defined by variations in genomic sequence (>10%) in the E6, E7 and L1 ORFs (De Villiers, 2013; Bernard et al., 2013). The potential of high-risk HPV infections (HPV-16, HPV-18 and HPV-31) to progress to malignancy is attributed to the expression of the E6 and E7 oncogenes, owing to their strong ability to degrade the host tumor suppressors p53 and retinoblastoma (RB) protein, respectively, an ability that is lacking in low-risk HPVs (HPV-6 and HPV-11) (Scheffner et al., 1990; McLaughlin-Drubin and Münger, 2009; Howie et al., 2009).


Simple sequence repeats (SSRs), also called mini- or microsatellites, are DNA/RNA stretches in which a unit of 1–6 (or more) bp is repeated in tandem in a genome (Chen et al., 2009; Chen et al., 2010; Alam et al., 2013, 2014). These sequences are highly divergent and ubiquitously distributed in viral, prokaryotic and eukaryotic genomes, and can occupy both coding and non-coding sequences (i.e. 3′-UTR, 5′-UTR, exons and introns) (Alam et al., 2013, 2014; Mrázek et al., 2007; Tóth et al., 2000). SSRs are the product of either de novo genesis or adoptive genesis (Kim et al., 2008), and the generation and instability of SSRs are primarily due to errors of the DNA replication and/or repair machinery (Tóth et al., 2000; Katti et al., 2001). Owing to their abundance, ubiquity, simplicity, variation, multi-allelic nature among genomes and potential for abundant polymorphism, SSRs are highly regarded as a valuable source of genetic markers and genome diversity, and have been broadly applied in various areas, including determination of evolutionary relationships, comparative genome analyses and establishment of genetic maps (Pearson et al., 2005; Kashi and King, 2006; Deback et al., 2009). Variation in microsatellite length affects local DNA structure or the encoded proteins (Mrázek et al., 2007), thereby having implications for gene regulation, transcription and protein function (Kashi and King, 2006; Usdin, 2008). Also, though not universally, genome features such as size and GC content influence the occurrence of microsatellites (Dieringer and Schlötterer, 2003; Coenye and Vandamme, 2005) and the polymorphism therein (Kelkar et al., 2008). Presumably because SSRs affect the regulation of gene activity, chromatin organization, DNA replication, recombination, cell cycle and mismatch repair, their genomic distribution is non-random (Li et al., 2004).

Interruptions between two or more microsatellites define their different types, such as interrupted, pure, compound, interrupted compound, complex and interrupted complex (Chambers and MacAvoy, 2000). This study primarily focuses on pure and compound microsatellites (cSSRs, two or more microsatellites adjacent to each other). Interestingly, they are more abundant in coding regions than in non-coding regions in eukaryotes (Tóth et al., 2000; Metzgar et al., 2000) and in some prokaryotes (Li et al., 2004), possibly due to increased selection in coding regions (Ellegren, 2004; Karaoglu et al., 2005), and in viruses due to high coding density (Chen et al., 2009; Alam et al., 2014). cSSRs accounted for 4–25% of the microsatellites in the genomes of Homo sapiens, Macaca mulatta, Mus musculus and Rattus norvegicus, and included some highly polymorphic compound repeats such as (dC–dA)n(dG–dT)n (Weber, 1990; Bull et al., 1999; Kofler et al., 2008). Furthermore, Escherichia coli genomes had a cSSR frequency of 1.75–2.85%, while HIV type-1 genomes had up to 24.24% cSSRs, indicating variation across genomes (Chen et al., 2012). An in-depth study of the diversification of satellite sequences would provide insight into the imperfections and evolution of microsatellites.

Although there is accumulating evidence confirming the role of microsatellites in generating genomic diversity, evolutionary relationships and phenotypic changes, such studies are scarce for HPVs. Here, we systematically analyzed the incidence, composition and complexity of different microsatellites in HPV genomes, which may help in understanding their functional aspects and adaptation to the hosts.

2. Materials and methods

2.1. HPV genome sequences

The whole-genome sequences of 48 randomly selected HPVs were accessed from the National Center for Biotechnology Information (NCBI) GenBank database (http://www.ncbi.nlm.nih.gov/) and exhaustively analyzed for simple and compound microsatellites. These represented all five genera as follows: Alphapapillomavirus (N = 14), Betapapillomavirus (N = 5), Gammapapillomavirus (N = 19), Mupapillomavirus (N = 2) and Nupapillomavirus (N = 1), and also unclassified HPVs (N = 7). Genome sizes of the analyzed HPVs ranged from 7100 nucleotides (HPV-48; accession no. U31789) to 8033 nucleotides (HPV-90; accession no. AY057438). Relevant features of these genomes are summarized in Table 1.

2.2. Retrieving microsatellites and their analyses

A whole-genome search for the distribution of simple and compound microsatellites was performed using the IMEx software (Mudunuri and Nagarajaram, 2007). Previous reports on eukaryotes and E. coli considered microsatellites with lengths of 12 nucleotides or more (Tóth et al., 2000), but HPV genomes did not yield any results with those parameters, possibly due to their smaller genome size. Therefore, IMEx was run in the 'Advanced Mode', as previously reported for HIV (Chen et al., 2012), tobamovirus (Alam et al., 2013) and carlavirus (Alam et al., 2014) genomes. Briefly, the parameters were set as follows: Type of Repeat: perfect; Repeat Size: all; Minimum Repeat Number: 6, 3, 3, 3, 3, 3 (for mono- to hexa-nucleotide repeats, respectively); Maximum distance allowed between any two SSRs (dMAX): 10 bp (10–50 bp was used for seven randomly selected HPV genomes). The other parameters were left at their defaults. cSSRs were not standardized in order to determine their real composition.
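The search parameters above can be illustrated with a minimal Python sketch. It is not the IMEx algorithm; it is a naive scan written under the assumption that a perfect SSR is a primitive motif of 1–6 nt repeated in tandem at least the minimum number of times listed above. The function names and the toy sequence are hypothetical.

```python
# Minimum repeat number for mono- to hexa-nucleotide motifs (6, 3, 3, 3, 3, 3).
MIN_REPEATS = {1: 6, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a tandem repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_perfect_ssrs(seq):
    """Return (start, end, motif, repeats) tuples, 0-based half-open coordinates."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            # Skip ambiguous bases and non-primitive motifs such as "AA" or "ATAT".
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            j = i + k
            while seq[j:j + k] == motif:
                j += k
            repeats = (j - i) // k
            if repeats >= min_rep:
                hits.append((i, j, motif, repeats))
                i = j  # continue after the tract just reported
            else:
                i += 1
    return sorted(hits)


# Toy sequence (hypothetical): one A6 tract and one (AT)3 tract.
print(find_perfect_ssrs("GGAAAAAACCATATATGG"))
# → [(2, 8, 'A', 6), (10, 16, 'AT', 3)]
```

A real extraction tool may resolve overlapping tracts and motif phase differently, so counts from this sketch should not be expected to reproduce the values in Table 1.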

2.3. Statistical analysis

All simple mathematical calculations were performed using Microsoft Office Excel 2010. The Pearson correlation coefficient (r) was calculated using GraphPad Prism, version 5 (La Jolla, CA, USA), to evaluate the influence, if any, of genome size and GC content on SSRs and cSSRs. A P-value < 0.05 was considered significant.
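For readers who want to reproduce the correlation step without GraphPad Prism, the Pearson coefficient can be sketched in plain Python as below. This is an illustration of the standard formula, not the authors' workflow; the function name `pearson_r` is hypothetical, only the first five rows of Table 1 are used as sample input, and no P-value is computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: raises ZeroDivisionError if either input has zero variance.
    return cov / math.sqrt(var_x * var_y)


# First five rows of Table 1 (HPV-1, -2, -4, -5, -6), for illustration only.
genome_size = [7815, 7860, 7353, 7746, 7902]
ssr_count = [36, 50, 32, 41, 53]
r = pearson_r(genome_size, ssr_count)
print(f"r = {r:.4f}")
```

Passing all 48 genome sizes and SSR counts from Table 1 would be the natural way to check the reported coefficients.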

3. Results

3.1. Occurrence of SSRs

The analysis revealed a total of 1868 SSRs unevenly distributed across all HPV types included in this study (Table 1, Supplementary Table 1, Fig. 1). The number of SSRs per genome ranged from 26 in HPV-131 (accession no. GU117631) to 66 in HPV-31 (accession no. J04353) (Table 1, Fig. 1A). The relative abundance (RA) of SSRs was highly variable, ranging from 3.62 SSRs/kb (HPV-131) to 8.34 SSRs/kb (HPV-31) (Table 1, Fig. 1B). Likewise, the relative density (RD) varied from 23.95 bp/kb (HPV-131) to 59.15 bp/kb (HPV-31) (Table 1, Fig. 1C).

3.2. Occurrence of cSSRs

The investigation of HPV genomes revealed a total of 120 cSSRs. Despite a high incidence of SSRs, four genomes (HPV-60, HPV-92, HPV-112 and HPV-136) lacked any cSSR content, whereas HPV-31 accounted for a maximum of 10 cSSRs in its genome (Table 1, Supplementary Table 2, Fig. 1A). cSSRs in the genome of HPV-31 exhibited the maximum RA (1.26 cSSRs/kb) and RD (27.3 bp/kb), whereas the four genomes lacking cSSRs (HPV-60, HPV-92, HPV-112 and HPV-136) had an RA and RD of zero (Table 1, Supplementary Table 2, Fig. 1B–C). The percentage of individual microsatellites (SSRs) being part of a cSSR (i.e. cSSR%) was zero in HPV-60, HPV-92, HPV-112 and HPV-136 (29, 29, 30 and 37 SSRs, respectively), while it was the highest (15.15%) in HPV-31 (66 SSRs) (Table 1, Supplementary Table 2, Fig. 1D).
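The summary metrics used in Sections 3.1 and 3.2 can be written out as a short Python sketch. The function names are hypothetical; the formulas follow the definitions given with Table 1 and Fig. 1, and the tabulated cSSR% values agree with cSSR count divided by SSR count (for example 10/66 for HPV-31).

```python
def relative_abundance(count, genome_bp):
    """Number of SSRs (or cSSRs) per kb of genome."""
    return count / (genome_bp / 1000)


def relative_density(total_repeat_bp, genome_bp):
    """Total length (bp) covered by SSRs (or cSSRs) per kb of genome."""
    return total_repeat_bp / (genome_bp / 1000)


def cssr_percent(n_cssr, n_ssr):
    """cSSR count as a percentage of the SSR count."""
    return 100 * n_cssr / n_ssr if n_ssr else 0.0


# HPV-31 values from Table 1: 7912 bp genome, 66 SSRs, 10 cSSRs.
print(round(relative_abundance(66, 7912), 2))  # 8.34
print(round(relative_abundance(10, 7912), 2))  # 1.26
print(round(cssr_percent(10, 66), 2))          # 15.15
```

`relative_density` needs the summed tract length in bp, which is not listed in Table 1, so it is shown without a worked value.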


Table 1
Overview of simple and compound microsatellites in HPV genome sequences.

S. No.  HPV type  Accession no.  Genome size (bp)  GC content (%)  SSR(a)  RA(b)  RD(c)   cSSR(a)  cRA(b)  cRD(c)  cSSR%(d)  Di(e)  Tri(e)
P1      HPV-1     V01116         7815              40.28           36      4.61   30.71   2        0.26    4.48    5.56      2      0
P2      HPV-2     X55964         7860              48.4            50      6.36   44.02   4        0.51    8.40    8.00      4      0
P3      HPV-4     X70827         7353              38.53           32      4.35   30.06   2        0.27    8.16    6.25      1      1 (Tetra)
P4      HPV-5     M17463         7746              42.38           41      5.29   37.57   3        0.39    7.10    7.32      3      0
P5      HPV-6     X00203         7902              40.86           53      6.71   46.44   3        0.38    6.96    5.66      3      0
P6      HPV-7     X74463         8027              39.52           55      6.85   46.59   2        0.25    7.23    3.64      1      1 (Penta)
P7      HPV-9     X74464         7434              40.99           34      4.57   33.76   2        0.27    4.98    5.88      2      0
P8      HPV-10    X74465         7919              45.86           48      6.06   43.19   4        0.51    9.72    8.33      4      0
P9      HPV-16    K02718         7904              36.51           48      6.07   42.89   2        0.25    9.24    4.17      1      1 (Hexa)
P10     HPV-18    X05015         7857              40.44           39      4.96   34.36   2        0.25    4.20    5.13      2      0
P11     HPV-26    X74472         7855              38.6            56      7.13   50.16   4        0.51    7.89    7.14      4      0
P12     HPV-31    J04353         7912              37.11           66      8.34   59.15   10       1.26    27.3    15.15     8      2
P13     HPV-32    X74475         7961              40.97           57      7.16   49.74   8        1.00    17.96   14.04     7      1
P14     HPV-34    X74476         7723              38.24           54      6.99   47.78   7        0.91    16.31   12.96     7      0
P15     HPV-41    X56147         7614              46.93           33      4.33   30.47   2        0.26    4.33    6.06      2      0
P16     HPV-48    U31789         7100              36.76           28      3.94   26.20   1        0.14    3.24    3.57      1      0
P17     HPV-49    X74480         7560              41.11           31      4.10   31.61   3        0.40    9.52    9.68      2      1
P18     HPV-50    U31790         7184              36.83           27      3.76   25.19   1        0.14    3.34    3.70      1      0
P19     HPV-53    X74482         7856              40.13           65      8.27   55.50   7        0.89    15.78   10.77     6      1
P20     HPV-54    U37488         7759              41.86           55      7.09   48.59   6        0.77    14.05   10.91     6      0
P21     HPV-60    U31792         7313              36.96           29      3.97   27.21   0        0.00    0.00    0.00      0      0
P22     HPV-61    U31793         7989              40.3            50      6.26   45.94   6        0.75    13.89   12.00     6      0
P23     HPV-63    X70828         7348              40.43           31      4.22   29.94   2        0.27    5.17    6.45      2      0
P24     HPV-88    EF467176       7326              40.12           34      4.64   34.53   2        0.27    3.82    5.88      2      0
P25     HPV-90    AY057438       8033              46.67           45      5.60   38.84   3        0.37    7.72    6.67      3      0
P26     HPV-92    AF531420       7461              39.97           29      3.89   25.87   0        0.00    0.00    0.00      0      0
P27     HPV-96    AY382779       7438              40.31           32      4.30   29.04   1        0.13    2.82    3.13      1      0
P28     HPV-101   DQ080081       7259              43.11           30      4.13   28.52   1        0.14    2.34    3.33      1      0
P29     HPV-103   DQ080078       7263              41.57           38      5.23   38.28   3        0.41    7.43    7.89      3      0
P30     HPV-108   FM212639       7149              42.62           39      5.46   37.77   2        0.28    5.87    5.13      2      0
P31     HPV-109   EU541441       7346              38.29           38      5.17   37.98   1        0.14    2.86    2.63      1      0
P32     HPV-112   EU541442       7227              37.53           30      4.15   27.95   0        0.00    0.00    0.00      0      0
P33     HPV-116   FJ804072       7184              38.52           29      4.04   28.54   1        0.14    3.34    3.45      1      0
P34     HPV-121   GQ845443       7342              37.74           33      4.49   32.28   2        0.27    4.36    6.06      2      0
P35     HPV-126   AB646346       7326              38.04           37      5.05   36.72   1        0.14    2.87    2.70      1      0
P36     HPV-127   HM011570       7181              36.99           29      4.04   29.66   1        0.14    2.23    3.45      1      0
P37     HPV-128   GU225708       7259              35.98           33      4.55   31.27   1        0.14    2.48    3.03      1      0
P38     HPV-129   GU233853       7219              37.29           48      6.65   48.34   2        0.28    6.10    4.17      2      0
P39     HPV-131   GU117631       7182              37.04           26      3.62   23.95   1        0.14    2.78    3.85      1      0
P40     HPV-132   GU117632       7125              37.94           28      3.93   26.81   3        0.42    8.98    10.71     3      0
P41     HPV-134   GU117634       7309              38.14           32      4.38   29.55   1        0.14    2.74    3.13      1      0
P42     HPV-135   HM999987       7293              36.84           34      4.66   33.59   3        0.41    9.60    8.82      2      1
P43     HPV-136   HM999988       7319              38.52           37      5.06   34.84   0        0.00    0.00    0.00      0      0
P44     HPV-137   HM999989       7236              37.58           31      4.28   30.68   1        0.14    2.63    3.23      1      0
P45     HPV-140   HM999992       7341              39.72           34      4.63   31.88   1        0.14    3.00    2.94      1      0
P46     HPV-144   HM999996       7271              38.23           38      5.23   35.62   1        0.14    1.51    2.63      1      0
P47     HPV-148   GU129016       7164              37.42           32      4.47   30.57   2        0.28    5.16    6.25      2      0
P48     HPV-166   JX413104       7212              38.28           34      4.71   32.31   3        0.42    8.04    8.82      3      0

a Number of simple/compound microsatellites.
b Relative abundance: number of simple/compound microsatellites present per kb of the genome.
c Relative density: total length (bp) contributed by simple/compound microsatellites per kb of sequence analyzed.
d cSSR% is the percentage of individual microsatellites being part of a compound microsatellite.
e Compound microsatellite complexity (number of individual microsatellites in a compound microsatellite).


3.3. Varying dMAX and cSSR incidence

dMAX is defined as the maximum distance (threshold) between any two simple microsatellites (SSRs) for them to be considered a potential cSSR (Kofler et al., 2008). Two or more SSRs are recognized as a single cSSR if the distance between them is ≤ dMAX. Notably, the value of dMAX can only be set between 0 and 50 in the IMEx software (Mudunuri and Nagarajaram, 2007). To determine the impact of dMAX, seven randomly selected HPV genomes (HPV-1, HPV-10, HPV-48, HPV-88, HPV-112, HPV-132 and HPV-166) were analyzed for cSSR content with increasing dMAX. As expected, we observed an overall increase in cSSR% with higher dMAX for all seven HPV genomes analyzed. However, cSSR% was constant in HPV-10 and HPV-48 for dMAX 10–30, and in HPV-166 for dMAX 30–50 (Fig. 2).
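The dMAX rule can be sketched as a simple grouping pass over sorted SSR coordinates. This is an illustration under stated assumptions rather than the IMEx implementation: the gap is taken as the distance from the end of one SSR to the start of the next (0-based, half-open coordinates), and both the function name `group_compound` and the coordinates are hypothetical.

```python
def group_compound(ssrs, dmax):
    """Group SSRs (start, end) whose gap to the previous SSR is <= dmax.

    Returns only groups with two or more SSRs, i.e. candidate cSSRs.
    """
    groups, current = [], []
    for ssr in sorted(ssrs):
        if current and ssr[0] - current[-1][1] <= dmax:
            current.append(ssr)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [ssr]
    if len(current) > 1:
        groups.append(current)
    return groups


# Hypothetical SSR coordinates; gaps between neighbours are 4, 44, 34 and 37 bp.
ssrs = [(0, 6), (10, 16), (60, 66), (100, 108), (145, 151)]
for dmax in (10, 40, 50):
    cssrs = group_compound(ssrs, dmax)
    in_cssr = sum(len(g) for g in cssrs)
    print(dmax, len(cssrs), in_cssr)
```

In this toy case the number of SSRs that fall inside a cSSR never decreases as dMAX grows, while the number of cSSRs can drop when neighbouring groups merge, and both stay unchanged over ranges of dMAX that cross no gap.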

3.4. Genomic parameters and distribution of SSRs/cSSRs

We assessed the possible influence of genome size and GC content on the number, RA and RD of SSRs and cSSRs. The genome size of the assessed HPV types had a strong positive influence on the number of SSRs (r = 0.7903; P < 0.0001), their RA (r = 0.7223; P < 0.0001) and their RD (r = 0.7134; P < 0.0001). Genome size showed a similar influence on the number of cSSRs (r = 0.6307; P < 0.0001), and on their RA (r = 0.6048; P < 0.0001) and RD (r = 0.5967; P < 0.0001). In contrast, GC content in the assessed HPV genomes had no significant correlation with the number of SSRs (r = 0.2388; P = 0.1021), their RA (r = 0.1959; P = 0.1820) or RD (r = 0.2015; P = 0.1697), nor with the number of cSSRs (r = 0.2096; P = 0.1518), their RA (r = 0.2002; P = 0.1724) or RD (r = 0.1305; P = 0.3767). The percentage of individual SSRs being part of a cSSR (cSSR%) was positively and significantly correlated with genome size (r = 0.4974; P = 0.0003), but not with GC content (r = 0.2384; P = 0.1027), in the studied HPV genomes.

Fig. 1. Analysis of SSRs and cSSRs in HPV genomes. (A) Incidence. (B) Relative abundance: SSRs/cSSRs present per kb of genome. (C) Relative density: total length covered by SSRs/cSSRs per kb of genome. (D) cSSR%: (number of cSSRs/total number of SSRs) × 100.

3.5. Preferential motif types in HPV genomes

HPV genomes were further analyzed to determine any preferential bias towards specific microsatellite motifs. Though mono-nucleotide microsatellites were present in all the HPV genomes analyzed, their frequencies varied from 6 in HPV-32 to a maximum of 20 in HPV-31. Interestingly, poly(A) and poly(T) microsatellites were more prevalent than poly(G) and poly(C) microsatellites (Fig. 3). This might be attributed to the A/T-rich nature of the HPV genomes. Poly(A) microsatellites varied from 1 (HPV-41 and HPV-48) to 9 (HPV-31 and HPV-128), and poly(T) microsatellites varied from 0 (HPV-132) to 10 (HPV-31 and HPV-53) (Table 1).

Further, we analyzed the occurrence of six di-nucleotide microsatellites: AT/TA, AC/CA, AG/GA, CT/TC, CG/GC and GT/TG. Their incidence varied across the HPV genomes. AT/TA was the most abundant motif, whereas the least represented motif, CG/GC, was ~20 times less frequent (Fig. 3, Table 1).
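The six di-nucleotide classes pair each motif with its phase-shifted form (for example, a TA tract is an AT tract read one base later). A minimal Python sketch of this grouping is given below; the function name `di_class` and the motif list are hypothetical, and the labels simply follow the class names used above.

```python
from collections import Counter


def di_class(motif):
    """Map a di-nucleotide motif to its class label, e.g. "TA" -> "AT/TA"."""
    motif = motif.upper()
    if len(motif) != 2 or motif[0] == motif[1]:
        raise ValueError("expected a di-nucleotide motif with two different bases")
    return "/".join(sorted((motif, motif[::-1])))


# Hypothetical list of di-nucleotide SSR motifs reported for one genome.
motifs = ["AT", "TA", "TA", "CA", "GC", "TG"]
counts = Counter(di_class(m) for m in motifs)
print(counts.most_common())
```

Counting by class rather than by raw motif avoids splitting one repeat type across two labels when tracts start in different phases.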

Tri-nucleotide repeats were the third most abundant microsatellites in the HPV genomes included in this study. Of the 64 triplet repeat types, GAG, coding for glutamic acid, was the most frequent, followed by AGA, coding for arginine (Fig. 3, Table 1). In addition, 10 different types of tetra-nucleotide microsatellite motifs occurred in nine different HPV genomes, with the HPV-16 genome harboring two such motifs. While no penta-nucleotide motif was observed, seven hexa-nucleotide motifs were distributed across five HPV genomes.

Fig. 2. Frequency of cSSR% (percentage of individual microsatellites being part of a compound microsatellite) in relation to varying dMAX (10–50) across seven randomly selected HPV genomes.

Fig. 3. Differential composition of mono-nucleotide repeat, di-nucleotide repeat and tri-nucleotide repeat motifs.

Fig. 4. SSRs and cSSRs in coding and non-coding regions. (A) Comparative distributions of SSRs and cSSRs across coding and non-coding regions of HPV genomes. (B) Nucleotide repeat motifs in coding and non-coding regions. (C) Differential contribution of mono-, di- and tri-nucleotide SSR motifs across different protein ORFs.

3.6. SSR/cSSR distributions in coding/non-coding regions

The distribution of SSRs and cSSRs showed a predominant bias towards the coding regions of HPV genomes, which accounted for ~80% of SSRs and ~67% of cSSRs. Of these, ~21% of SSRs were present in the E1 protein region, followed by the E2 (~15%), L2 (~15%) and L1 (~14%) protein regions. The occurrence of cSSRs exhibited the same pattern: E1 (~18%), E2 (~14%), L2 (~10%) and L1 (~9%) (Fig. 4A).

Further, we compared the distribution of microsatellite motifs in coding vs. non-coding regions. Of the mono-, di- and tri-nucleotide repeat motifs, di-nucleotide motifs dominated in both coding (~48%) and non-coding (~70%) regions, followed by mono-nucleotide (~32% and ~22%, respectively) and tri-nucleotide motifs (~20% and ~8%, respectively) (Fig. 4B). Mono-nucleotide motifs were prevalent in the E1 protein of HPVs, followed by the L2 and E2 proteins; di-nucleotide motifs were frequent in the non-coding region, followed by the E1 and E2 proteins. Tri-nucleotide motifs were most frequent in the E2, E1 and E4 proteins (Fig. 4C).

4. Discussion

Different HPV genotypes and variants inhabited the earth before the emergence of the human species (Bernard et al., 2006). Relying on the high fidelity of the human proofreading machinery, HPVs have maintained their basic genomic organization for more than 100 million years and have remained stable over time, without major unexpected variations (Bernard et al., 2006; Xi et al., 2006). HPV genotypes, unlike other viruses, have evolved very slowly, diverging at an estimated rate of 10⁻⁸ base substitutions per site per year (Chen et al., 2009). To estimate the clinical course of HPV infection and to devise disease management strategies, a detailed study of the evolution of HPV genotype variations over time is needed. In this regard, and owing to their high mutability, microsatellites may be a better choice for studying genome evolution (Madsen et al., 2008).

In our study, we looked into the occurrence, abundance and composition of SSR and cSSR tracts across 48 HPV genomes. A total of 1868 SSRs and 120 cSSRs were extracted, and they were distributed unevenly across the HPV genomes analyzed here. SSR incidence in HPV genomes correlated positively with genome size, with the HPVs having 26–66 SSRs, higher than HIV isolates (22–48 SSRs) (Chen et al., 2009), carlaviruses (18–42 SSRs) (Alam et al., 2014) or tobamoviruses (11–36 SSRs) (Alam et al., 2013). Notably, the genome size of HPV types positively influenced the RA and RD of both SSRs and cSSRs, while GC content had no significant impact.

The analysis of cSSRs revealed some interesting results. These compound microsatellites are reportedly involved in the regulation of gene expression and in protein function in several species (Kashi and King, 2006; Chen et al., 2011). Although their significance in HPVs is not clear, their presence suggests possible complex regulation at the functional level. Further, the analysis of dMAX (10–50 bp) showed that cSSR% in the seven analyzed HPV types increased with increasing dMAX, though not in a consistently linear way, reflecting the distribution pattern of SSRs along each HPV genome. In HPV-1 and HPV-112, two SSRs are closely located; in HPV-10 and HPV-48, however, they are located far apart (>30 bp) and only become part of a cSSR at dMAX 40 bp, and in HPV-166 they are separated by more than 50 bp. Approximately 92% of the extracted cSSRs consisted of two motifs only. The largest compound microsatellite in HPVs was composed of six SSRs (HPV-16), whereas it comprises more than eight in many eukaryotic species. In general, the number of compound microsatellites decreases with increasing complexity. Interestingly, cSSR% ranged between 0% and 15.15% in HPV genomes, compared with 0–24.24% in HIV-1 genomes, 4–25% in eight eukaryotic genomes (Kofler et al., 2008) and 1.75–2.85% in E. coli genomes (Kruglyak et al., 2000). The distribution of microsatellites in viral genomes is organism-specific rather than host-specific. This is supported by the fact that the taxonomy of HPVs shows no comparable congruence with host taxonomy, and species from the same lineage may have quite unrelated hosts (Gibbs et al., 2008). Accordingly, we observed that viruses infecting a common host do not possess similar numbers and types of cSSR motifs in their genomes (data not shown). Surprisingly, 44 HPV types possess cSSRs (all except the four that lack any cSSR), showing the importance of cSSRs in genotype divergence. Owing to their higher polymorphism, cSSRs have a greater ability to alter gene function than single microsatellites (Chen et al., 2012).

Interestingly, our preliminary data did not show any correlation between the occurrence of microsatellites and low-risk (HPV-6) or high-risk (HPV-16, HPV-18 and HPV-31) HPVs, or between cutaneous and mucosal HPV types. HPV-31 exhibited the maximum numbers of SSRs and cSSRs, and high RD, RA and cSSR%, whereas the corresponding values for HPV-16 and HPV-18 were even lower than those for HPV-6. Detailed analyses of the molecular features and mode of infection may explain the differences between these HPV types.

Poly(A/T) repeats were significantly more prevalent than poly(G/C) repeats in HPV genomes, consistent with eukaryotic and prokaryotic genomes (Tóth et al., 2000; Karaoglu et al., 2005), which can be attributed to the high (A/T) content of the HPV genomes (Karaoglu et al., 2005). The prevalence of di-nucleotide repeats over tri-nucleotide repeats may be attributed to the instability of the former because of their higher slippage rate (Katti et al., 2001), and suggests a possible role of hosts in the evolution of di-nucleotide repeats within HPV types. Therefore, the occurrence of diverse types of repeats observed in HPV genomes facilitates genome evolution.

Most of the extracted SSRs and cSSRs were located in coding regions of HPV genomes, particularly in the E1 and E2 coding sequences (Fig. 4A). Di-nucleotide repeat motifs predominated in both coding and non-coding regions (Fig. 4B); however, mono-nucleotide and di-nucleotide repeat motifs preferentially occupied the E1 coding sequence and the non-coding regions, respectively (Fig. 4C), while tri-nucleotide motifs occupied the E2 coding sequence. Overall, E1, E2 and non-coding regions were the preferred locations for the SSRs and cSSRs. Presumably, this indicates that HPVs exploit variations in the E1 and E2 proteins to facilitate their entry and replication in the host genome, while keeping their oncogenicity (i.e. E6 and E7) intact.
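
A tally of repeats by genomic region, of the kind summarised in Fig. 4, can be sketched as follows. The region coordinates and SSR spans below are hypothetical, the genome is treated as linear (wrap-around on the circular genome is ignored), and a repeat lying in two overlapping open reading frames is counted once for each; all of these are assumptions of the example rather than the procedure used in the study.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]   # (name, start, end), 1-based inclusive
Span = Tuple[int, int]          # (start, end) of one SSR or cSSR


def tally_by_region(spans: List[Span], regions: List[Region]) -> Dict[str, int]:
    """Count repeats fully contained in each region; the rest are 'non-coding'."""
    tally: Dict[str, int] = {}
    for start, end in spans:
        hits = [name for name, r_start, r_end in regions
                if r_start <= start and end <= r_end]
        # A repeat with no containing region is assigned to 'non-coding'.
        for name in hits or ["non-coding"]:
            tally[name] = tally.get(name, 0) + 1
    return tally


# Hypothetical annotation and repeat positions, for illustration only.
regions = [("E1", 100, 2000), ("E2", 1900, 3000), ("L1", 5000, 6500)]
spans = [(150, 160), (1950, 1958), (2500, 2510), (4000, 4008), (7000, 7007)]
tally = tally_by_region(spans, regions)
print(tally)
```

Because overlapping reading frames are counted once per frame, the per-region totals in this sketch can exceed the number of repeats; a real analysis would need to state how such cases are resolved.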

Though SSRs have been used as a tool to study genome evolution, their efficacy for understanding the vastly and rapidly evolving viral genomes remains largely unexplored, primarily due to inadequate information. The studies by our group are an attempt to build such a viral genome database. Also, the distribution patterns of SSRs can be the basis for identifying and classifying new viruses and their genomes.

A complete understanding of the functional and evolutionary role of tandem repeat sequences in viruses is still elusive. However, their ubiquitous presence, though with varying frequency and complexity across species as well as across coding and non-coding regions, suggests that they are involved in the already established roles of gene regulation and recombination hot spots. Moreover, owing to the smaller genome size of viruses as compared to prokaryotes and eukaryotes, these tandem repeats are probably more influential in guiding the evolution of viruses. The diversity of microsatellites in HPV genomes may be useful for better understanding of their genetic diversity, evolutionary biology, and strain/genotype demarcations.

98 A.K. Singh et al. / Infection, Genetics and Evolution 24 (2014) 92–98

5. Conclusions

HPVs are a group of deadly viruses, and causative agents of several diseases including cervical cancer. Because of their genomic diversity, HPVs interact differently with host-cellular mechanisms, and this interaction may have a significant impact on the clinical course of HPV-driven diseases. Owing to the importance of microsatellites in genomic diversity, we set out to investigate the incidence, composition and complexity of different microsatellites in 48 representative HPV genomes. These microsatellites were preferentially located in the coding regions, particularly in the E1 and E2 coding sequences. Poly(A/T) repeat motifs were significantly more abundant than the others. Repeat motifs were more prevalent in the E1 and E2 coding sequences than in other protein-coding sequences. Occurrence of microsatellites in HPVs was positively correlated with their genome size.

Acknowledgements

We thank the Department of Biomedical Sciences, Shaheed Rajguru College of Applied Sciences for Women, University of Delhi, New Delhi, India and the Department of Botany, Patna University, Bihar, India for all the financial and infrastructural support provided for the study.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.meegid.2014.03.010.

References

Alam, C.M., Singh, A.K., Sharfuddin, C., Ali, S., 2013. In-silico analysis of simple and imperfect microsatellites in diverse tobamovirus genomes. Gene 530, 193–200.

Alam, C.M., Singh, A.K., Sharfuddin, C., Ali, S., 2014. Genome-wide scan for analysis of simple and imperfect microsatellites in diverse carlaviruses. Infect. Genet. Evol. 21, 287–294.

Alam, C.M., Singh, A.K., Sharfuddin, C., Ali, S., 2014. Incidence, complexity and diversity of simple sequence repeats across potexvirus genomes. Gene 537, 189–196.

Apt, D., Watts, R.M., Suske, G., Bernard, H.U., 1996. High Sp1/Sp3 ratios in epithelial cells during epithelial differentiation and cellular transformation correlate with the activation of the HPV-16 promoter. Virology 224, 281–291.

Baker, T.S., Newcomb, W.W., Olson, N.H., Cowsert, L.M., Olson, C., et al., 1991. Structures of bovine and human papillomaviruses. Analysis by cryoelectron microscopy and three-dimensional image reconstruction. Biophys. J. 60, 1445–1456.

Bernard, H.-U., Calleja-Macias, I.E., Dunn, S.T., 2006. Genome variation of human papillomavirus types: phylogenetic and medical implications. Int. J. Cancer J. Int. Cancer 118, 1071–1076.

Bernard, E., Pons-Salort, M., Favre, M., Heard, I., Delarocque-Astagneau, E., et al., 2013. Comparing human papillomavirus prevalences in women with normal cytology or invasive cervical cancer to rank genotypes according to their oncogenic potential: a meta-analysis of observational studies. BMC Infect. Dis. 13, 373.

Bull, L.N., Pabón-Peña, C.R., Freimer, N.B., 1999. Compound microsatellite repeats: practical and theoretical features. Genome Res. 9, 830–838.

Chambers, G.K., MacAvoy, E.S., 2000. Microsatellites: consensus and controversy. Comp. Biochem. Physiol. B Biochem. Mol. Biol. 126, 455–476.

Chen, M., Tan, Z., Jiang, J., Li, M., Chen, H., et al., 2009. Similar distribution of simple sequence repeats in diverse completed human immunodeficiency virus type 1 genomes. FEBS Lett. 583, 2959–2963.

Chen, M., Tan, Z., Zeng, G., Peng, J., 2010. Comprehensive analysis of simple sequence repeats in pre-miRNAs. Mol. Biol. Evol. 27, 2227–2232.

Chen, M., Zeng, G., Tan, Z., Jiang, M., Zhang, J., et al., 2011. Compound microsatellites in complete Escherichia coli genomes. FEBS Lett. 585, 1072–1076.

Chen, M., Tan, Z., Zeng, G., Zeng, Z., 2012. Differential distribution of compound microsatellites in various human immunodeficiency virus type 1 complete genomes. Infect. Genet. Evol. 12, 1452–1457.

Chouhy, D., Bolatti, E.M., Pérez, G.R., Giri, A.A., 2013. Analysis of the genetic diversity and phylogenetic relationships of putative human papillomavirus types. J. Gen. Virol. 94, 2480–2488.

Coenye, T., Vandamme, P., 2005. Characterization of mononucleotide repeats in sequenced prokaryotic genomes. DNA Res. Int. J. Rapid Pub. Rep. Genes Genomes 12, 221–233.

De Villiers, E.-M., 2013. Cross-roads in the classification of papillomaviruses. Virology 445, 2–10.

Deback, C., Boutolleau, D., Depienne, C., Luyt, C.E., Bonnafous, P., et al., 2009. Utilization of microsatellite polymorphism for differentiating herpes simplex virus type 1 strains. J. Clin. Microbiol. 47, 533–540.

Dieringer, D., Schlötterer, C., 2003. Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res. 13, 2242–2251.

Ellegren, H., 2004. Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5, 435–445.

Gibbs, A.J., Ohshima, K., Phillips, M.J., Gibbs, M.J., 2008. The prehistory of potyviruses: their initial radiation was during the dawn of agriculture. PLoS One 3, e2523.

Howie, H.L., Katzenellenbogen, R.A., Galloway, D.A., 2009. Papillomavirus E6 proteins. Virology 384, 324–334.

Karaoglu, H., Lee, C.M.Y., Meyer, W., 2005. Survey of simple sequence repeats in completed fungal genomes. Mol. Biol. Evol. 22, 639–649.

Kashi, Y., King, D.G., 2006. Simple sequence repeats as advantageous mutators in evolution. Trends Genet. TIG 22, 253–259.

Katti, M.V., Ranjekar, P.K., Gupta, V.S., 2001. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18, 1161–1167.

Kelkar, Y.D., Tyekucheva, S., Chiaromonte, F., Makova, K.D., 2008. The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Res. 18, 30–38.

Kim, T.-S., Booth, J.G., Gauch Jr., H.G., Sun, Q., Park, J., et al., 2008. Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference. BMC Genomics 9, 31.

Kofler, R., Schlötterer, C., Luschützky, E., Lelley, T., 2008. Survey of microsatellite clustering in eight fully sequenced species sheds light on the origin of compound microsatellites. BMC Genomics 9, 612.

Kruglyak, S., Durrett, R., Schug, M.D., Aquadro, C.F., 2000. Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol. Biol. Evol. 17, 1210–1219.

Li, Y.-C., Korol, A.B., Fahima, T., Nevo, E., 2004. Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. 21, 991–1007.

Li, L., Barry, P., Yeh, E., Glaser, C., Schnurr, D., et al., 2009. Identification of a novel human gammapapillomavirus species. J. Gen. Virol. 90, 2413–2417.

Longworth, M.S., Laimins, L.A., 2004. Pathogenesis of human papillomaviruses in differentiating epithelia. Microbiol. Mol. Biol. Rev. MMBR 68, 362–372.

Madsen, B.E., Villesen, P., Wiuf, C., 2008. Short tandem repeats in human exons: a target for disease mutations. BMC Genomics 9, 410.

McLaughlin-Drubin, M.E., Münger, K., 2009. Oncogenic activities of human papillomaviruses. Virus Res. 143, 195–208.

Metzgar, D., Bytof, J., Wills, C., 2000. Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 10, 72–80.

Mrázek, J., Guo, X., Shah, A., 2007. Simple sequence repeats in prokaryotic genomes. Proc. Natl. Acad. Sci. USA 104, 8472–8477.

Mudunuri, S.B., Nagarajaram, H.A., 2007. IMEx: imperfect microsatellite extractor. Bioinformatics 23, 1181–1187.

Pearson, C.E., Nichol Edamura, K., Cleary, J.D., 2005. Repeat instability: mechanisms of dynamic mutations. Nat. Rev. Genet. 6, 729–742.

Sapp, M., Volpers, C., Müller, M., Streeck, R.E., 1995. Organization of the major and minor capsid proteins in human papillomavirus type 33 virus-like particles. J. Gen. Virol. 76 (9), 2407–2412.

Scheffner, M., Werness, B.A., Huibregtse, J.M., Levine, A.J., Howley, P.M., 1990. The E6 oncoprotein encoded by human papillomavirus types 16 and 18 promotes the degradation of p53. Cell 63, 1129–1136.

Tóth, G., Gáspári, Z., Jurka, J., 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10, 967–981.

Usdin, K., 2008. The biological effects of simple tandem repeats: lessons from the repeat expansion diseases. Genome Res. 18, 1011–1019.

Weber, J.L., 1990. Informativeness of human (dC–dA)n. (dG–dT)n polymorphisms. Genomics 7, 524–530.

WHO | Human papillomavirus (HPV) and cervical cancer (n.d.), 2013. Available: http://www.who.int/mediacentre/factsheets/fs380/en/ (accessed 15 December 2013).

Xi, L.F., Kiviat, N.B., Hildesheim, A., Galloway, D.A., Wheeler, C.M., et al., 2006. Human papillomavirus type 16 and 18 variants: race-related distribution and persistence. J. Natl. Cancer Inst. 98, 1045–1052.

Zur Hausen, H., 2009. Papillomaviruses in the causation of human cancers – a brief historical account. Virology 384, 260–265.