a tale of two platforms: an evaluation of the roche gs ...€¦ · drag a placeholder onto the...

1
RESEARCH POSTER PRESENTATION DESIGN © 2011 www.PosterPresentations.com A tale of two platforms: An evaluation of the Roche GS Junior and Illumina® MiSeq next-generation sequencing instruments for forensic mitochondrial DNA analysis ABSTRACT Brittania J. Bintz, M.S.; Erin S. Burnside, M.S.; Mark R. Wilson, Ph.D. Forensic Science Program, Department of Chemistry and Physics – Western Carolina University Next-generation sequencing (NGS) refers to a suite of technologies that enable cost-effective, rapid generation of large amounts of detailed sequence information from clonal populations of individual template molecules. These methods are proving to be particularly well-suited for mitochondrial DNA analysis, and may provide forensic DNA analysts with a powerful tool that enables deconvolution of mtDNA mixtures. Recently, Illumina® has been working with members of the community to establish a human mtDNA forensic genomics consortium (IFGC) for concerted evaluation of NGS methods for potential use in mtDNA casework and databasing. In June, a set of samples was prepared consisting of quantified buccal extracts from two donors, as well as a series of mixtures of the buccal extracts at defined ratios (5, 2, 1 and 0.5%). This sample set has been distributed to participating IFGC laboratories for sequencing on multiple NGS platforms including the Ion PGM™, Roche GS Junior, and Illumina® MiSeq, to enable a cross-laboratory comparison of sequencing methods using identical samples. In our laboratory, the samples were sequenced on both the Roche GS Junior, and Illumina® MiSeq NGS platforms. Libraries from hypervariable regions 1 and 2 (HV and HV2) were sequenced on the Roche GS Junior using an amplicon library preparation approach where PCR primers were designed to include required adaptors and multiplexing indices. For sequencing on the Illumina® MiSeq, libraries were prepared using Nextera® XT in which two large amplicons covering the whole mtGenome as well as HV1 and HV2 amplicons were randomly fragmented, and adapters and indices incorporated enzymatically. The resulting data was analyzed using CLC Genomics Workbench software and variant calls were compared. The Illumina® MiSeq resulted in significantly higher coverage across all positions sequenced, giving rise to higher certainty with low-level variant calls. Further, the MiSeq allowed for detection of minor variants in all mixtures where the majority of minor variants were undetected in the 0.5% mixture with the Roche GS Junior. Finally, data from the MiSeq showed lower background noise overall, especially in homopolymeric regions when compared to data from the GS Junior. The Illumina® MiSeq offers a streamlined enzymatic library preparation approach, higher-throughput and more accurate variant detection and basecalling than the Roche GS Junior. As a result, we feel that the MiSeq is better suited for forensic mtDNA analysis in both casework and databasing laboratories. SAMPLE SET CLC Genomics Workbench v6.5 was used for all data analyses. Raw data files (.SFF files) from the Roche GS Junior were uploaded into CLC Genomics Workbench, and demultiplexed using the software. Fastq files demultiplexed during secondary analysis on the Illumina® MiSeq were also uploaded. All sample files were analyzed using the same pipeline. The data was initially mapped to the revised Cambridge reference sequence (rCRS) 3 using a local alignment option. Variant calling was performed using the quality-based variant detection method with a 0.1% minor variant detection threshold to enable capture of a majority of sites showing variability. Resulting variant tables were exported as tab delimited text files, and uploaded into the Galaxy 4 open-source cloud computing environment. A count application within Galaxy was applied to the data set to assist with categorization of unexpected variants. Full analysis parameters, and raw data files are available upon request. DATA ANALYSIS – CLC GENOMICS WORKBENCH UNEXPECTED VARIANTS IN WHOLE GENOME DATA - SUMMARY REFERENCES 1. M.F. Kavlick, H.S. Lawrence, R.T. Merritt, C. Fisher, A. Isenberg, J.M. Robertson, B. Budowle. Quantification of human mitochondrial DNA using synthesized DNA standards. J Forensic Sci. 2011:56(6):1457-63. 2. R.M. Andrews, I. Kubacka, P.F. Chinnery, R.N. Lightowlers, D.M. Turnbull, N. Howell. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genetics. 1999:23:147. 3. H. Stawski, B.J. Bintz, E.S. Burnside, M.R. Wilson. Preparing whole genome human mitochondrial libraries using Illumina® Nextera® XT. Proceedings of AAFS. 2013. 4. Public Galaxy Instance..[http://usegalaxy.org]. Brittania J. Bintz, M.S. Forensic Science Program Western Carolina University 111 Memorial Drive NSB #231 Cullowhee, NC 28723 [email protected] 828.227.3680 SEQUENCING CHEMISTRIES ACKNOWLEDGMENTS CONTACT 001 – CF30 Sole Source 003 – CM54 Sole Source 5% 003-CM54 2% 003-CM54 1% 003-CM54 0.5% 003-CM54 Reagent Blanks 9947A Positive Control 1 NGS LIBRARY PREPARATION hBp://img.medscape.com/ar.cle/738/786/738786fig2.jpg Roche 454 Pyrosequencing Illumina® Reversible Terminator Sequencing Sample Set Sample Preparation: Roche 454 Pyrosequencing Sample Preparation: Illumina® Reversible Terminator Sequencing Amplification of HV1 and HV2 regions using Fusion Primers HV1 and HV2 amplicons from each sample were assigned a single multiplex identifier (MID) for sample parsing prior to data analysis Figure 1: Chemistry and workflows of benchtop NGS instruments. On the left, Roche 454 Pyrosequencing chemistry is presented. Initially, DNA target regions are amplified using PCR with a high-fidelity polymerase, and modified primers containing 454 adapter sequences and multiplexing barcodes 5’ of a template specific sequence. Multiplexing barcodes enable post-run sample parsing, while adapters allow for clonal amplification on the surface of a bead, which is then loaded onto a PicoTiter™ plate (PTP) where sequencing chemistry occurs. Native dNTP’s are flowed iteratively across the surface of the PTP. Release of PPi following incorporation initiates an enzymatic cascade that ends in the ATP catalyzed conversion of luciferin to oxyluciferin, and the production of light. The amount of light produced is directly related to the number of nucleotides in the region being sequenced. DNA target regions are amplified using traditional PCR, and a high-fidelity polymerase. The resulting amplicons are fragmented enzymatically using the Illumina® Nextera® XT tagmentation kit, which also enables incorporation of sequencing adaptors and multiplexing barcodes during a limited cycle PCR step. Libraries are then added to an optically transparent flowcell, where they bind to anchored oligonucleotides complimentary to their incorporated adaptor sequences. Clusters consisting of clonal populations of individual template molecules are generated on the instrument. Sequencing takes place after cluster generation when modified dNTPs with base specific fluors, and blocking groups on their 3’ –OH groups are flowed across the flowcell. After each cycle, LED lights excite nucleotide specific fluors, and the software images the surface of the flowcell to catalogue the specific base incorporated at each cluster. 001HV MiSeq 003HV MiSeq 001WG MiSeq 003WG MiSeq 001HV Roche 003HV Roche 9947A HV MiSeq 9947A WG MiSeq 9947A HV Roche 92 93 94 95 96 97 98 99 100 101 102 Average frequency (%) Average Frequencies of Expected Variants for Sole Source Samples 98.9% 1.6 99.5% 0.5 0 20 40 60 80 100 120 Variant Frequency (%) Frequencies of Expected Variants in Mixed Samples S S S S Buccal swabs (x20) were obtained from two donors (001-CF30 and 003- CM54) whose whole mtGenome had been previously characterized in our laboratory using Sanger methods. The swabs were collected according to approved IRB protocol. DNA from each set of 20 buccal swabs was extracted independently using the Qiagen DNA Investigator kit surface and buccal swab protocol. A single RB was extracted alongside each set. The resulting extracts from each donor were pooled to create a large volume master extract. Pooled samples were quantified in quintuplicate using a human mtDNA specific real-time PCR assay developed by Mark Kavlick at the F.B.I. 1 Quintuplicate values were averaged after outliers were removed. Averages were used to prepare mixtures of donors in defined ratios of 5, 2, 1 and 0.5%. Donor 003-CM54 was used as the minor contributor in all mixtures. Final mixtures and sole source samples were quantified in quintuplicate using both the Quantifiler® Human kit, and the human mtDNA specific real-time PCR assay mentioned above. All quantified samples were distributed to IFGC laboratories for sequencing as requested. Sequencing on Roche GS Junior Amplification of HV1 and HV2 using traditional F.B.I. primer sets Amplification of whole mtGenome in two long overlapping amplicons2 9,065 kb 11,170 kb UNEXPECTED VARIANTS IN HV DATA – SUMMARY EXPECTED VARIANT FREQUENCIES – SOLE SOURCE SAMPLES EXPECTED VARIANT FREQUENCIES IN HV REGION – MIXED SAMPLES 001HV 003HV 5%003HV 2%003HV 1%003HV 0.5%003HV 9947AHV M R M R M R M R M R M R M R AG; GA 112 56 92 82 94 53 92 57 107 NA 102 51 63 30 CT; TC 117 66 118 73 99 67 107 71 108 NA 103 72 78 46 Transversions 5 15 3 2 3 0 0 2 2 NA 4 1 2 10 Insertions 10 16 10 17 13 15 14 17 14 NA 15 13 9 11 Deletions 15 31 14 37 13 28 14 37 15 NA 13 27 9 27 MNVs (multi- nucleotide variants) 5 3 1 5 1 7 0 3 1 NA 2 3 1 1 Total 264 187 238 216 223 170 227 187 247 NA 239 167 162 125 Average Coverage 2,346 121 15,312 121 10,939 122 10,967 126 14,327 NA 14,151 115 5,459 135 WHOLE GENOME SAMPLES – COVERAGE MAPS Figure 2: Bar graph showing average frequencies for expected variants in HV1 and HV2, detected in sole source samples. The expected variants are those that were previously detected for each donor using Sanger sequencing, and are expected be detected at or near a frequency of 100% unless otherwise noted. Averages of reported frequencies and accompanying standard deviations were calculated for the expected variants of each donor. Error bars represent standard deviations of frequency averages. Donor 001-CF30 is expected to possess a total of 9 variants from the rCRS across the HV region, with varying levels of heteroplasmy at position 16093 (data not shown). This heteroplasmy is contributing to higher standard deviation values for all donor 001 samples. Donor 003-CM54 was expected to possess a total of 14 variants from the rCRS in the HV region. Differences in alignments were often observed at positions 150C, 152T, 16193C and 16195T, expected variants for donor 003-CM54 (figure 4). When this was true, frequencies for these variants were not used to calculate frequency averages and standard deviations (omitted: 150/152-003HV MiSeq; 150/152, 16193/16195-003HV Roche). The 9947A control was expected to possess 5 variants from the rCRS, with no apparent heteroplasmy in the HV region. Figure 3: Bar graph showing average frequencies for expected variants in HV1 and HV2, detected in mixed samples. The expected variants are those that were previously detected for each donor using Sanger sequencing. All variants shared between donors are expected to be detected at or near a frequency of 100%, barring heteroplasmic sites. Conversely, depending on the mixture, the donor specific variant frequencies change (95:5; 98:2, 99:1, and 99.5:0.5% respectively) with donor 003- CM54 representing the minor contributor in all mixtures. In the HV region, donor 001-CF30 and donor 003-CM54 share a total of 5 variants. In addition, donor 001-CF30 possesses 4 unshared variants from the rCRS, and donor 003-CM54 possesses 8. Similar to the data presented in figure 2, heteroplasmy for donor 001-CF30 at position 16093, is contributing to higher than expected standard deviations in frequency averages. Also, frequencies for variants with observed alignment anomalies were omitted from calculations in this data set (omitted: positions 150/152 from all donor 003 samples and positions 16193/16195 in 5%003HV and 1%003HV data generated with the MiSeq). In all MiSeq HV sample data, frequencies of minor contributor variants were higher than expected. This was not observed in data generated using the Roche GS Junior. This was also observed to a much lesser extent in whole genome data also generated using the Illumina® MiSeq (see figure 4). This is likely a result of differences in library preparation, since the same samples were used for all sequencing runs. Additionally, in the 0.5% mixture, all minor contributor variants dropped out (were below the 0.1% minor variant frequency threshold) except 1. This is a direct result of the higher coverage achieved with the Illumina® MiSeq, and shows that the MiSeq is more sensitive than the Roche GS Junior. Table 1: Summary of characterized unexpected variants from HVI and HV2 samples in Roche GS Junior and Illumina® MiSeq data. All unexpected variants were counted using Galaxy open-source cloud computing software. The resulting counts were manually categorized and tabulated. In all cases, when compared to the average coverage calculated for each sample, the number of unexpected variants in Roche GS Junior data sets is large compared to the total number of unexpected variants in Illumina® MiSeq data sets. This indicates that the Roche GS Junior yields more noise than the Illumina® MiSeq. However, additional research is needed to further characterize the unexpected variants as noise. The 9947A control, derived from a human cell line possessed fewer numbers of unexpected variants when compared to the samples obtained from buccal swabs. Figure 5: Coverage maps for Illumina® MiSeq whole genome data (donors 001- CF30 and 003-CM54) obtained from Illumina® MiSeq Reporter secondary analysis software. Coverage across the entire mtGenome was achieved using a long PCR approach developed in our laboratory. Q-scores are high (30) throughout the length of the genome. The authors of this work wish to gratefully acknowledge Illumina® for their continuing support. Special thanks to Cynde Holt, Joe Varlaro, Joe Walsh, Kathryn Stevens, Carey Davis and Tom Richardson. We also wish to thank Western Carolina University for supporting some of this work. EXPECTED VARIANT FREQUENCIES – WHOLE GENOME DATA 0 20 40 60 80 100 120 Variant Frequency (%) Frequencies of Expected Variants in Mixed Whole Genome Samples S=Shared, 001=Donor 001-CF32, 003=Donor 003-CM54 S S S S 003 7.05% 0.91 003 2.67% 0.38 003 1.43% 0.16 003 0.72% 0.05 001WG 003WG 5%003WG 2%003WG 1%003WG 0.5%003WG 9947AWG AG; GA 3384 4064 3428 3341 3269 1775 2972 CT; TC 3330 4010 3477 3400 3255 1776 3116 Transversions 47 324 67 51 89 14 78 Insertions 40 48 43 42 34 36 23 Deletions 210 316 219 180 269 118 203 MNVs (multi- nucleotide variants) 26 26 29 27 39 29 20 Total 7037 8788 7263 7041 6955 3748 6412 Average Coverage 11,477 9,920 10,274 9,270 14,326 14,151 13,294 Table 2: Summary of unexpected variant characterizations for whole genome samples from Illumina® MiSeq data. Average coverage for whole genome samples is high despite being sequenced in the same run with 7 samples consisting of pooled HV amplicons, 21 unrelated samples consisting of 2 pooled amplicons, and controls. The types of variants seen across the whole mtGenome sequenced using the Illumina® MiSeq are consistent. No whole genome data was obtained from the Roche GS Junior, so no direct instrumentation comparisons can be made with this data. Figure 6: Bar graph showing average frequencies for expected variants for whole genome data, detected in mixed samples. All variants shared between donors are expected to be detected at a frequency of 100%. However, depending on the mixture, the donor specific variant frequencies will change (95:5; 98:2, 99:1, and 99.5:0.5% respectively) with donor 003-CM54 representing the minor contributor in all mixtures. Frequency averages and standard deviations were calculated for the expected variants of each donor. Error bars represent standard deviations of frequency averages. Across the genome, donor 001-CF30 and donor 003-CM54 share a total of 21 variants. In addition, donor 001-CF30 possesses 7 unshared variants from the rCRS, and donor 003-CM54 possesses 18. Frequencies for variants with observed alignment anomalies were omitted from calculations in this data set as well (omitted: 150/152 in all samples; 16193/16195-5%003WG ). Furthermore, several data points were not detected in 0.5%003WG. A large potion of data is missing from this set. More research is being conducted to determine the cause. Overall, the frequencies of the expected variants seem to more accurate when using our whole genome approach, than when using an small amplicon approach (figure 3). CONCLUSIONS The Illumina® MiSeq allows for rapid enzymatic library preparation, higher throughput, and longer read-lengths than the Roche GS Junior. Much higher coverage is achieved per sample when using the Illumina® MiSeq versus the Roche GS Junior, even with additional whole genome samples included in the MiSeq run. As a result, we feel that the MiSeq can result in higher sensitivity depending on the design of the run, and lead to detection of minor variants at a lower frequency than the GS Junior. Additionally, it appears that the Roche GS Junior is more prone to noise than the Illumina® MiSeq, especially in areas containing homopolymers. However, this observation is not novel. Additional work is being done to reanalyze the data from the Roche GS Junior with filtering mechanisms designed to remove sequencing artifacts associated with homopolymers. Also, unexpected variants which were labeled as “noise” for this work, are being further studied to identify positions that may be consistent with true biological variation. The types of unexpected variants observed in data obtained using both instruments is consistent across samples sequenced. Overall, we feel that the Illumina® MiSeq is well-suited for forensic mtDNA analysis though additional validation studies are required to move this technique into the crime laboratory. HV1 HV2 15997 16391 48 409 Nextera® XT – Enzymatic Library Preparation Amplicons are pooled for each sample Sequencing on Illumina® MiSeq – 2x151 v2 kit Roche GS Junior HV amplicons 2 sole source samples 4 mixed samples 1 positive control (9947A) 2 reagent blanks =7 samples + 2 reagent blanks Illumina® MiSeq 2 sole source samples 4 mixed samples 1 positive control 2 reagent blanks 2 sole source samples 4 mixed samples 1 positive control 2 reagent blanks =35 samples + 4 reagent blanks Samples Sequenced on each NGS run Library preparation (emPCR and cleanup) 5%003HV MiSeq 5%003HV Roche 2%003HV Roche 2%003HV MiSeq 0.5%003HV Roche 0.5%003HV MiSeq 1%003HV MiSeq S S S 001 001 001 001 001 001 001 003 12.9% 4.52 99.82% 0.12 85.1% 3.56 99.2% 1.44 94.9% 1.75 003 2.57% 0.07 99.8% 0.23 92.13% 3.52 003 3.45% 1.15 99.5% 0.65 96.1% 1.92 003 1.74% 0.59 99.75% 0.28 94.5% 3.41 003 2.51% 0.38 99.8% 0.18 94.8% 4.52 003 1.59% 0.55 99.7% 0.26 97.9% 1.88 003 0.1% NA S=Shared, 001=Donor 001-CF32, 003=Donor 003-CM54 001 001 001 001 99.4% 0.64 90.3% 3.08 99.2% 0.86 94.6% 3.36 99.3% 0.81 96.2% 3.56 98.8% 2.37 96.7% 3.27 Donor ID Variant frequency Standard devia.on 5%003WG 2%003WG 1%003WG 0.5%003WG 99.3% 3.13 99.6% 0.4 99.0% 1.72 99.3% 0.57 99.5% 0.4 99.8% 0.09 99.8% 0.2 Average frequency Standard devia.on ALIGNMENT DIFFERENCES rCRS TCATCCTATTATT 003SANGER TCACCTTATTATT rCRS TCA_TCCTATTATT 003CLC TCATTCC_ATTATT rCRS position 150 152 rCRS position 148 152 150: CT Transition 152: TC Transition 148: Insertion T 152: Deletion T Sanger data with positions 150 and 152 highlighted NGS data with alignment highlighted Figure 4: Alignment differences in Sanger and NGS data at positions 150 and 152 for donor 003-CM54. Differences in alignment at positions 150 and 152 were detected across all samples containing DNA from donor 003-CM54. Alignment discrepancies between bioinformatics algorithms must be considered when implementing NGS into databasing and casework laboratories. Note: No HV1 data was obtained for sample 1%003HV Roche. Data for this sample is not shown.

Upload: others

Post on 15-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A tale of two platforms: An evaluation of the Roche GS ...€¦ · Drag a placeholder onto the poster area, size it, and click it to edit. generated on the instrument. Sequencing

RESEARCH POSTER PRESENTATION DESIGN © 2011

www.PosterPresentations.com

QUICK DESIGN GUIDE (--THIS SECTION DOES NOT PRINT--)

This PowerPoint 2007 template produces a 36”x72” professional poster. It will save you valuable time placing titles, subtitles, text, and graphics. Use it to create your presentation. Then send it to PosterPresentations.com for premium quality, same day affordable printing. We provide a series of online tutorials that will guide you through the poster design process and answer your poster production questions. View our online tutorials at: http://bit.ly/Poster_creation_help (copy and paste the link into your web browser). For assistance and to order your printed poster call PosterPresentations.com at 1.866.649.3004

Object Placeholders

Use the placeholders provided below to add new elements to your poster: Drag a placeholder onto the poster area, size it, and click it to edit. Section Header placeholder Use section headers to separate topics or concepts within your presentation. Text placeholder Move this preformatted text placeholder to the poster to add a new body of text. Picture placeholder Move this graphic placeholder onto your poster, size it first, and then click it to add a picture to the poster.

QUICK TIPS (--THIS SECTION DOES NOT PRINT--)

This PowerPoint template requires basic PowerPoint (version 2007 or newer) skills. Below is a list of commonly asked questions specific to this template. If you are using an older version of PowerPoint some template features may not work properly.

Using the template

Verifying the quality of your graphics Go to the VIEW menu and click on ZOOM to set your preferred magnification. This template is at 50% the size of the final poster. All text and graphics will be printed at 200% their size. To see what your poster will look like when printed, set the zoom to 200% and evaluate the quality of all your graphics before you submit your poster for printing. Using the placeholders To add text to this template click inside a placeholder and type in or paste your text. To move a placeholder, click on it once (to select it), place your cursor on its frame and your cursor will change to this symbol: Then, click once and drag it to its new location where you can resize it as needed. Additional placeholders can be found on the left side of this template. Modifying the layout This template has four different column layouts. Right-click your mouse on the background and click on “Layout” to see the layout options. The columns in the provided layouts are fixed and cannot be moved but advanced users can modify any layout by going to VIEW and then SLIDE MASTER. Importing text and graphics from external sources TEXT: Paste or type your text into a pre-existing placeholder or drag in a new placeholder from the left side of the template. Move it anywhere as needed. PHOTOS: Drag in a picture placeholder, size it first, click in it and insert a photo from the menu. TABLES: You can copy and paste a table from an external document onto this poster template. To make the text fit better in the cells of an imported table, right-click on the table, click FORMAT SHAPE then click on TEXT BOX and change the INTERNAL MARGIN values to 0.25 Modifying the color scheme To change the color scheme of this template go to the “Design” menu and click on “Colors”. You can choose from the provide color combinations or you can create your own.

©  2011  PosterPresenta.ons.com          2117  Fourth  Street  ,  Unit  C          Berkeley    CA    94710          [email protected]  

Student discounts are available on our Facebook page. Go to PosterPresentations.com and click on the FB icon.

A tale of two platforms: An evaluation of the Roche GS Junior and Illumina® MiSeq next-generation sequencing instruments for forensic mitochondrial DNA analysis

ABSTRACT

Brittania J. Bintz, M.S.; Erin S. Burnside, M.S.; Mark R. Wilson, Ph.D. Forensic Science Program, Department of Chemistry and Physics – Western Carolina University

Next-generation sequencing (NGS) refers to a suite of technologies that enable cost-effective, rapid generation of large amounts of detailed sequence information from clonal populations of individual template molecules. These methods are proving to be particularly well-suited for mitochondrial DNA analysis, and may provide forensic DNA analysts with a powerful tool that enables deconvolution of mtDNA mixtures. Recently, Illumina® has been working with members of the community to establish a human mtDNA forensic genomics consortium (IFGC) for concerted evaluation of NGS methods for potential use in mtDNA casework and databasing. In June, a set of samples was prepared consisting of quantified buccal extracts from two donors, as well as a series of mixtures of the buccal extracts at defined ratios (5, 2, 1 and 0.5%). This sample set has been distributed to participating IFGC laboratories for sequencing on multiple NGS platforms including the Ion PGM™, Roche GS Junior, and Illumina® MiSeq, to enable a cross-laboratory comparison of sequencing methods using identical samples. In our laboratory, the samples were sequenced on both the Roche GS Junior, and Illumina® MiSeq NGS platforms. Libraries from hypervariable regions 1 and 2 (HV and HV2) were sequenced on the Roche GS Junior using an amplicon library preparation approach where PCR primers were designed to include required adaptors and multiplexing indices. For sequencing on the Illumina® MiSeq, libraries were prepared using Nextera® XT in which two large amplicons covering the whole mtGenome as well as HV1 and HV2 amplicons were randomly fragmented, and adapters and indices incorporated enzymatically. The resulting data was analyzed using CLC Genomics Workbench software and variant calls were compared. The Illumina® MiSeq resulted in significantly higher coverage across all positions sequenced, giving rise to higher certainty with low-level variant calls. Further, the MiSeq allowed for detection of minor variants in all mixtures where the majority of minor variants were undetected in the 0.5% mixture with the Roche GS Junior. Finally, data from the MiSeq showed lower background noise overall, especially in homopolymeric regions when compared to data from the GS Junior. The Illumina® MiSeq offers a streamlined enzymatic library preparation approach, higher-throughput and more accurate variant detection and basecalling than the Roche GS Junior. As a result, we feel that the MiSeq is better suited for forensic mtDNA analysis in both casework and databasing laboratories.

SAMPLE SET

CLC Genomics Workbench v6.5 was used for all data analyses. Raw data files (.SFF files) from the Roche GS Junior were uploaded into CLC Genomics Workbench, and demultiplexed using the software. Fastq files demultiplexed during secondary analysis on the Illumina® MiSeq were also uploaded. All sample files were analyzed using the same pipeline. The data was initially mapped to the revised Cambridge reference sequence (rCRS)3 using a local alignment option. Variant calling was performed using the quality-based variant detection method with a 0.1% minor variant detection threshold to enable capture of a majority of sites showing variability. Resulting variant tables were exported as tab delimited text files, and uploaded into the Galaxy4 open-source cloud computing environment. A count application within Galaxy was applied to the data set to assist with categorization of unexpected variants. Full analysis parameters, and raw data files are available upon request.

DATA ANALYSIS – CLC GENOMICS WORKBENCH

UNEXPECTED VARIANTS IN WHOLE GENOME DATA - SUMMARY

REFERENCES 1.  M.F. Kavlick, H.S. Lawrence, R.T. Merritt, C. Fisher, A. Isenberg, J.M. Robertson, B. Budowle. Quantification of

human mitochondrial DNA using synthesized DNA standards. J Forensic Sci. 2011:56(6):1457-63. 2.  R.M. Andrews, I. Kubacka, P.F. Chinnery, R.N. Lightowlers, D.M. Turnbull, N. Howell. Reanalysis and revision of the

Cambridge reference sequence for human mitochondrial DNA. Nat Genetics. 1999:23:147. 3.  H. Stawski, B.J. Bintz, E.S. Burnside, M.R. Wilson. Preparing whole genome human mitochondrial libraries using

Illumina® Nextera® XT. Proceedings of AAFS. 2013. 4.  Public Galaxy Instance..[http://usegalaxy.org].

Brittania J. Bintz, M.S. Forensic Science Program

Western Carolina University 111 Memorial Drive

NSB #231 Cullowhee, NC 28723

[email protected] 828.227.3680

SEQUENCING CHEMISTRIES

ACKNOWLEDGMENTS

CONTACT

001 – CF30 Sole Source

003 – CM54 Sole Source

5% 003-CM54

2% 003-CM54

1% 003-CM54

0.5% 003-CM54

Reagent Blanks

9947A Positive Control

1 NGS LIBRARY PREPARATION

hBp://img.medscape.com/ar.cle/738/786/738786-­‐fig2.jpg  N

N

NH2

OO

O

OH

P

O

OH

P

O

O

OH

OH

OH

O

O

P

N

N

NH2

OO

O

OH

IncorporationDetectionDeblockFluorophore Removal

Roche 454 Pyrosequencing Illumina® Reversible Terminator Sequencing

Sample Set

Sample Preparation: Roche 454 Pyrosequencing

Sample Preparation: Illumina® Reversible Terminator

Sequencing

Amplification of HV1 and HV2 regions using Fusion

Primers

HV1 and HV2 amplicons from each sample were assigned a

single multiplex identifier (MID) for sample parsing prior

to data analysis

Figure 1: Chemistry and workflows of benchtop NGS instruments. On the left, Roche 454 Pyrosequencing chemistry is presented. Initially, DNA target regions are amplified using PCR with a high-fidelity polymerase, and modified primers containing 454 adapter sequences and multiplexing barcodes 5’ of a template specific sequence. Multiplexing barcodes enable post-run sample parsing, while adapters allow for clonal amplification on the surface of a bead, which is then loaded onto a PicoTiter™ plate (PTP) where sequencing chemistry occurs. Native dNTP’s are flowed iteratively across the surface of the PTP. Release of PPi following incorporation initiates an enzymatic cascade that ends in the ATP catalyzed conversion of luciferin to oxyluciferin, and the production of light. The amount of light produced is directly related to the number of nucleotides in the region being sequenced. On the right, Illumina® chemistry is shown. DNA target regions are amplified using traditional PCR, and a high-fidelity polymerase. The resulting amplicons are fragmented enzymatically using the Illumina® Nextera® XT tagmentation kit, which also enables incorporation of sequencing adaptors and multiplexing barcodes during a limited cycle PCR step. Libraries are then added to an optically transparent flowcell, where they bind to anchored oligonucleotides complimentary to their incorporated adaptor sequences. Clusters consisting of clonal populations of individual template molecules are generated on the instrument. Sequencing takes place after cluster generation when modified dNTPs with base specific fluors, and blocking groups on their 3’ –OH groups are flowed across the flowcell. After each cycle, LED lights excite nucleotide specific fluors, and the software images the surface of the flowcell to catalogue the specific base incorporated at each cluster.

001HV MiSeq

003HV MiSeq

001WG MiSeq

003WG MiSeq

001HV Roche

003HV Roche

9947A HV

MiSeq

9947A WG

MiSeq

9947A HV

Roche

92  

93  

94  

95  

96  

97  

98  

99  

100  

101  

102  

Ave

rage

freq

uenc

y (%

)

Average Frequencies of Expected Variants for Sole Source Samples

98.9% 1.6

99.5% 0.5

0  

20  

40  

60  

80  

100  

120  

Var

iant

Fre

quen

cy (%

)

Frequencies of Expected Variants in Mixed Samples

S S S S Buccal swabs (x20) were obtained from two donors (001-CF30 and 003-CM54) whose whole mtGenome had been previously characterized in our laboratory using Sanger methods. The swabs were collected according to approved IRB protocol. DNA from each set of 20 buccal swabs was extracted independently using the Qiagen DNA Investigator kit surface and buccal swab protocol. A single RB was extracted alongside each set. The resulting extracts from each donor were pooled to create a large volume master extract. Pooled samples were quantified in quintuplicate using a human mtDNA specific real-time PCR assay developed by Mark Kavlick at the F.B.I.1 Quintuplicate values were averaged after outliers were removed. Averages were used to prepare mixtures of donors in defined ratios of 5, 2, 1 and 0.5%. Donor 003-CM54 was used as the minor contributor in all mixtures. Final mixtures and sole source samples were quantified in quintuplicate using both the Quantifiler® Human kit, and the human mtDNA specific real-time PCR assay mentioned above. All quantified samples were distributed to IFGC laboratories for sequencing as requested.

Sequencing on Roche GS Junior

Amplification of HV1 and HV2 using traditional F.B.I.

primer sets

Amplification of whole mtGenome in two long overlapping amplicons2

9,065 kb 11,170 kb

UNEXPECTED VARIANTS IN HV DATA – SUMMARY

EXPECTED VARIANT FREQUENCIES – SOLE SOURCE SAMPLES

EXPECTED VARIANT FREQUENCIES IN HV REGION – MIXED SAMPLES

001HV 003HV 5%003HV 2%003HV 1%003HV 0.5%003HV 9947AHV

M R M R M R M R M R M R M R

AàG; GàA 112 56 92 82 94 53 92 57 107 NA 102 51 63 30

CàT; TàC 117 66 118 73 99 67 107 71 108 NA 103 72 78 46

Transversions 5 15 3 2 3 0 0 2 2 NA 4 1 2 10

Insertions 10 16 10 17 13 15 14 17 14 NA 15 13 9 11

Deletions 15 31 14 37 13 28 14 37 15 NA 13 27 9 27

MNVs (multi-nucleotide variants)

5 3 1 5 1 7 0 3 1 NA 2 3 1 1

Total 264 187 238 216 223 170 227 187 247 NA 239 167 162 125

Average Coverage

2,346 121 15,312 121 10,939 122 10,967 126 14,327 NA 14,151 115 5,459 135

WHOLE GENOME SAMPLES – COVERAGE MAPS

Figure 2: Bar graph showing average frequencies for expected variants in HV1 and HV2, detected in sole source samples. The expected variants are those that were previously detected for each donor using Sanger sequencing, and are expected be detected at or near a frequency of 100% unless otherwise noted. Averages of reported frequencies and accompanying standard deviations were calculated for the expected variants of each donor. Error bars represent standard deviations of frequency averages. Donor 001-CF30 is expected to possess a total of 9 variants from the rCRS across the HV region, with varying levels of heteroplasmy at position 16093 (data not shown). This heteroplasmy is contributing to higher standard deviation values for all donor 001 samples. Donor 003-CM54 was expected to possess a total of 14 variants from the rCRS in the HV region. Differences in alignments were often observed at positions 150C, 152T, 16193C and 16195T, expected variants for donor 003-CM54 (figure 4). When this was true, frequencies for these variants were not used to calculate frequency averages and standard deviations (omitted: 150/152-003HV MiSeq; 150/152, 16193/16195-003HV Roche). The 9947A control was expected to possess 5 variants from the rCRS, with no apparent heteroplasmy in the HV region.

Figure 3: Bar graph showing average frequencies for expected variants in HV1 and HV2, detected in mixed samples. The expected variants are those that were previously detected for each donor using Sanger sequencing. All variants shared between donors are expected to be detected at or near a frequency of 100%, barring heteroplasmic sites. Conversely, depending on the mixture, the donor specific variant frequencies change (95:5; 98:2, 99:1, and 99.5:0.5% respectively) with donor 003-CM54 representing the minor contributor in all mixtures. In the HV region, donor 001-CF30 and donor 003-CM54 share a total of 5 variants. In addition, donor 001-CF30 possesses 4 unshared variants from the rCRS, and donor 003-CM54 possesses 8. Similar to the data presented in figure 2, heteroplasmy for donor 001-CF30 at position 16093, is contributing to higher than expected standard deviations in frequency averages. Also, frequencies for variants with observed alignment anomalies were omitted from calculations in this data set (omitted: positions 150/152 from all donor 003 samples and positions 16193/16195 in 5%003HV and 1%003HV data generated with the MiSeq). In all MiSeq HV sample data, frequencies of minor contributor variants were higher than expected. This was not observed in data generated using the Roche GS Junior. This was also observed to a much lesser extent in whole genome data also generated using the Illumina® MiSeq (see figure 4). This is likely a result of differences in library preparation, since the same samples were used for all sequencing runs. Additionally, in the 0.5% mixture, all minor contributor variants dropped out (were below the 0.1% minor variant frequency threshold) except 1. This is a direct result of the higher coverage achieved with the Illumina® MiSeq, and shows that the MiSeq is more sensitive than the Roche GS Junior.

Table 1: Summary of characterized unexpected variants from HVI and HV2 samples in Roche GS Junior and Illumina® MiSeq data. All unexpected variants were counted using Galaxy open-source cloud computing software. The resulting counts were manually categorized and tabulated. In all cases, when compared to the average coverage calculated for each sample, the number of unexpected variants in Roche GS Junior data sets is large compared to the total number of unexpected variants in Illumina® MiSeq data sets. This indicates that the Roche GS Junior yields more noise than the Illumina® MiSeq. However, additional research is needed to further characterize the unexpected variants as noise. The 9947A control, derived from a human cell line possessed fewer numbers of unexpected variants when compared to the samples obtained from buccal swabs.

Figure 5: Coverage maps for Illumina® MiSeq whole genome data (donors 001-CF30 and 003-CM54) obtained from Illumina® MiSeq Reporter secondary analysis software. Coverage a c r o s s t h e e n t i r e mtGenome was achieved u s i n g a l o n g P C R approach developed in our laboratory. Q-scores are high (≥30) throughout the length of the genome.

The authors of this work wish to gratefully acknowledge Illumina® for their continuing support. Special thanks to Cynde Holt, Joe Varlaro, Joe Walsh, Kathryn Stevens, Carey Davis and Tom Richardson. We also wish to thank Western Carolina University for supporting some of this work.

EXPECTED VARIANT FREQUENCIES – WHOLE GENOME DATA

0  

20  

40  

60  

80  

100  

120  

Var

iant

Fre

quen

cy (%

)

Frequencies of Expected Variants in Mixed Whole Genome Samples S=Shared, 001=Donor 001-CF32, 003=Donor 003-CM54

S S S S

003 7.05% 0.91

003 2.67% 0.38

003 1.43% 0.16

003 0.72% 0.05

001WG 003WG 5%003WG 2%003WG 1%003WG 0.5%003WG 9947AWG

AàG; GàA 3384 4064 3428 3341 3269 1775 2972

CàT; TàC 3330 4010 3477 3400 3255 1776 3116

Transversions 47 324 67 51 89 14 78

Insertions 40 48 43 42 34 36 23

Deletions 210 316 219 180 269 118 203

MNVs (multi-nucleotide variants)

26 26 29 27 39 29 20

Total 7037 8788 7263 7041 6955 3748 6412

Average Coverage

11,477 9,920 10,274 9,270 14,326 14,151 13,294

Table 2: Summary of unexpected variant characterizations for whole genome samples from Illumina® MiSeq data. Average coverage for whole genome samples is high despite being sequenced in the same run with 7 samples consisting of pooled HV amplicons, 21 unrelated samples consisting of 2 pooled amplicons, and controls. The types of variants seen across the whole mtGenome sequenced using the Illumina® MiSeq are consistent. No whole genome data was obtained from the Roche GS Junior, so no direct instrumentation comparisons can be made with this data.

Figure 6: Bar graph showing average frequencies for expected variants for whole genome data, detected in mixed samples. All variants shared between donors are expected to be detected at a frequency of 100%. However, depending on the mixture, the donor specific variant frequencies will change (95:5; 98:2, 99:1, and 99.5:0.5% respectively) with donor 003-CM54 representing the minor contributor in all mixtures. Frequency averages and standard deviations were calculated for the expected variants of each donor. Error bars represent standard deviations of frequency averages. Across the genome, donor 001-CF30 and donor 003-CM54 share a total of 21 variants. In addition, donor 001-CF30 possesses 7 unshared variants from the rCRS, and donor 003-CM54 possesses 18. Frequencies for variants with observed alignment anomalies were omitted from calculations in this data set as well (omitted: 150/152 in all samples; 16193/16195-5%003WG ). Furthermore, several data points were not detected in 0.5%003WG. A large potion of data is missing from this set. More research is being conducted to determine the cause. Overall, the frequencies of the expected variants seem to more accurate when using our whole genome approach, than when using an small amplicon approach (figure 3).

CONCLUSIONS • The Illumina® MiSeq allows for rapid enzymatic library preparation, higher throughput, and longer read-lengths than the Roche GS Junior. • Much higher coverage is achieved per sample when using the Illumina® MiSeq versus the Roche GS Junior, even with additional whole genome samples included in the MiSeq run. As a result, we feel that the MiSeq can result in higher sensitivity depending on the design of the run, and lead to detection of minor variants at a lower frequency than the GS Junior. • Additionally, it appears that the Roche GS Junior is more prone to noise than the Illumina® MiSeq, especially in areas containing homopolymers. However, this observation is not novel. Additional work is being done to reanalyze the data from the Roche GS Junior with filtering mechanisms designed to remove sequencing artifacts associated with homopolymers. Also, unexpected variants which were labeled as “noise” for this work, are being further studied to identify positions that may be consistent with true biological variation. • The types of unexpected variants observed in data obtained using both instruments is consistent across samples sequenced. • Overall, we feel that the Illumina® MiSeq is well-suited for forensic mtDNA analysis though additional validation studies are required to move this technique into the crime laboratory.

HV1

HV2

15997

16391 48

409

Nextera® XT – Enzymatic Library Preparation

Amplicons are pooled for each sample

Sequencing on Illumina® MiSeq – 2x151 v2 kit

Roche GS Junior HV amplicons 2 sole source samples 4 mixed samples 1 positive control (9947A) 2 reagent blanks =7 samples + 2 reagent blanks

Illumina® MiSeq HV Amplicons (pooled) 2 sole source samples 4 mixed samples 1 positive control 2 reagent blanks Whole genome amplicons 2 sole source samples 4 mixed samples 1 positive control 2 reagent blanks +21 additional unrealated samples =35 samples + 4 reagent blanks

Samples Sequenced on each NGS run

Library preparation (emPCR and cleanup)

5%003HV MiSeq

5%003HV Roche

2%003HV Roche

2%003HV MiSeq

0.5%003HV Roche

0.5%003HV MiSeq

1%003HV MiSeq

S S S 001

001 001 001 001 001 001

003 12.9% 4.52

99.82% 0.12

85.1% 3.56

99.2% 1.44 94.9%

1.75

003 2.57% 0.07

99.8% 0.23

92.13% 3.52

003 3.45% 1.15

99.5% 0.65

96.1% 1.92

003 1.74% 0.59

99.75% 0.28

94.5% 3.41

003 2.51% 0.38

99.8% 0.18

94.8% 4.52

003 1.59% 0.55

99.7% 0.26

97.9% 1.88

003 0.1% NA

S=Shared, 001=Donor 001-CF32, 003=Donor 003-CM54

001 001 001 001 99.4%

0.64 90.3% 3.08

99.2% 0.86

94.6% 3.36

99.3% 0.81 96.2%

3.56

98.8% 2.37

96.7% 3.27

Donor  ID  Variant  frequency  Standard  devia.on  

5%003WG 2%003WG 1%003WG 0.5%003WG

99.3% 3.13

99.6% 0.4

99.0% 1.72

99.3% 0.57

99.5% 0.4

99.8% 0.09

99.8% 0.2

Average  frequency  Standard  devia.on  

ALIGNMENT DIFFERENCES

rCRS TCATCCTATTATT 003SANGER TCACCTTATTATT rCRS TCA_TCCTATTATT 003CLC TCATTCC_ATTATT

rCRS position 150 152

rCRS position 148 152

150: CàT Transition 152: TàC Transition

148: Insertion T 152: Deletion T

Sanger data with positions 150 and 152 highlighted

NGS data with alignment

highlighted

Figure 4: Alignment differences in Sanger and NGS data at positions 150 and 152 for donor 003-CM54. Differences in alignment at positions 150 and 152 were detected across all samples containing DNA from donor 003-CM54. Alignment discrepancies between bioinformatics algorithms must be considered when implementing NGS into databasing and casework laboratories.

Note: No HV1 data was obtained for sample 1%003HV Roche. Data for this sample is not shown.