transcript exploring genetic diversity - mapping the genetic … · 2017-01-25 · multiplexed pcr...

Exploring Genetic Diversity: Mapping the Genetic Landscape through Next‐Generation Sequencing

[0:00:00] Sean Sanders: Hello and welcome to this Science/AAAS webinar. My name is Sean

Sanders and I’m the commercial editor and webinar editor at Science. Slide 1 In today’s webinar, we will be exploring genetic diversity. When

attempting to characterize the genetic diversity of a population, we face a number of challenges, one of which is the large number of samples needed. An effective way to solve this problem is the use of multiplexed PCR reactions run with barcoding, which offers the possibility of sequencing hundreds of unique samples utilizing next‐generation sequencing technologies. This process, known as amplicon resequencing, allows for each PCR amplicon to be sequenced individually, enabling both the identification of rare variants and the assignment of haplotypes.

This webinar aims to provide you, the viewer, with an overview of

amplicon resequencing‐based approaches in the context of investigating the complexity and diversity seen in immune system genes. We will cover best practices for setting up amplicon resequencing projects and discuss the application of resequencing technologies to study immunologic and other forms of genetic diversity.

With me today in the studio, we have three exceptional scientists who

have generously agreed to share their expertise with us. We have to my left Dr. Michael Smith from SAIC‐Frederick and the National Cancer Institute, not too far from us in Frederick, Maryland. Dr. Francois Vigneault is next from Harvard Medical School in Boston, Massachusetts. And finally, we have Dr. Henry Erlich from Roche Molecular Systems in Pleasanton, California and Children’s Hospital Oakland Research Institute in Oakland, California. Glad to have you all here. Thanks for being here.

A reminder to our online viewers that you can see an enlarged version

of the slides by clicking the enlarge slides button located underneath the slide window of your web console. You can also download a PDF copy of the slides by using the download slides button. If you’re joining us live, you can submit a question to the panel at any time by typing it

into the ask‐a‐question box on the bottom left of your viewing console below the video screen and clicking the submit button. Please remember, keep your questions as concise as possible as this will give them the best chance of being put to the panel. We’ll get to as many of your questions as possible in the Q&A session following the presentations.

Finally, thank you to Roche and 454 Sequencing for the sponsorship of

today’s webinar. So let’s get started. I’d like to introduce my first speaker for the

webinar, Dr. Michael Smith. Slide 2 Dr. Smith obtained his Ph.D. from Johns Hopkins University in Baltimore

and completed his postdoctoral training in Molecular Evolution at the University of California in San Diego. He was assistant director of the Human Genome Center at the Salk Institute for Biological Sciences in La Jolla, California and was recently a principal investigator at SAIC‐Frederick in the Laboratory of Genomic Diversity. Dr. Smith is currently the director of the Genetics and Genomics Group with the Advanced Technology Program of SAIC‐Frederick at the National Cancer Institute. This group comprises two laboratories, the Sequencing Facility and the Laboratory of Molecular Technology. Dr. Smith provides scientific vision and management oversight to both of these laboratories.

Welcome, Dr. Smith. Thanks for being here.

Dr. Michael Smith: Thank you, Sean. First of all, I’d like to thank Science for sponsoring the webinar and

Roche for sponsoring the webinar. I was tasked with talking about genetic diversity and a sort of introduction to that. We’re trying to get the slides right on our end here. And today, we’re going to cover that topic and really going to start with an introduction.

And so we have topics that I’d like to cover: genetic diversity itself,

what I’ll call massively parallel sequencing, it’s also got too many names at this point; next‐generation sequencing also; applications that are in use and how genetic diversity relates to those or genomic diversity; the challenge of error rates in these technologies; and then talk a little bit

about sequence capture as one method to look at genetic diversity; and also amplification with an example; and just talk a little bit about validation.

Some of the kinds of genetic variation we’re interested in are single

nucleotide polymorphisms and copy number variations, and of course, those are found on haplotypes, and sort of the simplest way to think about a haplotype is the sequence of a chromosome or a large region of DNA. And certainly, one of the challenges in the field these days is putting together the information about haplotypes.

Just a very quick population genetics primer to remind the audience at

least for the human case of the much higher diversity in Africa; that migration patterns, both ancient and recent, are very important to take into account; of the bottleneck in Europe 20,000‐30,000 years ago; and just to remind viewers of some tremendous resources out there that we really won’t go into today, the HapMap and the 1000 Genomes Project.

[0:05:07] Slide 5 So when we talk about massively parallel sequencing, you’ve got a

number of platforms out there that give you access to DNA sequencing on a scale that we certainly weren’t even thinking of 5 or 10 years ago, and those include Roche’s pyrosequencing or 454; a newcomer on the block, Pacific Biosciences, for real‐time single‐molecule sequencing; Illumina or Solexa’s sequencing‐by‐synthesis short reads; Life Technologies’ SOLiD; Helicos; Complete Genomics, which has a service; and then coming to the market, we have things like Ion Torrent and many other companies really pushing these new varieties of sequencing. I put a star next to the three technologies we have in the core facilities that are run at the NCI.

So in terms of applications, we really can think about traditional

genomic sequencing or regional capture of DNA. We’re thinking about exomes. We can also think about amplicons, one of the topics of the day. And then also applicable to this are issues of, say, mRNA‐seq, where you can also get some haplotypic information across large

regions because of the splicing together of exons. miRNAs, ChIP‐seq, DNase hypersensitivity, metagenomics, even methylation, and viral diversity and discovery: for all of these topics, genetic variation is still a very important component.

One of the challenges of the field I think is the error rates and really the

different way the different platforms work. So on one extreme, we have PacBio which we think has a relatively high error rate, and at the other ‐‐ and sort of thinking about these boats racing against each other, at least that’s what I thought about this slide ‐‐ we have Illumina and Roche sort of in the middle ground and this is not a linear scale. And then Sanger is getting fairly accurate and then SOLiD is probably the most accurate of the day, and then we have these new technologies coming in and we don’t even know where the accuracy is with Ion Torrent and many of the others. This accuracy information is sketchy.

So in terms of sequencing things of interest, there are several approaches

to getting fragments you’re interested in. One is the sequence capture approach. You can do this on an array‐based system like the one NimbleGen offers, and in essence, you are trying to capture the molecules that are on the left here, the molecules that are in blue. You hybridize them to oligonucleotides on an array, and then later on you elute them and sequence them on your favorite massively parallel sequencer.

You can do the same thing in solution, except the capture probes are in

solution. Each method has its advantages and disadvantages, and certainly, there’s a very recent review by Teer and Mullikin that I would point the viewers to if you’d like to read a little bit more.

What I’m going to be talking about today though is more on the

amplification side and we certainly have more fairly traditional approaches, if you want to call next‐generation sequencing traditional these days. But PCR and then pooling, that’s frequently used in the metagenomics field. There’s also the emulsion PCR solutions. You can think about the RainDance platform for that.

And what I’ll be talking about more is the nanofluidics solution to this. We’ve been using the Fluidigm Access Array, and we looked at some miRNA loci. Today I’ll be talking about EGFR and MET in the NCI‐60 cell lines. We’ve also started setting up some cancer gene panels.

In this EGFR and MET example I’m going to show you today, we

designed amplicons across the exons of these two genes, showed that we could amplify them. They are run out on an Agilent biochip.

And then we took those amplicons into the NCI‐60 cell lines which go

across the top of the full access array there and 48 amplicons along the side, and then using a tagging strategy, we were able to introduce barcodes into each of the NCI‐60 cell lines so we’re able to do a large number of PCRs all at once on a fairly small volume.

And the second, a partially filled access array, you can see one sample,

MDA. That’s a breast cancer cell line where we have additional amplification, where the EGFR receptor amplified up earlier than MET.

This is consistent with knowledge of this cell line where there’s roughly

30 copies of EGFR in that cell line. Slide 15 We then harvested the amplicons from the access array, ran them out

on a biochip. You can see the various‐sized products there, also trace down on the bottom corner.

[0:10:04] Slide 16 And we took those samples, ran them on the Roche 454 sequencer and

looked at the sequence lengths that we obtained and how many of them were on target in red and how many were of all in blue. And you

can see we got some small sequences that were off target but really not that many.

We can also look at the average sequence coverage per sample and we

were able to get roughly at least 100 on average for each amplicon across the 60 cell lines shown across the bottom of this slide.

Another way to look at the same data is how many of the reads were

on target. It was roughly 91%. About 2% of them we had a barcode mismatch and just didn’t pursue those any further. We also had primer mismatches at about 8%, probably consistent with what we’ve seen elsewhere where primers have some missynthesized nucleotides in them.
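The read accounting described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 4-base barcodes, the primer sequence, the sample names, and the barcode + primer + insert read layout are all hypothetical, and the talk does not say how its categories were actually computed.

```python
from collections import Counter

# Hypothetical 4-base barcodes and a hypothetical target-specific primer;
# real barcodes and primers are longer and are not given in the talk.
BARCODES = {"ACGT": "cell_line_1", "TGCA": "cell_line_2"}
PRIMER = "GATTACA"

def classify_read(read):
    """Return (category, sample) for a read laid out as barcode + primer + insert."""
    sample = BARCODES.get(read[:4])
    if sample is None:
        return "barcode_mismatch", None
    if read[4:4 + len(PRIMER)] != PRIMER:
        return "primer_mismatch", sample
    return "on_target", sample

def summarize(reads):
    """Fraction of reads falling in each category."""
    counts = Counter(classify_read(r)[0] for r in reads)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

reads = [
    "ACGTGATTACAGGGG",  # known barcode, exact primer
    "TGCAGATTACATTTT",  # known barcode, exact primer
    "ACGTGATTTCAGGGG",  # known barcode, one primer base wrong
    "AAAAGATTACACCCC",  # unknown barcode
]
fractions = summarize(reads)
print(fractions)
```

A real pipeline would usually tolerate a small number of mismatches rather than require exact matches, and would also check that the insert maps to the intended amplicon.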

The bottom figure shows the distribution of coverage across all the

amplicons, just to put it out there for viewers to see. Slide 19 And then examining the variants that we’ve found, we saw a total of

3840 variants for EGFR and you see they’re broken into a number of categories there. 1400 of them roughly were new, and for MET, there were almost 6000 and 832 of them were new. Some of them were non‐synonymous, splice sites, synonymous. I won’t go through all the categories there, but clearly, this was a very good way to find many variants within these two genes, and so we’re fairly happy with those results.

One of the challenges you have with this kind of data is try to make

sure that you trust the data you’re getting, so one of the things we did was to put in four HapMap samples and look at the known variations in those samples, and we saw 100% concordance between the HapMap variants and the results of the sequencer. Of course, that’s a nice indication that the study worked well when you’re interested in specific variants and trying to validate them. There are a number of ways to do that these days. Of course, Sanger is the gold standard and we do that in many cases for various investigators. But it is moving to next‐generation sequencing and I would point to the recent paper by Kan, et

al. in Nature where they used the Roche platform to validate some of the variants they had seen. Others have certainly been using the Illumina short‐read platform for the same purpose.

This kind of work certainly involves a number of people. It’s sort of hard

to pick anyone out so I won’t, but I will say that it was work with a great team to get that particular project done. And with that, I’ll say thank you.

Sean Sanders: Great. Thank you very much, Dr. Smith. I’m going to just give you a very

quick question that came in asking what EGFR and MET are. So maybe you can just give a very quick explanation.

Dr. Michael Smith: MET is an oncogene, and EGFR is the epidermal growth factor receptor. Slide 22

Sean Sanders: Receptor. Okay, great. Slide 23 So our next speaker today is Dr. Francois Vigneault. Dr. Vigneault

completed his undergraduate and graduate training at Université Laval in Québec City, Canada. He is currently a postdoctoral fellow in the laboratory of Dr. George Church at Harvard Medical School in Boston, Massachusetts. His focus is on developing innovative solutions for low cost and high throughput sequencing for various biological problems. He is best known for developing one of the first protocols for multiplex next‐generation sequencing of microRNAs, but he is currently focused on new technologies for the sequencing and analysis of the immune system.

Dr. Vigneault, welcome.

Dr. Francois Vigneault: Thank you, Sean. One of the main goals in the Church lab is to develop technology to

tackle the most compelling biological problems, and one of those problems, we think, is understanding the immune response, or how antibodies are made.

And in the scope of the Personal Genome Project, I guess the goal is if

you can understand the immune response of someone, you could tackle the memory of all the infections they had and you could also build a table of all antigens against all antibodies and use that as a diagnostic tool or potential therapeutic tool. And today, I’m just going to show you a little piece of what we’ve been doing and all the technology we’re developing around that field.

One of the concepts that’s been used in classical immunology is

injecting an animal model with an antigen and then you pull from the serum, you get an antibody, and this antibody works to bind an antigen except nobody really can understand why it works. It’s just evolution. It’s very efficient. It works well. So we decided to look at this problem in a different direction. Basically, could you sequence all the antibodies that are made into that model and then see what kind of immune response is building up? And since you’re going to do this in a model, we might as well do it in human 'cause we have the capability to do so.

[0:14:56] Slide 26 So just a rundown, this is just a basic of what we did initially. It’s

basically we focused on B cells. We wanted to tackle the antibodies made by the B cells. And here you have a schematic of an antibody, which is composed of a heavy and a light chain and a constant region. The constant region defines the isotype, whether it’s going to be IgM, IgD, IgG, IgA, or IgE, and the initial step was to sequence just the heavy chain.

So to understand why we did it the way we did with the 454

technology, basically during the process of activation and maturation of your B cell, you have the VDJ locus, which has 46 V regions, 23 D regions, and around 6 J regions, and a constant region. And the germline is going to splice, basically by picking one block from each of these regions and putting it into the DNA of your mature B cell.
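As a quick check on the scale involved, the segment counts quoted here can simply be multiplied. The sketch below uses only the numbers from the talk (with "around 6" J regions taken as exactly 6) and counts segment choices alone, before any nucleotides are added or removed at the junctions.

```python
# Segment counts quoted in the talk for the heavy-chain locus.
V_SEGMENTS = 46
D_SEGMENTS = 23
J_SEGMENTS = 6  # "around 6" in the talk, taken as 6 here

# One V, one D, and one J are joined, so the choices multiply.
combinations = V_SEGMENTS * D_SEGMENTS * J_SEGMENTS
print(combinations)  # 6348
```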

And during that process of VDJ recombination, the splicing sites

between the V and the D and the D and the J, there is addition and

removal of nucleotides to create a supervariable CDR3 region, and this region is critical for antibody binding and affinity. So this shows here that you cannot break this segment, which is around 425 base pairs, into small tags, because you will end up sequencing the V region and the J region separately, but you want to know that they belong to the same starting molecule. So 454 was really the obvious technology to start doing this kind of work.

And then you have the light chain also that is made to make that

antibody, but the initial step was just to see if it was feasible to sequence the heavy chain only because we didn’t know how many are made in the immune system and if we can make sense of the data. So this is what we tried to accomplish.

So just to round into the experimental design, you basically take the

closest caveman you can find, in our case it’s George, and you basically gave him ‐‐ we gave him a multiple vaccine 'cause we want to make sure he would get an immune response. So we gave a flu vaccine of a five‐strain and we gave him hepatitis booster vaccine. And then we took at different time points ‐‐ blood draw at different time points. We took a blood draw two weeks before the vaccine, we took a blood draw an hour after the vaccine, and then 1 day, 3 days, 7 days, 14 days, 21, and 28 days after the vaccine.

So from this blood sample, you basically extract the lymphocyte and

then you extract the RNA, and using a set of primers specific to the C region, you reverse transcribe the heavy chain and you use a multiplex PCR with primers for all the V regions and the C region to try to amplify the B cell immunoglobulin RNA transcripts. And then you put that into your pipeline for 454 sequencing and then you get tons of data and then you start crying 'cause you get so much data that you don’t know how to handle it and it’s very complex to analyze.

And here, I have a picture of Uri Laserson. He’s been working full time

on this with me. He’s been the guy doing all of the analysis of this. He’s very talented.

Basically, the very first thing you do, you look against the IMGT

database to see which V or J segment you have so you can partition all your reads into subgroups, and then you extract the central region

which has the CDR3 which is supervariable and then you cluster based on this. So at the end of the day, all your 5.5 million reads end up being clustered in subgroups of very likely similar sequence coding for the same antibody, and then you can move on into your biological analysis.
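The two steps described here, partitioning reads by V and J segment and then clustering on the CDR3, can be sketched as follows. This is not the pipeline used in the study: the greedy clustering, the Hamming-distance threshold, and the toy sequences are assumptions chosen for illustration, and the V/J labels are assumed to come from an earlier alignment step against a germline reference.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing positions; sequences of unequal length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_mismatches=1):
    """Group reads by (V, J), then greedily cluster CDR3s within each group.

    `reads` is a list of (v_segment, j_segment, cdr3) tuples.
    """
    groups = defaultdict(list)
    for v, j, cdr3 in reads:
        groups[(v, j)].append(cdr3)

    clusters = []
    for (v, j), cdr3s in groups.items():
        group_clusters = []
        for cdr3 in cdr3s:
            # Join the first cluster whose seed is close enough, else start one.
            for cluster in group_clusters:
                if hamming(cdr3, cluster["seed"]) <= max_mismatches:
                    cluster["members"].append(cdr3)
                    break
            else:
                group_clusters.append(
                    {"v": v, "j": j, "seed": cdr3, "members": [cdr3]}
                )
        clusters.extend(group_clusters)
    return clusters

# Toy reads; the CDR3 strings are made up for illustration.
reads = [
    ("IGHV1-69", "IGHJ4", "TGTGCGAGA"),
    ("IGHV1-69", "IGHJ4", "TGTGCGAGG"),  # one mismatch from the first read
    ("IGHV1-69", "IGHJ4", "TGTAAATTT"),  # same V/J, different CDR3
    ("IGHV3-23", "IGHJ6", "TGTGCGAGA"),  # same CDR3, different V/J
]
clusters = cluster_reads(reads)
sizes = sorted(len(c["members"]) for c in clusters)
print(sizes)  # [1, 1, 2]
```

Partitioning first keeps the pairwise comparisons within small groups, which matters when the input is millions of reads.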

The first thing we wanted to look is if it was feasible to make sense of

all those data, and here, you have a plot of two sequencing replicates and one technical replicate. And the scariest thing we saw when we got this is only 3% at the center: only 3% of the clones were overlapping among those three runs. And you would expect sequencing replicates, which is sequencing the same library prep twice, to be pretty much the same, but they’re not. It’s just a bit scary to see that there is so much diversity made.

But the interesting thing is if you look at for each clone how many reads

they ended up being, you see that that 3% represent around 60% of all those reads. Just to show that the highly expressed reads, you’re more likely obviously to find them between runs. And since we were tackling immune response, we’re mainly after a highly expressed antibody 'cause your body is trying to fight that immune challenge. If you were going to go after a rare antibody, it would be even more tricky. You need to sequence a lot.
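The contrast between the fraction of clones shared across runs and the fraction of reads those clones carry can be made concrete with a small calculation. The clone identifiers and read counts below are invented toy data, not the study's numbers.

```python
from collections import Counter

def shared_clone_stats(runs):
    """Return (fraction of clones seen in every run, fraction of reads they carry).

    `runs` is a list of Counters mapping clone id -> read count.
    """
    all_clones = set().union(*runs)
    shared = set(runs[0]).intersection(*runs[1:])
    total_reads = sum(sum(run.values()) for run in runs)
    shared_reads = sum(run[clone] for run in runs for clone in shared)
    return len(shared) / len(all_clones), shared_reads / total_reads

# Toy data: one highly expressed clone plus run-specific singletons.
run_a = Counter({"c1": 50, "c2": 1, "c3": 1})
run_b = Counter({"c1": 40, "c4": 1, "c5": 1})
run_c = Counter({"c1": 45, "c6": 1, "c7": 1, "c8": 1, "c9": 1})
clone_fraction, read_fraction = shared_clone_stats([run_a, run_b, run_c])
print(clone_fraction, read_fraction)
```

In this toy case one of nine clones is shared, yet it accounts for 135 of 143 reads, the same qualitative pattern as a small shared fraction of clones carrying most of the reads.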

Here is a plot of clone frequency. It cost us a lot of time to be able to

achieve reproducibility, with all the technical details and library prep, to be able to get such correlation. And it shows again, if you look at the top, the highly expressed clones correlate very well, and at the bottom, you get a lot of unique sequences. So it’s just a huge, huge diversity. I don’t have the figure here, but you get basically 10^6 heavy chains made in one person, not at every point.

[0:20:00] Slide 33 Here’s an interesting picture on the left that basically will just show you

the V and J usage. So it’s pretty uniform. You get some segments that are more interesting that have more reads. And on the right, that’s

basically the CDR3 size distribution. Like I said, during the recombination, you get removal of nucleotides and addition of nucleotides, and we got some segments that had nothing, like zero base pairs, and all the way up to 146 base pairs. So obviously, an antibody that has 146 base pairs is definitely not going to bind to the same thing as the one with zero base pairs. So it’s definitely interesting to see such variability. And you see the pattern also of the nucleotides coding for protein here and the pattern of the sequencing.

Then we decided to plot the clone frequency in a time course plot. So

at the bottom, you basically have day ‐14 and then 0, 1, 3, 7, 14, and 0 is really +1 hour. When we plotted this, we wished we had more samples. We wished we had like a ‐1 hour prior to vaccine and stuff like that. So we started doing this again. That’s under sequencing right now. And in the little inset at the top, you see a log view of the same data so that you can see how much diversity of sequence is made in someone. This is a single individual at multiple time points.

Classic immunology focused a lot on the day 7 especially for flu, the IgG

of day 7. And you kind of see a little something going on at day 7. Then you see something rising at day 21, and that’s likely memory compartment buildup. But the most interesting thing is that at day 1, starting even one hour after the vaccine and peaking a day after the vaccine, there’s just a massive amount of sequence made, with the top clone at 16,000 sequences coding for an antibody. And most people have been focused on day 7, and obviously, there is a lot more to look into than just that day 7.

And this is likely an antigen recall because George has been vaccinated

every year and has been exposed to flu. So it’s definitely interesting to see such response.

So the conclusion of that figure is basically, you can see something

happening. It’s not just noise, you know. So there is some hope. Slide 35 So we used some clustering basically to try to make sense if we could

filter out some of the most important antibody sequence candidates. And if you look at some clustering, it’s more interestingly easy to find that potential clone that would make sense and then you can filter out

and replot those time frequency, and now I have a clear picture of the day 7 clone, day 21, and the day 1 clone.

So we decided basically to start doing some functional screening on

this, and basically, I’m not showing the data 'cause it’s not finished. But basically, when you couple it with a light chain, we’ve been expressing the proteins with an in vitro system and we looked at whether they bind to the antigens we gave to George. And more than half of them have been binding to the flu, and for the other half, we haven’t put our hands on the hepatitis antigen yet. So we’re hoping that those will bind, but our success rate is pretty, pretty high compared to what we were expecting, so we’re quite happy.

From this basically, we have a lot of other sub‐technology coming out.

Our goal is truly to make a lookup table of antigen against antibody. So what we want to do is have a heavy and light chain coupling at the single cell level and millions of cells at once. So this is what we’re really working on now that we know that sequencing just the heavy chain showed some promise. So now, we want to sequence the full thing in a single cell, and hopefully, we’ll achieve that in the near future.

On that, I’d like to thank George for his blood and moral support. He got

so many needles in his arm over the past two years from us. It’s just painful to watch. Uri Laserson who did all the analysis pipeline, really talented. He did really good. Ido Bachelet has been involved a lot in the functional screening of the antibody at the protein level. And I have Srivatsan “Bob” who has been doing some plotting and modeling with us, and all the people at 454, Michael Egholm, Birgitte Simen, have been great on their team.

Thanks.

Sean Sanders: Great. Thanks, Dr. Vigneault. Slide 38 It’s great to see George playing such a physically active role in his

research. Excellent. Slide 39

So our final speaker today is Dr. Henry Erlich. Dr. Erlich received his B.A.

in Biochemical Sciences from Harvard University and his Ph.D. in Genetics at the University of Washington. He then completed successive postdoctoral appointments at Princeton University and Stanford University. Dr. Erlich joined Cetus Corporation and served as Director of the Department of Human Genetics where he pioneered the development and application of PCR. He moved to Roche Molecular Systems in 1991 as Director of Human Genetics, becoming Vice‐President of Discovery Research in 2000.

Since 1991, Dr. Erlich has also been a researcher at the Children’s

Hospital Oakland Research Institute. His research at Roche Molecular Systems involves analysis of human genetic variation and genetic predisposition to a variety of diseases, with a focus on autoimmune diseases and the development of HLA typing tests.

Dr. Erlich, thanks for being here.

[0:25:06] Dr. Henry Erlich: Thanks, Sean. What I’d like to do today is focus on the application of next‐generation

sequencing to the analysis of the extensive allelic diversity that’s found at the HLA loci.

So the genes in the HLA region are actually the most polymorphic genes

in the human genome, and as you can see from this slide at the right end, if you look at the class I genes A, B, and C, for many of these loci there are well over 1000 different allelic variants; in the case of HLA‐B, over 1600. That can be compared to the number of serologic specificities, 68, shown in green, that can be distinguished by serologic typing.

So in addition to the class I loci, there are also the highly polymorphic

class II loci where the two genes, the A1 and the B1 gene, encode the alpha chain and the beta chain that form a cell surface heterodimer, and these are also highly polymorphic. And in particular, I want to point your attention to the DR region where there is a B1 locus that has 785 different allelic variants, and all chromosomes have a B1 locus, but

most but not all chromosomes have a second DRB locus which is either DRB3, DRB4, or DRB5, and I’ll return to that later.

Now, one of the characteristics of this HLA polymorphism is that

polymorphism is localized to the exons that encode the peptide‐binding groove shown here in this structural model. So on the right is a model for the class II HLA antigen DR, and on the left for a class I antigen, HLA‐A. And as I said, most of the polymorphic codons encode amino acids that point into the groove and interact either with a peptide or with a T‐cell receptor.

Now, HLA matching as many of you know is absolutely critical to the

success of bone marrow transplants, and here is a slide which shows survival for 8 out of 8 matched alleles. That’s looking at four loci, HLA‐A, B, C, and DR, and 8 out of 8 means that both alleles at all four loci are matched between the donor and the recipient. And you can see that if there’s a single allelic mismatch, 7 out of 8, or two, 6 out of 8, the survival rate for the recipient is significantly reduced. And a mismatch can be as little as a single amino acid difference between the donor and the recipient.
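The "8 out of 8" bookkeeping can be written down directly: two alleles at each of four loci, compared between donor and recipient. The sketch below is a simplified illustration; the typings are hypothetical, and it ignores the direction of a mismatch and any clinical weighting between loci.

```python
def count_matched_alleles(donor, recipient, loci=("A", "B", "C", "DR")):
    """Count matched alleles across loci, two alleles per locus (maximum 8).

    Each typing maps a locus name to a pair of allele names. Alleles are
    compared as unordered pairs, so the order within a locus does not matter.
    """
    matched = 0
    for locus in loci:
        remaining = list(recipient[locus])
        for allele in donor[locus]:
            if allele in remaining:
                remaining.remove(allele)
                matched += 1
    return matched

# Hypothetical four-digit typings, for illustration only.
donor = {
    "A": ("A*0101", "A*0201"),
    "B": ("B*0702", "B*0801"),
    "C": ("C*0701", "C*0702"),
    "DR": ("DRB1*0301", "DRB1*1501"),
}
recipient = dict(donor, B=("B*0702", "B*5701"))  # one allele differs at HLA-B

print(count_matched_alleles(donor, donor))      # 8
print(count_matched_alleles(donor, recipient))  # 7
```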

Now in this case as I mentioned, there are four loci, A, B, C, and DR, but

in many transplant centers, they also try to match the other class II locus, DQ, and then some the third HLA class II locus, DP.

Now, the extensive allelic diversity can also be very, very useful for

looking at human evolution in populations, and this particular phylogenetic network is based on haplotype frequencies at the HLA‐B and HLA‐C loci, and this is work done by my colleague, Steve Mack, as part of an international collaboration. And the data is quite complex, but I just want to point out that at the very bottom of this phylogenetic network are a number of African populations clustered together. And then as you can see, the next most closely related populations from a genetic point of view are Northern African and European, and then the phylogenetic network spreads out to the rest of the world, other parts of Asia and North and South America.

So the point I want to make here is simply that the allelic diversity of these loci is so great that just looking at two different loci, one can establish a very meaningful and informative phylogenetic network relating different human populations.

Now also, as many of you know, specific alleles at HLA genes are highly

associated with a variety of different diseases. Many of them are autoimmune, like type 1 diabetes, but some of them are cancers, infectious diseases, or allergic hypersensitive responses to specific drugs such as Abacavir. HLA‐B*5701 is associated with an allergic hypersensitive response to Abacavir, and HIV patients who are B*5701 should in fact get a different HIV drug.

[0:30:05] Slide 45 This just shows how strong some of these associations are. For type 1

diabetes for example, and this is the work of the international consortium, the Type 1 Diabetes Genetics Consortium, that there are highly susceptible DR‐DQ haplotypes and highly protective DR‐DQ haplotypes, and this just shows when you look at the genotype level and shown circled in red that some of these genotypes have odds ratios well over 40. So these are very significant risk factors for type 1 diabetes but also for a variety of other autoimmune diseases.

Now, one of the challenges in HLA typing comes from the fact that

these genes are extremely polymorphic and the polymorphism is localized to just a few exons. So some of the so‐called ambiguity in HLA typing, and by ambiguity I mean that the raw data is consistent with a number of different possible genotype assignments, comes from the fact that not the whole gene has been covered by the typing system, but much of it comes from the inability to set phase between closely linked polymorphisms that are identified by the typing system. And here diagrammatically, I just show that if you’re sequencing a heterozygous individual, phase ambiguity means that you are not entirely sure which genotype to assign.
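Phase ambiguity can be illustrated by enumerating every pair of haplotypes that is consistent with the same unphased calls. The sketch below is a generic illustration with made-up bases, not the logic of any particular typing system.

```python
from itertools import product

def possible_haplotype_pairs(genotype):
    """Enumerate the haplotype pairs consistent with unphased genotype calls.

    `genotype` is a list of 2-tuples, one per polymorphic position, giving the
    two bases observed at that position without phase information.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(site[c] for site, c in zip(genotype, choice))
        hap2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

# Two heterozygous positions: A/G at the first, C/T at the second.
unphased = [("A", "G"), ("C", "T")]
print(possible_haplotype_pairs(unphased))
# [('AC', 'GT'), ('AT', 'GC')]
```

With two heterozygous positions there are already two candidate haplotype pairs, and the number doubles with each additional heterozygous position. A read that spans both positions on a single molecule resolves which pair is present.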

Now, one solution to phase ambiguity is to use a clonal sequencing system such as the 454 GS FLX System shown here schematically, and the way one can achieve clonal sequencing in amplicon sequencing is to do your genomic PCR and have the primers have both MID tags or barcodes embedded in the primers as well as the A and B adaptors for 454 sequencing. Then one quantifies the amplicon, mixes it with capture beads so that on average, most of the beads contain only a single molecule, and then that single molecule is amplified many thousands of times, millions of times in an emulsion PCR. And then the beads that contain these amplified DNA fragments are recovered from the emulsion PCR and put into a pyrosequencing reaction on the 454 picotiter plate. And as many of you know in pyrosequencing in the 454 system, when a base is incorporated, it liberates a photon and that is captured by a CCD camera and that is in fact the basis for 454 pyrosequencing.

Now, when we wanted to adapt 454 sequencing to the HLA system,

there are a number of options. One could either amplify the entire gene, cleave, and ligate these adaptors. One could do sequence capture as Mike indicated earlier then cleave and ligate adaptors. But we’ve found that by far, the most robust system was to amplify the exons with the MID tags and capture A and B sequences in the primers so that once you had your library, once you’ve done your PCR, you have your library.

And as I said, the benefits of clonal sequencing are that you can set

phase for linked polymorphisms within the amplicon, but you can also use generic primers such as DRB, primers that amplify all members of the multi‐gene family, and because of the clonal sequencing aspect and because of the properties of the genotyping software, one can sort sequences and then recover the allelic variants at DRB1, DRB3, DRB4, and DRB5. And the same property also allows you, if your primers amplify pseudogenes or other related genes, the clonal sequencing and the software can separate the noise from the target sequence signal.

And so what we did in the study that was published recently by Bentley et al. in 2009, we used 14 primer pairs to sequence 8 loci and 24 samples. And so we had 12 MID tags and we used two regions of the picotiter plate so that we could do these 24 samples in a single run. And shown here are the exons that we chose to sequence: for the class I loci, exons 2 and 3 and 4; and for most of the class II loci, exon 2.

And shown here is the actual analysis of these sequence files generated by the 454 system, and we were fortunate to work with the HLA genotyping software company called Conexio Genomics, and this shows the output of the system.

[0:35:09] So the software consolidates the sequences and sorts them according to genomic primer sequence, that is, to the exon of a particular locus, and based on the MID tag, to a particular individual. And shown under the exon 2 marker, you can see that there are, I think it’s 82 sequences of one allelic variant in the forward direction and a number of forward sequences of the other allelic variant, and then it also shows the number of sequences that are recovered for the two reverse sequence reads for the allelic variants.
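
The sorting step described above can be sketched in a few lines. This is only an illustration of the idea, not the Conexio Genomics software: the MID tag and primer strings, the sample and amplicon names, and the function names `assign_read` and `consolidate` are all placeholders invented for the example, and real reads would need tolerance for sequencing errors in the tag and primer.

```python
from collections import Counter

# Placeholder MID tags and primer prefixes (illustrative values only).
MID_TAGS = {"ACGAGTGCGT": "sample_01", "ACGCTCGACA": "sample_02"}
PRIMERS = {"GGTCACAT": "HLA-A_exon2", "CCTGTTGA": "HLA-B_exon2"}


def assign_read(read):
    """Return (sample, amplicon, insert), or None if tag or primer is unknown."""
    for tag, sample in MID_TAGS.items():
        if read.startswith(tag):
            rest = read[len(tag):]
            for primer, amplicon in PRIMERS.items():
                if rest.startswith(primer):
                    return sample, amplicon, rest[len(primer):]
    return None


def consolidate(reads):
    """Count identical insert sequences within each (sample, amplicon) bin."""
    bins = {}
    for read in reads:
        hit = assign_read(read)
        if hit is None:
            continue  # unrecognized tag or primer: discard the read
        sample, amplicon, insert = hit
        bins.setdefault((sample, amplicon), Counter())[insert] += 1
    return bins
```

Each bin then holds the read count per distinct sequence, which is the kind of per-variant tally (for example, 82 forward reads of one allele) that the speaker points to on the slide.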

And shown here is the unique assignment: once it’s aligned all these sequences, it compares them with the HLA sequence database, and in this particular case, there’s a unique sequence assignment that shows zero mismatches with the sequence database. And the nomenclature is such that the first four digits involve non‐synonymous coding variants. The fifth and sixth digits are synonymous. So typically, we refer only to the first four digits because they’re the ones that are biologically important.
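
As a small illustration of the digit convention just described, the sketch below splits an allele name into its four‐digit part and its fifth and sixth digits. It assumes names written in the style used in this talk (locus, asterisk, digits); the example strings in the usage note and the function names `split_allele` and `same_protein` are hypothetical.

```python
def split_allele(name):
    """Split a name such as 'B*390501' into locus, four-digit and synonymous parts."""
    locus, digits = name.split("*")
    return {
        "locus": locus,
        "four_digit": digits[:4],   # non-synonymous coding variants
        "synonymous": digits[4:6],  # synonymous variants; empty if not given
    }


def same_protein(a, b):
    """True when two names agree on locus and on the first four digits."""
    x, y = split_allele(a), split_allele(b)
    return x["locus"] == y["locus"] and x["four_digit"] == y["four_digit"]
```

Under this convention, two made‐up names such as `B*390501` and `B*390502` would be treated as the same four‐digit type, while `B*3905` and `B*3902` would not.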

This shows class II assignment of the same cell line, and again, here you can see that there are two different genotype assignments that show zero mismatches with the database, and this is because the distinction between alleles 0705 and 0706 is based in exon 5. So that’s why in this particular case, we’re not distinguishing those variants.

I should point out that all of these studies so far were done with GS FLX and the standard GS FLX chemistry, which is capable of reading 250 base pairs. As many of you know, the Titanium system has now been introduced for amplicon sequencing. That is capable of reading up to 450 base pairs. And in addition to the GS FLX instrument, there is now a new small instrument known as GS Junior, a desktop sequencing instrument that is capable of generating the same kind of reads but generates only about 1/8 of the number of reads, and it is much smaller and less expensive and so very practical.

But the reason I pointed out Titanium is that we’ve now expanded our coverage, so instead of just exons 2, 3, and 4, we do exons 1 through 5 on most class I loci shown here. And now, because we can get longer sequence reads, we can combine exons 1 and 2 as a single amplicon. And here, you can see we can make the distinction between 0705 and 0706 because we’re now capturing reads for exon 5.

Now, recently we just completed an alpha study with eight different sites, many of whom had never used 454 or the Conexio genotyping software before, and shown here in this table is the genotype concordance and allele level concordance for the various loci. So we were actually very pleased with the robustness of our typing system and the performance of the participating labs. So you can see for genotype concordance there’s 97%, and for allele concordance there’s 98%.

Slide 56 I think in the interest of time, I will skip that slide and show one specific application. One of the powers of next‐generation sequencing is the ability to do ultra‐deep sequencing and identify sequence variants that are present at perhaps 1% or even less in some cases.

So we looked at a SCIDS patient, a severe combined immunodeficiency syndrome patient, with samples provided by my colleague Beth Trachtenberg from Children’s Hospital Oakland, and we wanted to see if we could detect the presence of maternal cells in the circulation. And the marker for maternal cells would be the non‐transmitted maternal allele, and shown here for HLA‐B, you can see that the recipient of the transplant, the SCIDS patient, received 3905 from the mother and 3902 from the father. So 3512 was a non‐transmitted maternal allele, and we actually were able to find this in about 1% of all the sequence reads.

And this is true also when we looked at HLA‐C. 0401, which is a non‐transmitted allele, was also found in about 1%.

Slide 58 And so we could estimate then that about 2% of all the cells in this child’s circulation were of maternal origin.

Slide 59 Now, the last two slides that I want to show illustrate the power of Titanium not only to do longer reads but also to recover more reads than the standard GS FLX. There are three times as many wells in the Titanium picotiter plate and so the number of sequence reads that can be recovered in a single experiment is much greater.
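
The step from about 1% of reads to about 2% of cells in the SCIDS example can be written out explicitly. This is a back‐of‐the‐envelope sketch under stated assumptions, not the speaker’s own calculation: every cell is taken to carry two copies of the locus, each maternal cell to carry exactly one copy of the non‐transmitted allele, and the alleles to amplify with equal efficiency. The function name is invented for the example.

```python
def maternal_cell_fraction(nontransmitted_read_fraction):
    """Estimate the fraction of circulating cells that are maternal.

    If a fraction m of cells is maternal, the non-transmitted allele
    makes up m * 1 out of every 2 allele copies, so f = m / 2 and m = 2 * f.
    """
    if not 0.0 <= nontransmitted_read_fraction <= 0.5:
        raise ValueError("read fraction must lie between 0 and 0.5")
    return 2.0 * nontransmitted_read_fraction
```

With a read fraction of 0.01, as reported for both HLA‐B and HLA‐C, the function returns 0.02, matching the estimate quoted in the talk.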

[0:40:18] So this was a collaboration with the Type 1 Diabetes Genetics Consortium and we ran 500 samples in a single run. For these samples we used the DRB primers with 32 different MID tags and 16 regions of a GS FLX picotiter plate, and we were able to show that there are really two different kinds of DR3 haplotypes. DR3, DRB1*0301, is a high‐risk haplotype for type 1 diabetes, but at the second locus, they can carry either 0101 for DRB3 or 0202 for DRB3. It turns out that the DR3‐containing haplotypes that have DRB3*0202 are at higher risk. This just shows the long reads with the Titanium sequencing and this shows the comparison: in a DR3 homozygote that has 0202 at the DRB3 locus, the odds ratio is 25. If it has 0101, the odds ratio is about 3.4, so the risk is significantly higher if they carry the DRB3*0202 allele.

So finally, I’d like to acknowledge the contributions of my many collaborators, colleagues at RMS, at Conexio Genomics, at 454 Life Sciences, and also at Roche Applied Science. Thank you.

Sean Sanders: Great. Thank you very much, Dr. Erlich.

Slide 63 Thank you all for the very interesting and informative presentations and we’re going to move on right away to the Q&A portion of our webinar. Just a quick reminder to everyone, you can still submit your questions by just typing them into the ask‐a‐question box and clicking the submit button.

So the first question I’m going to throw out to the panel is: most commercial multiplexing kits seem to be designed to use about 12 different barcodes. Can you give some advice on how to multiplex, say, 96 samples? So I want to start with you, Dr. Smith.

Dr. Michael Smith: Well, I think we were using 96 in some of the other experiments in the lab, but I think the advice would be to not sequence. Make sure you have enough real estate on the sequencer if you’re going to do 96 and to pool very carefully or in a very equimolar fashion. Of course, the barcodes are available from Roche if you’re on that particular platform.

Sean Sanders: Uh‐hum. Dr. Vigneault?

Dr. Francois Vigneault: If you want to save on money, you know, and you don’t want to order, let’s say, a plate with 384 different barcoded primers, you can start like crossing. You could have a set of barcodes on one end of your transcript and cross it with another set so that you can reuse those barcodes in some fashion. But yeah, the real estate is a big issue. With microRNA sequencing, we did that mainly on the Illumina System, and if you put more than 10 samples, you’re going to end up being short on how many reads per sample you get. So you got to be careful how many samples you’re putting on.
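
The “crossing” idea mentioned above, where one small set of barcodes is combined with a second set, can be sketched as follows. This is an illustrative layout helper only; the tag names (`F1`, `R1`, and so on), the sample names, and the function `dual_barcode_layout` are hypothetical, and the choice of 8 by 12 is just one way to reach 96.

```python
from itertools import product


def dual_barcode_layout(forward_tags, reverse_tags, samples):
    """Assign each sample a unique (forward, reverse) barcode pair."""
    combos = list(product(forward_tags, reverse_tags))
    if len(samples) > len(combos):
        raise ValueError("not enough barcode combinations for the samples")
    return dict(zip(samples, combos))


# 8 forward tags crossed with 12 reverse tags give 96 unique pairs,
# so 20 barcoded primers cover a 96-sample plate.
forward = [f"F{i}" for i in range(1, 9)]
reverse = [f"R{i}" for i in range(1, 13)]
plate = [f"sample_{i:02d}" for i in range(96)]
layout = dual_barcode_layout(forward, reverse, plate)
```

The point of the design is that the number of primers to order grows with the sum of the two sets, while the number of distinguishable samples grows with their product.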

Sean Sanders: Uh‐hum. Dr. Erlich, anything to add?

Dr. Henry Erlich: Well, in addition to using multiple barcodes, of course you can use multiple regions of the picotiter plate. So in the example that I pointed out, we sequenced 500 samples in a single run of the GS FLX instrument with 32 barcoded primer pairs and 16 regions, but you know, as both Francois and Mike have pointed out, it’s critical to ensure that you have sufficient coverage of each sample for each amplicon to get adequate genotyping.
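
The arithmetic behind these run designs is simple enough to check directly. The sketch below multiplies MID tags by plate regions to get the number of samples a run can hold, and divides a run’s reads across samples and amplicons to get the average coverage. The function names are invented for the example, and any total read count passed in is an assumption, since the talk does not give one.

```python
def run_capacity(mid_tags, plate_regions):
    """Samples that can be distinguished in one run: tags times regions."""
    return mid_tags * plate_regions


def mean_reads_per_amplicon(total_reads, samples, amplicons_per_sample):
    """Average reads per sample per amplicon, assuming perfectly even pooling."""
    return total_reads / (samples * amplicons_per_sample)
```

With the figures from the talk, 12 tags and 2 regions give the 24 samples of the Bentley et al. study, and 32 tags and 16 regions give 512 slots, enough for the 500 samples in the type 1 diabetes run. Real coverage will be less even than the average suggests, which is why the panel stresses careful pooling.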

Sean Sanders: Uh‐hum. Okay. I’m going to stay with you, Dr. Erlich. We have a question asking about bias. Did you notice any read coverage bias resulting from sample pooling methods: equal volume, equal mass, or equimolar?

Dr. Henry Erlich: Well, for HLA sequencing and HLA typing, there are a number of challenges. One is that, since we have 14 different amplicons or 14 primer pairs, and actually more amplicons because the DRB primers amplify DRB1, 3, 4, and 5, it is very important when you pool to get equal sequence recovery to the extent possible of all the different amplicons, because of course, your genotyping is limited by the amplicon for which you recover the fewest sequence reads. So that’s critical. And for HLA, you also want to make sure that there’s minimal allele bias, that in a heterozygous individual, you don’t recover many more copies of one allele than the other.

But it is possible. To achieve that, you have to design the primers appropriately and design the PCR conditions appropriately. And if, for example, you’re getting more reads for one amplicon than another, when you do the pooling and you go into emulsion PCR, you can put in more of the amplicon for which you get the fewest reads.
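
One way to act on that advice is to scale each amplicon’s input in the next pool inversely to the reads it yielded in the previous run. The sketch below does that under a strong simplifying assumption, namely that read yield is proportional to the amount pooled; the function name and the example numbers are illustrative and not taken from the talk.

```python
def rebalance_pool(input_amounts, observed_reads):
    """Return adjusted input amounts that aim for equal reads per amplicon.

    Assumes reads scale linearly with the amount of amplicon pooled.
    """
    if any(count <= 0 for count in observed_reads.values()):
        raise ValueError("every amplicon needs at least one observed read")
    target = sum(observed_reads.values()) / len(observed_reads)
    return {
        name: input_amounts[name] * target / observed_reads[name]
        for name in observed_reads
    }


def limiting_amplicon(observed_reads):
    """Name of the amplicon with the fewest reads, which limits genotyping."""
    return min(observed_reads, key=observed_reads.get)
```

For example, if amplicon A gave 300 reads and amplicon B gave 100 from equal inputs, the function suggests doubling B and reducing A to two thirds of its previous amount.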

[0:45:15] Sean Sanders: Uh‐hum.

Dr. Francois Vigneault: The bias ‐‐ I’ve been trying to publish our paper lately. We got rejected pretty badly on the first run so I hope I’ll get lucky on the second. But again, it’s microRNA sequencing. We compared barcoding by ligation and barcoding by PCR, and with barcoding by ligation, no matter where you put your barcode, you end up having some bias. MicroRNAs are a good sample for that because they’re fairly small. There’s 1000 microRNAs per cell so you do see that bias. If you’re going to do like RNA‐seq, you’re probably not going to see that kind of bias as easily as we did with microRNA. And it seems that it’s related to the ligase being picky on which sequence it’s using, so you cannot develop a set of barcodes that works better 'cause you got no control on the receiving end of the genomic DNA, you know, and that has a different sequence so there’s definitely a bias. And the PCR was much better but it’s got some issues depending on multiplexity.

Sean Sanders: Okay. I’m going to come to you, Dr. Smith, for this one. Which technique do you think is the gold standard for ensuring equimolar pooling of multiple samples for a library?

Dr. Michael Smith: I think we use two. One is PicoGreen to get very good concentrations, and sometimes we look at the fragments on the very sensitive Agilent biochip assays. That’s what’s in use in the lab.
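
A dye‐based assay such as PicoGreen reports mass concentration, so equimolar pooling needs a conversion that accounts for amplicon length. The sketch below uses the common approximation of 660 g/mol per base pair of double‐stranded DNA; the function names, the target amount, and the example concentrations are assumptions made for illustration.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def nanomolar(conc_ng_per_ul, length_bp):
    """Convert ng/ul of a dsDNA amplicon to nM (equivalently fmol/ul)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons, fmol_each):
    """Microliters of each amplicon needed to contribute fmol_each femtomoles.

    amplicons maps a name to (concentration in ng/ul, length in bp).
    """
    return {
        name: fmol_each / nanomolar(conc, length)
        for name, (conc, length) in amplicons.items()
    }
```

Two amplicons at the same ng/µl but different lengths therefore need different volumes: the longer one has fewer molecules per nanogram, so more of it goes into the pool.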

Dr. Francois Vigneault: I’ve been struggling a lot. I think I’ve tried every possible version of pooling. That has always been a nightmare for me just to get that, and like especially I do a lot of gel extraction in my library prep and this is where I lose a lot of different regions of the sample. The PicoGreen is a good way. It’s been working pretty good. Lately, I’ve been using the KAPA quantitation kit. That’s been working extremely well. They have kits developed for all the companies and all the systems. It’s been pretty good. It’s a bit more pricey though obviously 'cause you’re doing real time.

And very recently, I’ve been doing all my immune stuff using the new rapid 454 system, which has a FAM label on the 5’ end. So when you ligate those adaptors or, if you do amplicons, you could design your primer the same way. You can just put that on a plate reader and that tells you how many of each you have so you’re kind of shooting the PicoGreen assay right from the start. And we’re waiting on the data so I don’t know how good it’s going to be, but it looks like the gel and the bioanalyzer were correlating with the plate reader so I have good hope for that.

Sean Sanders: Okay. Another one for you, Dr. Smith. How would the error rates of the different sequencers that you’ve used affect the results that you showed?

Dr. Michael Smith: Well, I mean, the results we showed, we made sure that we had multiple reads that supported those variants so we’re very confident they’re there. Obviously, with sequencers with very high error rates, you have to build the error into the analysis, and one of the challenges of the field is really the downstream, taking the flour that you might get off the sequencer and turning that into bread. So the error rates are pretty important.

Sean Sanders: Okay. Coming to you, Dr. Erlich. What are the advantages of amplicon resequencing over target enrichment protocols in your opinion?

Dr. Henry Erlich: Well, I think for one they’re much more robust, and as Francois just mentioned, just introducing the MID tags or the barcodes is a more robust system if they’re introduced in the primer rather than introduced in ligation. I think the other advantage is it’s quicker; you know, the capture takes, I think, an overnight hybridization or it’s a long hybridization. So it’s much quicker and I think there’s great potential for multiplexity.

Now, it depends. I’m sure there’s a tradeoff where, you know, rather than running, you know, 300 or 400 different PCRs, it may be more advantageous at that stage to capture. But in the particular region that I’m interested in, the very highly polymorphic HLA loci, sequence capture is a bit of a challenge because if you just have a reference sequence on the chip or if you do in‐solution selection, you may not be able to capture with equal efficiency all the different allelic variants. But I think obviously they both have a place.

Sean Sanders: Great. Someone has sent in a question asking the panel to talk about the difference between single‐end and paired‐end sequencing. Who would like to take it? Dr. Vigneault?

Dr. Francois Vigneault: I did both so it depends on your application. So the paired‐end, I’ve been using it mainly on Illumina. So if you have, let’s say, my immune system ‐‐ so we’re developing a pipeline for Illumina as well, except it’s going to fall short. I need 450 base pairs just for the heavy chain, which works in one shot on the 454 and makes analysis a lot easier.

[0:50:05] So we’re going to try to do it on Illumina so we got a set of primers that are in the works, but we’re going to be missing a little section so now you need to try to find that section. If you end up being in the CDR3 region, I think that’s a big issue because this is the most important region. The analysis is a bit more complex, but there’s good software out there to do it. But it totally depends on your application. Like for microRNAs, it’s 22 base pairs and you don’t need paired‐end obviously.

Sean Sanders: Right, right. Actually, related to that, a question came in asking, how are the repeats organized around the VDJ recombination regions and how long are they compared to the 454 sequence reads?

Dr. Francois Vigneault: It’s not so much about repeats. I’m not quite sure if he’s referring to the V segment or whatnot. Those segments are different enough, the 46 V segments, so you can definitely align them to the biodatabase and know which V you’re dealing with. We’re dealing with RNA so everything is in the V, D, and J. The variable region is almost unknown. It’s just so variable that we use that as a clustering basis. But the PCR was tricky, to get a primer that binds well. As soon as we start moving a primer around, it’s been missing out all the data. We’ve tried to use the BIOMED‐2 primers. I know Andrew Fire’s lab has been using those with great success on DNA. I’ve tried a few and they don’t pan out so well on my end. So the primer design is the key when you’re dealing with variable regions, you know.

Sean Sanders: Okay. Next question, how do you decide which sequences are real and which ones are likely to be sequencing error? And the viewer asked if there are any publications you can point them to.

Dr. Francois Vigneault: Well, I think the stupid easy answer is you pick the one that fits your story and you hope to publish it, but yeah, that wouldn’t be right.

Sean Sanders: All right, Dr. Smith, do you have anything?

Dr. Michael Smith: Well, I think in some cases, especially with the error rate issues, you count on seeing this particular sequence enough times to be sure that it’s actually in the sample as opposed to an error.

Sean Sanders: Okay.

Dr. Henry Erlich: Yeah. I think we may have a special case because we’re typically comparing our sequence reads to an existing database. So if it doesn’t match the existing database, it’s either a novel HLA allele, which we’ve identified a number of, or it’s a sequencing artifact. And I guess when I say sequencing artifact, it could either be a PCR misincorporation or a pyrosequencing artifact in a 454 system.

And, you know, one criterion, as Mike points out, is the number of sequence reads you recover, but with 454, since it’s based on pyrosequencing, a number of ‐‐ there’s a systematic issue that one is aware of, that is, in homopolymeric tracts, you know, one can either get N+1 or N‐1 sometimes. And so if you’re aware of that and if you get coverage in both directions, you can usually correct that. So often, we’ve found that a run of G, say, that’s really 4 might be 5 in one direction but 4 in the other, and then if 4 matches the database, you know what the correct answer is. If you get 5 in both directions and you get lots of numbers on it, then in that case, we infer that it’s a new allele.
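
The decision rule described above, where two read directions that differ only in homopolymer length are resolved against the database, can be sketched roughly as below. This is a toy illustration of the logic, not the genotyping software’s algorithm: it works on whole consensus strings already placed in the same orientation, the labels it returns are invented, and the read‐count criterion mentioned in the talk is left out.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Run-length encode a sequence, e.g. 'AGGGGT' -> [('A', 1), ('G', 4), ('T', 1)]."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def resolve(forward, reverse_rc, database):
    """Classify a pair of consensus sequences given in the same orientation."""
    if forward == reverse_rc:
        label = "known allele" if forward in database else "candidate novel allele"
        return label, forward
    # The directions disagree: accept a database match only when the two
    # sequences differ in homopolymer run lengths and nothing else.
    f_bases = [base for base, _ in collapse_homopolymers(forward)]
    r_bases = [base for base, _ in collapse_homopolymers(reverse_rc)]
    if f_bases == r_bases:
        for candidate in (forward, reverse_rc):
            if candidate in database:
                return "known allele (homopolymer corrected)", candidate
    return "unresolved", None
```

In the speaker’s example, a run of five G in one direction and four in the other resolves to the four‐G database sequence, while five G in both directions is flagged as a candidate new allele.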

Sean Sanders: Okay. That’s great. That actually answers another question, whether it’s important to sequence in both directions for short reads.

Dr. Henry Erlich: I think it’s helpful. I think it’s not absolutely essential but I prefer it when possible. Now, with the long reads that we can achieve with Titanium, 450 base pairs, we can actually do the HLA exons in both directions. With the standard chemistry, you know, we got very good overlap, but there were parts of the exon that were sequenced only in one direction, and so this gives us added confidence.

Sean Sanders: Right. And a follow‐up question to that is what is the minimum read count required to detect a 1% variant? Quick calculation in your head.

Dr. Henry Erlich: Well, I can just give you some examples. In the SCIDS patient, we had 1235 reads for the inherited HLA‐C alleles and 15 reads for the non‐transmitted maternal variants. So those are the kind of numbers where I feel pretty confident, and that was about 1% obviously in terms of sequence read recovery. But if the numbers were substantially lower than that, I’d be a little less confident.
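
The viewer’s question can also be approached with a simple binomial model. The sketch below is one hedged way to frame it, not a method given by the panel: it treats each read as an independent draw, ignores sequencing and PCR error, and asks how deep one must sequence to see at least `k` supporting reads with a chosen probability. The function names and the thresholds are illustrative choices.

```python
from math import comb


def prob_at_least(k, n, f):
    """P(X >= k) for X ~ Binomial(n, f): chance of k or more variant reads."""
    return 1.0 - sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))


def min_depth(k, f, confidence=0.95, step=1):
    """Smallest depth (searched in increments of step) meeting the confidence."""
    n = k
    while prob_at_least(k, n, f) < confidence:
        n += step
    return n


# Observed fraction in the SCIDS HLA-C example: 15 of 1250 reads.
observed_fraction = 15 / (1235 + 15)
```

Under this model the observed fraction is 1.2%, and requiring more supporting reads (a larger `k`) to guard against artifacts pushes the needed depth up accordingly, which is consistent with the speaker being comfortable at roughly 1250 reads and less so at substantially lower counts.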

Sean Sanders: Uh‐hum. I’m going to broaden the discussion a little bit and ask Dr. Smith, how do you think next‐generation sequencing technologies can be used to identify regions of variability in species with low genetic diversity, say maybe endangered species?

[0:55:10] Dr. Michael Smith: That’s a good question, and I think if you’re talking ‐‐ if the goal was to identify genetic diversity in those species, probably a very reasonable approach is to pool DNA samples in advance since you’re looking for variants and then look to see alleles enough times to be confident that they’re there and not an error of the platform. It’s certainly one of the approaches I would take if I were working on that problem.

Sean Sanders: Great. Dr. Vigneault, for you, long reads generated by a 454 sequencer are especially suitable for microbiome sequencing, and I believe you’ve done some work with microbiomes. Are there any labs commercially offering 16S rRNA‐based microbiota analysis that you know of?

Dr. Francois Vigneault: Not servicing, but I mean, I don’t think the assay is that hard to do. We have a few people in the lab doing microbiome and I have helped one of them develop a pipeline for Illumina. It depends on the application. In our lab, most of the people have been doing more like a short version of sequencing, like 75 base pairs, just to get the classification. So here, the goal was more about getting a lot, a lot of reads so you can capture a lot of bugs. But there are other applications where you need the full 16S to be able to diversify like subspecies very deeply, and then you need a longer read length, which is where 454 comes in.

And we were having some discussion about this just before this, and it sounds to me that it would be time for someone just to do a full like run of the full 16S and then compare with what’s been done in other labs and see how that correlates and what’s missing in each technique, and that would be a very useful paper to see out there.

Sean Sanders: Uh‐hum.

Dr. Francois Vigneault: But yeah, it’s not that hard to do so I’d say just knock on doors, and I don’t just see the commercial application just yet for microbiomes so it’s more fundamental research. So it’s going to take a while before a company offers that I think.

Sean Sanders: Okay.

Dr. Michael Smith: We’re running those sorts of experiments in the core labs at the NCI. That’s not commercial but it’s available to NCI investigators.

Sean Sanders: Okay, good to know. We’re coming to the end of our hour but I’m going to try squeezing in a couple more questions. The first is what do you believe is the bottleneck that prevents the application of next‐gen sequencing to high throughput genetic diagnosis or diagnostics, in other words, thousands of samples at a low price? So Dr. Erlich, let’s start with you.

Dr. Henry Erlich: Well, I’d say it’s probably the automation of the library preparation, or not so much the library preparation but the preparation of the enriched beads following the emulsion PCR. So I had a slide that unfortunately I sort of skipped in the interest of time, but the sequencing itself on the 454 platform and the data analysis with Conexio Genomics is very quick. It’s a day for sequencing and the analysis as well, but it’s about four days of preparation from the genomic PCR to capturing the DNA‐containing beads from the emulsion PCR. And for the alpha site study with the eight participating laboratories, that was done all manually because they didn’t have access to automation.

In our own lab, we’ve automated significantly with robots. But I think one of the great hopes is the approach that Michael was talking about, the Fluidigm Access Array, and I think that can be standardized and used to automate the front end of the PCR, the multiplex PCR. And also 454 has developed something called REM e Module for automating the bead washing and capture after the emulsion PCR. I think if those two steps are implemented in these automated procedures, that will really have a very significant impact on the ultimate implementation of next‐gen sequencing in both high throughput research labs and also in clinical labs.

Dr. Francois Vigneault: One thing we did in the Church lab, I have been involved with that a lot. We got tired of doing the emulsion PCR for the Polonator so we got rid of emulsion. We made the DNA nanoball. We called them raw DNA and then CGI coined DNA nanoball later on. But basically, the reaction of making a genomic human library into a sequenceable DNA nanoball takes about half a day, you know. So it’s very quick. No more emulsion‐like problems. It’s very, very quick, very convenient, and you can get a billion features basically on your flow cell, which is immense. So if we could hook that up to long read length sequencing chemistry, that would just be awesome, you know.

Sean Sanders: Uh‐hum, uh‐hum. So I’m going to end off, 'cause we’re at the end of our hour, with a question to all of you about the future.

[1:00:00] And so I’m going to paraphrase a question that just came in actually asking about nanoball sequencing, but where do you think the sequencing next‐gen, next, next‐gen, fourth‐gen space is going in the next few years? I know it’s a big question, it’s a big area, and there’s a lot of competition, but I will start with Dr. Smith and your views.

Dr. Michael Smith: We were talking before this about just the naming conventions of next gen and third gen and fourth gen and it’s almost like it’s spreading out into a variety of areas ‐‐ nanopore, single molecules. It’s really hard to know where it’s going. There’s a lot of exciting new technologies coming along and some of which probably are in stealth mode we don’t even know about right now.

Sean Sanders: Uh‐hum. Dr. Vigneault?

Dr. Francois Vigneault: I’ll give a shout‐out to some of the guys I know at Halcyon Molecular. It’s with H‐A‐L‐C‐Y‐O‐N. Those guys are doing something crazy with TEM microscopy and I don’t know if it’s going to work, but if they can pull it off, I think they’re going to change the world with this.

As for nanopore, I mean, there is some evidence that it’s going to happen. We just don’t know when and how good it’s going to be, but that’s definitely going to be interesting to watch. When it comes down to cost, you know, it has to be free sequencing at some point, you know, where maybe the insurance company is paying for sequencing, you know, something like that. But that’s what the Personal Genome Project is trying to achieve, you know, get that going, you know.

Sean Sanders: Right. Dr. Erlich, final word.

Dr. Henry Erlich: Well, I’ve been focusing as I mentioned on clinical applications and research applications of amplicon sequencing with the existing next‐generation technology, but I think the nanopore technology, with which I’m not that familiar, I’ve heard a few talks, I think that offers great potential. But I think it’s realistically speaking 5 or 10 or maybe even 15 years away, but I think we need to keep track of that obviously because what we’re doing now is certainly not what we’re going to be doing 10 years from now.

Sean Sanders: Right. Excellent. Well, unfortunately we are out of time so I’d like to thank all of our panelists for being with us today and for their very interesting presentations and for generously sharing their knowledge and expertise with us: Dr. Michael Smith from SAIC‐Frederick and National Cancer Institute, Dr. Francois Vigneault from the Harvard Medical School, and Dr. Henry Erlich from Roche Molecular Systems and Children’s Hospital Oakland Research Institute.

A big thank you to our viewers for the questions that you submitted. Apologies if we didn’t have time to get to yours. Please go to the URL at the bottom of your slide viewer that is up now to learn about products related to today’s discussion and look out for more webinars from Science available at www.sciencemag.org/webinar. This webinar will be made available to view again as an on‐demand video within approximately 48 hours from now.

Please share your thoughts about the webinar with us at any time by sending us an email at the address now up in your slide viewer: webinar@aaas.org.

Again, thank you to the panelists and thank you to Roche and 454 Sequencing for their kind sponsorship of today’s educational seminar. Goodbye.

[1:02:58] End of Audio
