The SMRTer Way: Single Genes to Complex Genomes
Ulf Gyllensten, Professor
Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
Topics
• National Genomics Infrastructure (NGI).
• PacBio from single genes to complex
genomes.
National Genomics Infrastructure (NGI)
• Among the five largest European sequencing centers.
• Core facility open to Swedish research groups.
• Massively parallel sequencing (MPS), Sanger sequencing and genotyping.
• Funded as a National Research Infrastructure by SciLifeLab, the Swedish Research Council (VR-RFI) and the KAW Foundation.
MPS technologies at NGI
Short-read MPS Long-read MPS
Analysis cluster and storage of MPS data
• ~3 million CPU hours/month on a dedicated cluster.
• ~7 PB storage.
• Long-term storage in archive.
• Compute nodes with extra-large memory (2 TB).
From reads to… …assembled genomes
PacBio sequencing at NGI/Uppsala
Two Pacific Biosciences RSII systems
June 2013 August 2014
PacBio – Data production in Uppsala
Assembly projects
• BACs, YACs, fosmids, plasmids
• Gram-positive and Gram-negative bacteria
• Archaea
• Parasitic protists
• Fungi (yeasts, mushrooms)
• Algae
• Mosses
• Higher plants
• Worms
• Butterflies, insects
• Birds
• Lizards
• Fish
• Mammals
Applications on PacBio
Non-clinical applications:
Complete genomes
BACs/YACs/plasmids
16S rRNA
Gap filling
Whole transcriptome sequencing
Isoform discovery
Amplicon sequencing
Mutation detection
Haplotype phasing
Target re-sequencing
Metagenomics
Procaryotic methylation
Clinical applications:
Chronic Myeloid Leukemia
Acute Myeloid Leukemia
HLA sequencing
Repeat expansions
Infection screening
PacBio applications
A. Small genome assembly
B. De novo complex genome assembly
C. Targeted sequencing
Small genome assembly
- PacBio is the method of choice for small genomes.
- Sample quality is crucial: good quality gives an (almost) complete genome, poor quality gives a partial genome or none.
Example:
Complex genomes: De Novo Assembly of Rabbit Genome
Two for one genome: Assembly of an F1-hybrid between two subspecies of rabbit.
PI: Professor Leif Andersson, Uppsala
Aims:
• Create new reference assembly/assemblies.
• In-depth characterization of loci exhibiting strong allele frequency shifts around the hybrid zone between O. c. cuniculus and O. c. algirus in Spain.
• 2 % of the genome shows a dramatic reduction in its ability to spread across the hybrid zone; the rest of the genome leaks across.
Lagomorpha
• The order Lagomorpha consists of two families: Leporidae (hares and rabbits) and Ochotonidae (pikas).
• Likely radiated from a common ancestor in Asia 60 million years ago.
• The European rabbit (Oryctolagus cuniculus) and the closest extant species, the hispid hare (Caprolagus hispidus) in South Asia, diverged approximately 7-10 million years ago, like most of the Leporidae.
Evolutionary History of Lagomorphs in Response to Global Environmental Change, PLoS One, April 2013 | Volume 8 | Issue 4
O. c. algirus
O. c. cuniculus
Dispersal to southern France
Origin and domestication of the European rabbit (O. cuniculus)
Strategy and challenges
• 300 SMRT cells (around 200Gb) run in Uppsala
– O.c.c x O.c.a hybrid
• 6 BioNano runs (by BioNano)
• Parents of F1-hybrid sequenced to 30x using PCR-free Illumina libraries.
• BAC-ends and fosmids from Sanger assembly
– (250k & 2 million respectively)
• Sanger assembly OryCun2 (2.74 Gb)
• Falcon diploid assembly attempted
• Very high heterozygosity!
De Novo Assembly of Rabbit using BioNano
6 runs conducted with 400 Gb of molecules >150kb
Raw data (molecules > 150 kb)   Initial Assembly   High Depth Assembly   Stringent Assembly
Data input                      184 Gb             367 Gb                367 Gb
Number of genome maps           3595               3651                  5172
Assembly size                   2.57 Gb            3.76 Gb               4.44 Gb
Genome map N50                  0.87 Mb            1.4 Mb                1.07 Mb
Longest genome map              4.5 Mb             6.4 Mb                6.3 Mb
Heterozygous Genome Maps are Produced
[Figure: reference (Ref) shown against genome maps (GM)]
SciLifeLab Whole Human Genome Initiative
- WGS of patient cohorts (n = 10,000 individuals/year).
- Establish a Genetic Variant Database for the Swedish Population (n = 1,000).
Population genomics projects
• Genomics England: 100,000 whole genomes from patients by 2017.
• The 1000 Genomes Project: genomes of 2500 unidentified people from 25 populations.
The Swedish Genetic Variant Project
A. Identify a cohort that reflects the genetic structure of the Swedish population.
B. Generate WGS data using short- and long-read MPS technologies.
C. Establish a user-friendly database to make information available to the research community (association analyses) and clinical genetics laboratories.
The Swedish Twin Registry
• Inclusion is based on twinning, and the geographic distribution follows population density.
• Disease prevalence is that of the general population.
• 10,000 individuals have been analysed with SNP arrays.
• Identify 1,000 individuals based on genetic structure and diversity across Sweden.
Principal components of European samples from the 1000 Genomes Project and 10,000 Swedish samples
[Figure: PCA plot. Cluster labels: Finland; Northern Sweden; Southern-Central Sweden; England and Scotland; Italy; Spain]
Main genetic differentiation is between Southern-Central and Northern Sweden
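As a rough illustration of how a principal-component plot of this kind separates groups, the sketch below computes the first principal component of a tiny, made-up genotype matrix (rows are individuals, columns are SNPs coded 0/1/2) by power iteration. The data, the function names and the two-group layout are illustrative assumptions, not the project's data or pipeline.

```python
# Toy PCA on a genotype matrix using only the standard library.
# Rows = individuals, columns = SNPs coded as 0/1/2 copies of an allele.
# All genotypes below are invented for illustration.

def center_columns(matrix):
    """Subtract each column (SNP) mean so that every SNP has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def first_principal_component(matrix, iterations=200):
    """Return (loadings, scores) for PC1, found by power iteration."""
    x = center_columns(matrix)
    n_snps = len(x[0])
    # SNP-by-SNP scatter matrix X^T X (proportional to the covariance).
    scatter = [[sum(row[i] * row[j] for row in x) for j in range(n_snps)]
               for i in range(n_snps)]
    # Start from the first centered individual (assumed not all zeros).
    v = list(x[0])
    for _ in range(iterations):
        w = [sum(scatter[i][j] * v[j] for j in range(n_snps))
             for i in range(n_snps)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    # Project every individual onto the PC1 direction.
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


# Two hypothetical groups of three individuals each.
genotypes = [
    [2, 2, 0, 0],  # group A
    [2, 1, 0, 0],  # group A
    [1, 2, 0, 1],  # group A
    [0, 0, 2, 2],  # group B
    [0, 1, 2, 2],  # group B
    [1, 0, 2, 1],  # group B
]

loadings, pc1_scores = first_principal_component(genotypes)
for label, score in zip("AAABBB", pc1_scores):
    print(label, round(score, 2))
```

In this toy case the two groups fall on opposite sides of zero along PC1, which is the kind of separation the plot shows between Northern and Southern-Central Sweden.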
Individuals selected for WGS and 1000 G EUR
Northern Sweden
Southern – Central Sweden
European samples from 1,000 genomes project and 1,000 selected Swedish samples
WGS of Swedish control cohort
Step 1:
• Short-read Illumina X-Ten data to 30X coverage of the 1,000 individuals.
• Standard pipeline (GATK) for variant calling (SNPs and indels).
• Construct user-friendly database for the community to make use of the data.
• Status:
– Identification of a control cohort – Q1 2015.
– Short-read MPS – Q1 2016.
– Database implementation – Q1 2016.
Database for genetic variants
CanvasDB (CANdidate Variant Analysis System and Data Base)
• Stores genetic variants with annotations, such as prediction of the functional consequence.
• At present holds the 3.1 billion genetic variants in the 1000 Genomes project.
• Search time not proportional to database size.
• Filter tools for analyses of monogenic and complex genetic diseases.
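One common way to get lookups whose cost grows much more slowly than the table size is a B-tree index on chromosome and position. The sketch below shows that idea with Python's built-in sqlite3 module and a handful of invented rows. It is not a description of how CanvasDB is implemented; the table layout, column names and data are assumptions made for illustration.

```python
import sqlite3

# In-memory database with a hypothetical variant table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, consequence TEXT)"
)
# Composite index: region queries become index searches, not full scans.
conn.execute("CREATE INDEX idx_chrom_pos ON variants (chrom, pos)")

# Invented example rows (not real variants).
rows = [
    ("chrT", 100, "A", "G", "missense"),
    ("chrT", 250, "C", "T", "synonymous"),
    ("chrT", 900, "G", "A", "intronic"),
    ("chrU", 120, "T", "C", "missense"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)


def variants_in_region(connection, chrom, start, end):
    """Return all variants on `chrom` with start <= pos <= end."""
    cursor = connection.execute(
        "SELECT chrom, pos, ref, alt, consequence FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cursor.fetchall()


def filter_by_consequence(variants, wanted):
    """Keep only variants whose annotation matches `wanted`."""
    return [v for v in variants if v[4] == wanted]


region_hits = variants_in_region(conn, "chrT", 50, 300)
missense_hits = filter_by_consequence(region_hits, "missense")
print(region_hits)
print(missense_hits)
```

The second helper mirrors the kind of annotation-based filtering the slide mentions, applied here to the result of a region query.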
The Present Human Reference is Not Complete
• Some regions have been recalcitrant to closure with short-read MPS technology.
• Structural variation makes it difficult to assemble a truly representative genome.
• Long-read whole human genome sequencing provides the information needed to resolve these regions.
Genome reference standards
“Platinum” genome sequence
• A contiguous, haplotype resolved representation of the entire genome.
“Gold” genome sequence
• A high-quality, highly contiguous representation.
“Silver” genome sequence
• Standards TBD.
• Non-trio, PB/BN, no BAC library.
Gold Genome Sequencing Approach
Gold Reference Genomes
Platinum Reference Genomes
The Human Reference Genomes Project
CHM1
CHM13
NA19240
HG00733
NA12878
HG00514
NA19434
New Reference Human Genome Sequences
• Platinum Genomes
– CHM1: An integrated assembly of Illumina, PacBio, BAC and BioNano data.
– CHM13: PacBio data assembly + BioNano data.
• Gold Genomes
– NA19240: Yoruba trio child; assembly completed.
– HG00733: Puerto Rican trio child; sequencing in progress.
– HG00514: Han Chinese trio child, Q4 2015.
– NA19434: Luhya (Kenya) trio child, Q1 2016.
WGS of Swedish cohort
Step 2:
• Establish Swedish reference genome sequences by de novo
assembly of long-read Pacific Biosciences data (+BN).
Ref genome individuals
First Swedish PacBio WGS
First PacBio Assembly
# of contigs (>=0 bp)       7708
# of contigs (>=1000 bp)    7653
Total length (>=0 bp)       2844 Mb
Total length (>=1000 bp)    2844 Mb
No. of contigs              7692
Largest contig              19.5 Mb
Total contig length         2844 Mb
N50                         4.35 Mb
N75                         1.97 Mb
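For readers unfamiliar with the N50 and N75 statistics reported above, here is a minimal sketch of how they are usually computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches 50% (or 75%) of the assembly size. The function name and the toy contig lengths are illustrative assumptions; they are not the lengths from this assembly.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are sorted from longest to shortest; the result is the length
    of the contig at which the cumulative length first reaches
    `fraction` of the total assembly length (fraction=0.5 gives N50).
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)  # only reached through floating-point edge cases


# Toy contig lengths in arbitrary units (total = 290).
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = nx(toy_contigs, 0.5)    # 80 + 70 = 150 >= 145
toy_n75 = nx(toy_contigs, 0.75)   # 80 + 70 + 50 + 40 = 240 >= 217.5
print(toy_n50, toy_n75)
```

A higher N50 at the same total length means the assembly is made of fewer, longer pieces.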
• 20 kb library
• 157 SMRT cells
• 140 Gb data (~45X)
• FALCON assembly
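The coverage figure in the list above follows from simple arithmetic: total sequenced bases divided by genome size. The sketch below reproduces it, assuming a haploid human genome size of 3.1 Gb. That value is an assumption and is not stated on the slide; dividing by the 2844 Mb assembly length instead would give a somewhat higher figure.

```python
# Assumed haploid human genome size in base pairs (not from the slide).
HUMAN_GENOME_BP = 3.1e9


def fold_coverage(total_bases, genome_size=HUMAN_GENOME_BP):
    """Average depth of coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size=HUMAN_GENOME_BP):
    """Bases required to reach a target average coverage."""
    return target_coverage * genome_size


total_bases = 140e9   # 140 Gb of PacBio data
smrt_cells = 157

coverage = fold_coverage(total_bases)
yield_per_cell_gb = total_bases / smrt_cells / 1e9
illumina_30x_gb = bases_needed(30) / 1e9

print(f"coverage: ~{coverage:.0f}X")
print(f"mean yield per SMRT cell: ~{yield_per_cell_gb:.2f} Gb")
print(f"data for 30X short-read coverage: ~{illumina_30x_gb:.0f} Gb per individual")
```

The same function gives the per-individual data volume implied by the 30X short-read target in Step 1.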
WGS of Swedish control cohort
Step 3:
• Targeted long-read sequencing of regions of high medical importance (HLA, trinucleotide repeat expansions).
• Resolve structural variation and repeats.
• Phase variation in repetitive regions and individual
alleles.
• Study the methylation pattern in native DNA.
Methods for Targeted PacBio sequencing
• Long-range PCR.
• Target enrichment by hybridisation using DNA or RNA probe arrays.
• Amplification-free targeted enrichment.
Long-range PCR: HLA sequencing
HLA sequencing workflow
1. LR-PCR Amplification
2. SMRTbell prep
3. SMRT Sequencing
4. PB Long Amplicon Analysis
5. Allele identification (GenDx)
Long-range PCR: FADS
• FADS region has been under selection in human evolution
• Regulates the production of omega-3/6 polyunsaturated fatty acids (PUFA)
• Region is associated with many traits and diseases
• Two main haplotypes in humans: Ancestral and Derived
FADS project - functional variant at rs174557
Functional variant for FADS1 expression identified!
But is it linked to the Ancestral or Derived haplotype?
Pan et al (submitted)
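The linkage question above is the kind that long reads can answer directly: if a single read spans both the functional variant and a site that distinguishes the two haplotypes, the alleles seen together on that read are on the same chromosome. The sketch below tallies such co-occurrences. The read records, site names and allele labels are invented placeholders, not data or allele calls from the FADS project.

```python
from collections import Counter


def phase_counts(reads, site_a, site_b):
    """Count allele pairs observed together on the same read.

    `reads` is a list of dicts mapping a site name to the allele seen
    on that read; reads that do not cover both sites are skipped.
    """
    counts = Counter()
    for read in reads:
        if site_a in read and site_b in read:
            counts[(read[site_a], read[site_b])] += 1
    return counts


# Hypothetical per-read allele observations.
toy_reads = [
    {"haplotype_tag": "derived", "functional_site": "allele_1"},
    {"haplotype_tag": "derived", "functional_site": "allele_1"},
    {"haplotype_tag": "ancestral", "functional_site": "allele_2"},
    {"haplotype_tag": "derived", "functional_site": "allele_1"},
    {"haplotype_tag": "ancestral", "functional_site": "allele_2"},
    {"haplotype_tag": "ancestral"},  # does not reach the functional site
]

pair_counts = phase_counts(toy_reads, "haplotype_tag", "functional_site")
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

In real data, sequencing errors produce a few discordant pairs, so the dominant pairs are what indicate the phase.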
PacBio sequencing of FADS region
Hybridization capture and pooled sequencing of FADS region:
[Figure: FADS region with labels AluYe5, rs174559, rs174557 and rs174556; > 1.2 kb]
Results:
• Derived haplotype increases FADS1 activity
• Ancestral haplotype reduces FADS1 activity
Targeted enrichment using DNA probe arrays
Targeted enrichment using RNA probes
Modified version of PacBio+Agilent protocol
Capture of a ~2 kb library
Reads mapped back to human genome
Off-target capture of gene not in probe design region
• MIC-B gene is captured because of high similarity to MIC-A!
MIC-B MIC-A
De novo assembly of captured region
A method to resolve structural variations and repeats
• Repeat length in example: 300-500 bp
• Difficult or impossible to resolve with short reads
Amplification-free targeted enrichment
• Using Cas9 for targeting.
• Sequence native DNA.
• Compatible with multiple targets: HTT, FMR1, ALS & SCA10 in one reaction.
• Under development
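Once a read spans a repeat locus, the simplest readout is the number of consecutive repeat units. The sketch below counts the longest uninterrupted run of a repeat unit in a sequence, using CAG (the unit expanded in HTT) as the example. The function name and the toy sequence are assumptions for illustration. Exact matching like this ignores sequencing errors and interrupted repeats, so it is a starting point rather than a genotyping method.

```python
import re


def longest_repeat_run(sequence, unit="CAG"):
    """Return the largest number of consecutive copies of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    run_lengths = [m.end() - m.start()
                   for m in pattern.finditer(sequence.upper())]
    return max(run_lengths, default=0) // len(unit)


# Invented read: flank, 21 CAG units, a short interruption, 2 more units.
toy_read = "TTG" + "CAG" * 21 + "CCA" + "CAG" * 2 + "AT"
print(longest_repeat_run(toy_read))          # longest CAG run
print(longest_repeat_run(toy_read, "CTG"))   # unit absent from this read
```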
Input DNA
SMRTbell library
Cas9 targeting
Sequencing
Technology Waves in Human Genome Analysis
• Genome-Wide Association Studies
• Exome (Re-) Sequencing
• Short-read Genome (Re-) Sequencing
• Comprehensive Short-read Genome (Re-) Sequencing
• Whole-Genome De Novo Sequencing using long reads.
Jim Lupski: “The Goal Is De Novo Assembly in the Clinic”
What we sequence at NGI / Who does the sequencing?
Adam Ameur, Bioinformatician, NGS
Ignas Bunikis, Bioinformatician, NGS
Christian Tellgren-Roth, Bioinformatician, NGS
Ulf Gyllensten, Platform director
Inger Jonasson, Facility manager
Olga Vinnere Pettersson, Project coordinator
Susana Häggqvist, Research engineer, NGS
Cecilia Lindau, Research engineer, NGS
Ulrika Broström, Research engineer, NGS
Ida Höijer, Research engineer, NGS
Maria Schenström, Research engineer, NGS
Nina Williams, Research engineer, NGS
Magdalena Andersson, Research engineer, NGS
Carolina Ilbäck, Research engineer, NGS
Anna Petri, Research engineer, Sequencing Service
Anne-Christine Lindström, Research engineer, Sequencing Service
THANK YOU