Karen Miga
UC Santa Cruz
Assessing variation in the human genome
enables discovery research
“Much of the missing heritability (the ‘dark matter’ of the
genome) will probably turn up as the technology
advances.”
- Francis Collins
Nature 464, 674-675 (2010)
CENTROMERIC REGIONS (CEN)
Millions of bases of repetitive DNA
The promise of long read sequences to improve
sequence variant discovery
NO LONGER CONSIDERED “JUNK DNA”
Function of Centromeric and Heterochromatic DNA
• CENTROMERE FUNCTION
• CANCER
• AGING
PacBio Long Read Sequences to Predict
Satellite Sequence Variants (satVARs)
Quality Corrected Reads
Automated Sequence Characterization
Genome
SatVAR Discovery
Generate a profile of satellite variants for a given individual genome
ALPHA SATELLITE
• ~171 bp tandem repeat
• Monomers within one repeat unit (1 2 3 4): wide range of percent ID, ~60-100%
• “Higher Order Repeat”: multi-monomeric repeat unit, itself repeated in tandem (1 2 3 4 1 2 3 4 1 2 3 4)
• Copies of the higher order repeat within an array: narrow range of percent ID, 94-100%
Alpha satellite defines all normal human centromeres
Alpha satellite repeats (or monomers) are commonly
found in long arrays of near-identical higher order
repeats
Satellite DNA is the primary sequence in each gap
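The two identity ranges above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the sequences are random, divergence is modeled by substitutions alone (no indels), and the site counts (34 and 3 out of 171) are illustrative choices meant to land near the ranges quoted on the slide, not measured values.

```python
import random

BASES = "ACGT"

def substitute(seq, n_sites, rng):
    """Return a copy of seq with exactly n_sites positions changed to a different base."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_sites):
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)

def percent_identity(a, b):
    """Ungapped percent identity between two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed so the toy data is reproducible
ancestor = "".join(rng.choice(BASES) for _ in range(171))

# Four diverged monomers: each differs from the ancestor at 34 of 171 sites (~20%).
hor = [substitute(ancestor, 34, rng) for _ in range(4)]

# Three copies of the 4-mer HOR: each monomer copy differs from its template at 3 sites (~2%).
array = [[substitute(m, 3, rng) for m in hor] for _ in range(3)]

# Monomer 1 vs monomer 2 inside the same HOR copy (diverged monomers).
within_hor = percent_identity(array[0][0], array[0][1])
# Monomer 1 of the first HOR copy vs monomer 1 of the second copy (near-identical).
between_copies = percent_identity(array[0][0], array[1][0])

print(f"within one HOR: {within_hor:.1f}%   between HOR copies: {between_copies:.1f}%")
```

By construction, two copies of the same monomer differ at no more than 6 of 171 sites, so the between-copy identity stays above 96%, while neighbouring monomers inside one repeat unit are much more diverged.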
Array 1, Array 2, Array 3
Each chromosome has a different centromeric sequence
Higher-order arrays vary between individuals
Array 1, Individual A: ~0.5 Mb
Array 1, Individual B: ~2.0 Mb
Higher-order arrays can vary between
homologous chromosomes in the same individual
Array 1, maternally inherited (mat): ~0.5 Mb
Array 1, paternally inherited (pat): ~2.0 Mb
Model of Centromere Sequence Organization
Array 1: [8-mer]n
Array 2: [4-mer]n
• Rearrangements in repeat structure: DELETION (6-mer), INSERTION (12-mer)
• Shifts in repeat orientation
• Sites of Interspersed Repeats (LINE)
• Junction with seemingly unique DNA
• Transcribed Genes
Implement a strategy to characterize satellite sequence variants with
long-read sequences
github.com/volkansevim/alpha-CENTAURI
ALPHA satellite CENTromeric AUtomated Repeat Identification
• Reads shorter than the underlying repeat structure must rely on indirect inference methods; long reads allow direct inference of satellite higher order repeat structure.
CHM1 GENOME
87606 error-corrected preads
Human Centromeric DNA Variants: Alpha Satellite
3′ 5′
1,9S1
2.5 kb quality corrected PacBio read
1. Identifies clusters of monomers with high sequence similarity (FALCON error correction module), e.g. 98% identical
2. Sets the cluster similarity threshold per read by evaluating a range of identity values (98% to 88%, in 1% decrements)
3. Evaluates the spacing (# bases) between monomers involved in each cluster group
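The clustering, threshold sweep, and spacing evaluation described in these steps can be sketched roughly as follows. This is not the alpha-CENTAURI implementation: every function name here is hypothetical, identity is a simple ungapped comparison of equal-length toy monomers, clustering is single-linkage, and spacing is counted in monomers rather than bases.

```python
import random
from collections import Counter

def percent_identity(a, b):
    """Ungapped percent identity between two equal-length toy monomers."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_monomers(monomers, threshold):
    """Single-linkage clusters (lists of read positions) with at least two members."""
    parent = list(range(len(monomers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(monomers)):
        for j in range(i + 1, len(monomers)):
            if percent_identity(monomers[i], monomers[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(monomers)):
        clusters.setdefault(find(i), []).append(i)
    return [members for members in clusters.values() if len(members) > 1]

def infer_hor(monomers, thresholds=range(98, 87, -1)):
    """Sweep identity thresholds from 98% down to 88% and report the repeat period."""
    for t in thresholds:
        clusters = cluster_monomers(monomers, t)
        if not clusters:
            continue  # nothing clusters at this stringency, relax by 1%
        # Spacing between consecutive members of each cluster, in monomers.
        spacings = [b - a for members in clusters
                    for a, b in zip(members, members[1:])]
        period, _ = Counter(spacings).most_common(1)[0]
        regular = all(s == period for s in spacings)
        return {"threshold": t, "period": period, "regular": regular}
    return None

def toy_read(order, length=100, seed=0):
    """Hypothetical test data: one monomer per label, each copy with one substitution."""
    rng = random.Random(seed)
    templates, read = {}, []
    for label in order:
        if label not in templates:
            templates[label] = [rng.choice("ACGT") for _ in range(length)]
        copy = list(templates[label])
        pos = rng.randrange(length)
        copy[pos] = rng.choice([b for b in "ACGT" if b != copy[pos]])
        read.append("".join(copy))
    return read

print(infer_hor(toy_read([1, 2, 3, 4] * 3)))                    # uniform 4-mer array
print(infer_hor(toy_read([1, 2, 3, 4, 1, 2, 4, 1, 2, 3, 4])))   # one monomer deleted
```

In this sketch a read counts as “regular” when every cluster shows the same spacing, and as “irregular” when a deletion or insertion shifts the spacing in at least one cluster.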
“Regular” Repeat Structure
D11Z1 (chr11 CEN): 5-mer
1680 PacBio preads: REGULAR vs. IRREGULAR
chr11 CEN repeat variants
6-mer (89.9%, 391 preads)
4-mer (1.4%, 39 preads)
4-mer (1.6%, 43 preads)
4-mer (1.4%, 40 preads)
INSERTION (6-mer)
Example: 15,559 bp pread, monomer order 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 2
chr11 CEN: INVERSION
Inversion: 1 event, junction: 236 bp
In total, ~5% (4493/87606)
of all alpha satellite reads
provide evidence for
an inversion
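One simple way to flag inversion evidence on a read is to look for a change in monomer strand along it. The sketch below assumes each read has already been reduced to a string of per-monomer strand calls (`+` or `-`); that representation, the function name, and the example reads are all hypothetical. The last lines recompute the quoted fraction from the two counts on the slide.

```python
def count_strand_switches(strands):
    """Count positions where adjacent monomers on a read lie on opposite strands."""
    return sum(a != b for a, b in zip(strands, strands[1:]))

# Hypothetical per-monomer strand calls for two reads.
reads = {
    "read_a": "++++++++",   # single orientation, no inversion evidence
    "read_b": "+++++---",   # one switch, consistent with one inversion junction
}

with_inversion = [name for name, s in reads.items()
                  if count_strand_switches(s) > 0]
print(with_inversion)

# Fraction quoted on the slide: reads with inversion evidence / all alpha satellite reads.
fraction = 100.0 * 4493 / 87606
print(f"{fraction:.1f}%")
```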
TRACKING SITES OF INTERSPERSED REPEATS
LINE/L1 L1Hs (2384 bp) LINE/L1 L1P3 (1358 bp)
96% recent LINEs
(L1Hs, L1P1, L1PA2-4)
chr3 CEN
3918 preads that contain both alpha satellite and at
least 10 kb of non-alpha satellite sequence
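A filter of this kind could look like the sketch below. It assumes each pread comes with a list of annotated alpha satellite intervals as half-open `(start, end)` coordinates, and it interprets “at least 10 kb” as total non-alpha bases on the read rather than one contiguous stretch; both are assumptions, and the function names are hypothetical.

```python
def non_alpha_bases(read_length, alpha_intervals):
    """Bases on the read not covered by any alpha satellite interval."""
    covered = 0
    last_end = 0
    for start, end in sorted(alpha_intervals):
        start = max(start, last_end)   # clip overlap with earlier intervals
        if end > start:
            covered += end - start
            last_end = end
    return read_length - covered

def is_junction_candidate(read_length, alpha_intervals, min_non_alpha=10_000):
    """True if the read has alpha satellite plus enough non-alpha sequence."""
    if not alpha_intervals:
        return False
    return non_alpha_bases(read_length, alpha_intervals) >= min_non_alpha

# Hypothetical annotations on a 15,559 bp read.
print(is_junction_candidate(15_559, [(0, 5_000)]))
print(is_junction_candidate(15_559, [(0, 5_000), (4_000, 6_000)]))
```

The first call leaves 10,559 non-alpha bases and passes the filter; the second leaves 9,559 and does not.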
Identify Junctions with seemingly unique DNA
Primase, DNA, polypeptide 2 (prim2), mRNA
chr3/chr6 Paralogous (non-sat) Region, ~300 kb
CHM1: LJ1101000307.1
Full Coverage (~60x)
Low Coverage (10x)
“Unmapped”
Database
PacBio Long Read Sequences to Predict
Satellite Sequence Variants (satVARs)
CHM1 Genome
SatVAR Discovery
Profile of satellite DNA variants:
CHM1, CHM13
TRIO data sets (CEPH and GIAB)
Acknowledgements
Volkan Sevim
Jason Chin
Ali Bashir
github.com/volkansevim/alpha-CENTAURI
1000 Genomes Sequence Data
(400 male individuals)