
Karen Miga

UC Santa Cruz

Assessing variation in the human genome enables discovery research

“Much of the missing heritability (the ‘dark matter’ of the genome) will probably turn up as the technology advances.”

- Francis Collins, Nature 464, 674-675 (2010)

CEN

CENTROMERIC REGIONS

Millions of bases of repetitive DNA

The promise of long-read sequences to improve sequence variant discovery

NO LONGER CONSIDERED “JUNK DNA”

Function of Centromeric and Heterochromatic DNA

DNA, CENTROMERE FUNCTION, CANCER, AGING

PacBio Long Read Sequences to Predict Satellite Sequence Variants (satVARs)

Quality Corrected Reads, Automated Sequence Characterization, Genome SatVAR Discovery

Generate a profile of satellite variants for a given individual genome

ALPHA SATELLITE

~171 bp tandem repeat (monomers 1 2 3 4)

Wide range of percent ID between monomers: ~60-100%

Alpha satellite defines all normal human centromeres

Alpha satellite repeats (or monomers) are commonly found in long arrays of near-identical higher-order repeats (1 2 3 4 | 1 2 3 4 | 1 2 3 4)

“Higher Order Repeat”: a multi-monomeric repeat unit (for example monomers 1 2 3 4, repeated in tandem)

Satellite DNA is the primary sequence in each gap

Narrow range of percent ID between higher-order repeat copies: 94%-100%
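The contrast between the wide identity range among monomers and the narrow range among higher-order repeat copies can be illustrated with a minimal sketch. The 10 bp toy sequences below are hypothetical (real monomers are ~171 bp), and the ungapped, position-by-position identity is a simplifying assumption rather than the alignment method used in the talk.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (ungapped comparison)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Toy 10 bp "monomers" (hypothetical; real alpha satellite monomers are ~171 bp).
hor_copy_1 = ["ACGTACGTAC", "ACGTTCGAAC", "TCGTACGGAC", "ACCTACGTAG"]
# Second copy of the same 4-monomer unit, with one base changed in monomer 3.
hor_copy_2 = ["ACGTACGTAC", "ACGTTCGAAC", "TCGTACGGAT", "ACCTACGTAG"]

# Neighbouring monomers inside one higher-order repeat can differ substantially.
within = percent_identity(hor_copy_1[0], hor_copy_1[1])
# Whole repeat units compared copy-to-copy are nearly identical.
between = percent_identity("".join(hor_copy_1), "".join(hor_copy_2))

print(f"monomer 1 vs monomer 2: {within:.1f}%")
print(f"HOR copy 1 vs HOR copy 2: {between:.1f}%")
```

In this toy case the two monomers share 80% identity while the two repeat-unit copies share 97.5%, mirroring the pattern described on the slides.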

Each chromosome has different centromeric sequences (Array 1, Array 2, Array 3)

Higher-order arrays vary between individuals

Array 1, Individual A vs. Individual B (e.g. ~0.5 Mb vs. ~2.0 Mb)

Higher-order arrays can vary between homologous chromosomes in the same individual

Array 1, maternally inherited (mat) vs. paternally inherited (pat) (e.g. ~0.5 Mb vs. ~2.0 Mb)

Model of Centromere Sequence Organization

Array 1: [8-mer]n

Array 2: [4-mer]n

• Rearrangements in repeat structure: DELETION (6-mer), INSERTION (12-mer)
• Shifts in repeat orientation
• Sites of interspersed repeats (LINE)
• Junctions with seemingly unique DNA
• Transcribed genes

Implement a strategy to characterize satellite sequence variants with long-read sequences

alpha-CENTAURI: ALPHA satellite CENTromeric AUtomated Repeat Identification

github.com/volkansevim/alpha-CENTAURI

• Reads shorter than the underlying repeat structure rely on indirect inference methods; long reads allow direct inference of satellite higher-order repeat structure.
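As a rough illustration of why read length matters, the sketch below counts how many complete higher-order repeat units could fit inside a single read. The ~171 bp monomer length comes from the slides; the 100 bp short-read length, the function name, and the choice of an 8-mer are assumptions for illustration only.

```python
MONOMER_BP = 171  # approximate alpha satellite monomer length

def full_hor_copies(read_length_bp: int, monomers_per_hor: int) -> int:
    """Upper bound on complete higher-order repeat units that fit in one read."""
    return read_length_bp // (monomers_per_hor * MONOMER_BP)

# An 8-mer unit spans about 8 * 171 = 1368 bp.
short_read = full_hor_copies(100, 8)     # hypothetical 100 bp short read
long_read = full_hor_copies(15_559, 8)   # a read of roughly 15.5 kb

print(short_read, long_read)
```

A read that cannot contain even one full unit gives no direct view of monomer order, whereas a read spanning several units shows the repeat structure directly.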

CHM1 GENOME

87606 error-corrected preads

Human Centromeric DNA Variants: Alpha Satellite




3′ 5′

1,9S1

2.5 kb quality-corrected PacBio read

1. Identifies clusters of monomers with high sequence similarity (FALCON error correction module), e.g. 98% identical

2. Selects the cluster similarity threshold per read by evaluating a range of identity values (98% to 88%, in 1% decrements)

3. Evaluates the spacing (# bases) between monomers involved in each cluster group

“Regular” Repeat Structure
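The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the alpha-CENTAURI implementation: the single-linkage clustering, the 5% spacing tolerance, and every function name are hypothetical, and the pairwise identity matrix is assumed to be precomputed.

```python
from statistics import median

def cluster_monomers(identity, threshold):
    """Single-linkage clusters: monomers i, j are joined if identity[i][j] >= threshold."""
    n = len(identity)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if identity[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def spacing_is_regular(starts, clusters, tolerance=0.05):
    """True if, within every cluster, consecutive monomers are evenly spaced."""
    for members in clusters:
        if len(members) < 2:
            return False  # a singleton gives no spacing evidence
        gaps = [starts[b] - starts[a] for a, b in zip(members, members[1:])]
        mid = median(gaps)
        if any(abs(g - mid) > tolerance * mid for g in gaps):
            return False
    return True

def classify_read(starts, identity):
    """Scan thresholds 98%..88% in 1% steps; stop at the first with regular spacing."""
    for threshold in range(98, 87, -1):
        clusters = cluster_monomers(identity, threshold)
        if spacing_is_regular(starts, clusters):
            return "REGULAR", threshold, len(clusters)  # cluster count = monomers per unit
    return "IRREGULAR", None, None

# Toy read: six monomers alternating between two types (a 2-mer unit).
types = "ABABAB"
identity = [[100 if i == j else (96 if types[i] == types[j] else 75)
             for j in range(6)] for i in range(6)]
starts = [171 * i for i in range(6)]  # monomer start positions on the read, sorted
result = classify_read(starts, identity)
print(result)
```

In the toy read, same-type monomers are 96% identical, so the scan finds no clusters at 98% or 97% and settles on 96%, where both clusters show a constant 342 bp spacing.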

chr11 CEN: D11Z1 (5-mer)

1680 PacBio preads, classified as REGULAR or IRREGULAR

chr11 CEN (figure: repeat variants and supporting preads; example 15,559 bp pread)

6-mer (89.9%, 391 preads)

4-mer (1.4%, 39 preads)

4-mer (1.6%, 43 preads)

4-mer (1.4%, 40 preads)

INSERTION (6-mer)

chr11 CEN: INVERSION

Inversion: 1 event, junction: 236 bp

In total, ~5% (4493/87606) of all alpha satellite reads provide evidence for an inversion
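One simple way to flag candidate inversions is to look for a change in monomer strand along a read. The sketch below assumes per-monomer strand calls are already available; the function name and the toy strand list are hypothetical, and this is not presented as the method used in the talk. It also recomputes the fraction quoted above.

```python
def count_orientation_switches(strands):
    """Count points along a read where consecutive monomers change strand."""
    return sum(a != b for a, b in zip(strands, strands[1:]))

# Hypothetical per-monomer strand calls along one read.
read_strands = ["+", "+", "+", "-", "-", "-"]
switches = count_orientation_switches(read_strands)  # one switch: candidate inversion junction

# Fraction quoted on the slide: reads with inversion evidence / all alpha satellite reads.
inversion_fraction = 4493 / 87606

print(switches, f"{100 * inversion_fraction:.1f}%")
```

The quoted counts work out to about 5.1%, consistent with the "~5%" on the slide.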

TRACKING SITES OF INTERSPERSED REPEATS

LINE/L1 L1Hs (2384 bp), LINE/L1 L1P3 (1358 bp)

96% recent LINEs (L1Hs, L1P1, L1PA2-4)

chr3 CEN

3918 preads that contain both alpha satellite and at least 10 kb of non-alpha satellite sequence
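A filter of this kind can be sketched as below. The `ReadAnnotation` record, its field names, and the toy reads are hypothetical, and the sketch assumes the 10 kb refers to the total non-alpha sequence on a read rather than one contiguous stretch.

```python
from dataclasses import dataclass

@dataclass
class ReadAnnotation:
    name: str
    length: int            # total read length in bp
    alpha_intervals: list  # non-overlapping (start, end) alpha satellite intervals, end exclusive

MIN_NON_ALPHA_BP = 10_000

def alpha_bp(read: ReadAnnotation) -> int:
    """Total bases annotated as alpha satellite on the read."""
    return sum(end - start for start, end in read.alpha_intervals)

def is_junction_candidate(read: ReadAnnotation, min_non_alpha: int = MIN_NON_ALPHA_BP) -> bool:
    """Keep reads with some alpha satellite plus enough non-alpha sequence."""
    a = alpha_bp(read)
    return a > 0 and (read.length - a) >= min_non_alpha

reads = [
    ReadAnnotation("read_a", 18_000, [(0, 6_000)]),      # 12 kb non-alpha: kept
    ReadAnnotation("read_b", 15_000, [(0, 15_000)]),     # all alpha satellite: dropped
    ReadAnnotation("read_c", 12_000, []),                # no alpha satellite: dropped
    ReadAnnotation("read_d", 14_000, [(2_000, 9_000)]),  # only 7 kb non-alpha: dropped
]
candidates = [r.name for r in reads if is_junction_candidate(r)]
print(candidates)
```

Only reads that carry both sequence classes survive, which is what makes them useful for anchoring satellite arrays to neighbouring non-satellite DNA.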

Identify Junctions with seemingly unique DNA

Primase, DNA, polypeptide 2 (prim2), mRNA

chr3/chr6 Paralogous (non-sat) Region, ~300 kb

CHM1: LJ1101000307.1

Full Coverage (~60x)

Low Coverage (10x)

“Unmapped” Database

PacBio Long Read Sequences to Predict Satellite Sequence Variants (satVARs)

CHM1 Genome SatVAR Discovery: profile of satellite DNA variants

CHM1, CHM13

TRIO data sets (CEPH and GIAB)

Acknowledgements

Volkan Sevim

Jason Chin

Ali Bashir

github.com/volkansevim/alpha-CENTAURI





1000 Genomes sequence data (400 male individuals)
