The Segway annotation of ENCODE data

Michael M. Hoffman

Department of Genome Sciences

University of Washington

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Functional genomics

ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.

Chromatin immunoprecipitation (ChIP)

Park PJ 2009. Nat Rev Genet 10:669.

ChIP sequence

sequence signal: Wiggler

• Extends tags in strand direction

• Extension length determined by cross-correlation peak

• Signal only in mappable regions

• 1-bp resolution
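The tag-extension step listed above can be illustrated with a minimal sketch. This is not the Wiggler implementation: the function name `extend_tags`, the tag representation, and the fixed `extension` value are assumptions made for illustration, and both the cross-correlation estimate of the extension length and the mappability filter are omitted.

```python
def extend_tags(tags, extension, chrom_length):
    """Return per-base coverage after extending each tag in its strand direction.

    tags: iterable of (five_prime_position, strand) with 0-based positions
    and strand "+" or "-" (a hypothetical input format).
    """
    coverage = [0] * chrom_length
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + extension          # extend toward higher coordinates
        else:
            start, end = pos - extension + 1, pos + 1  # extend toward lower coordinates
        # Clip to the chromosome and add one count per covered base (1-bp resolution).
        for i in range(max(start, 0), min(end, chrom_length)):
            coverage[i] += 1
    return coverage


signal = extend_tags([(2, "+"), (7, "-")], extension=4, chrom_length=10)
print(signal)
```

In this toy input the two extended tags overlap at positions 4 and 5, which is where the signal is highest.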

Anshul Kundaje

http://align2rawsignal.googlecode.com/

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Fine-scale data

[Figure: signal tracks. Labels: 300 bp; histone modifications (H3K4me2, H3K27me3); transcription factors (Pol2 (Myers), Sin3Ak-20).]

Maher B 2012. Nature 489:46.

data sets

Now what?

Overview

1. ENCODE Project

3. Chromatin

4. RNA-seq

Semi-automated annotation

signal tracks

interpretation

visualization

annotation

pattern discovery

Genomic segmentation

Nonoverlapping segments

Finite number of labels

Maximize similarity in labels
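As a small illustration of the first two properties above, a per-position label sequence can be collapsed into nonoverlapping segments. The function name `labels_to_segments` and the half-open coordinate convention are assumptions for this sketch, not part of the slides.

```python
from itertools import groupby


def labels_to_segments(labels):
    """Collapse per-position labels into (start, end, label) tuples.

    Coordinates are 0-based and half-open, so consecutive segments
    share a boundary but never overlap.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, start + length, label))
        start += length
    return segments


segments = labels_to_segments([0, 0, 0, 1, 1, 0, 2, 2])
print(segments)
```

Every position falls in exactly one segment, and the same label can be reused by segments in different places.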

Bayesian network for ChIP-seq

Observed random variable (continuous): signal at position t

Hidden random variable (discrete): transcription factor present at position t?

0: transcription factor is not present

1: transcription factor is present

Emission probability parameters: (µ0, σ0) and (µ1, σ1)

Conditional relationship:

Xt | Qt = 0 ~ N(µ0, σ0)

Xt | Qt = 1 ~ N(µ1, σ1)
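A minimal sketch of how the two Gaussian emission distributions above can be used at a single position: Bayes' rule turns an observed signal value into the probability that the transcription factor is present. The parameter values, the prior, and the function names are hypothetical, and σ is treated here as a standard deviation.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma as the standard deviation."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def posterior_present(x, params, prior_present=0.5):
    """P(Q_t = 1 | X_t = x) for one position, by Bayes' rule."""
    joint_absent = gaussian_pdf(x, *params[0]) * (1 - prior_present)
    joint_present = gaussian_pdf(x, *params[1]) * prior_present
    return joint_present / (joint_absent + joint_present)


# Hypothetical emission parameters: state -> (mu, sigma).
params = {0: (0.0, 1.0), 1: (4.0, 1.0)}

for x in (0.0, 2.0, 4.0):
    print(x, round(posterior_present(x, params), 4))
```

With these assumed parameters, a signal halfway between the two means is equally likely under both states, and values near either mean give a confident call.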

Bayesian network: 2 positions

[Figure: the network at two adjacent positions, each with emission parameters (µ0, σ0) and (µ1, σ1). Legend: discrete, continuous.]

transition probability parameter

P(Qt+1 = 0 | Qt = 0) = 0.99

P(Qt+1 = 1 | Qt = 0) = 0.01

P(Qt+1 = 0 | Qt = 1) = 0.01

P(Qt+1 = 1 | Qt = 1) = 0.99
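The effect of these transition probabilities can be sketched with a small Viterbi decoder over the two states. The transition values are the ones above; the emission parameters, the uniform initial distribution, and all function names are assumptions for illustration, and this is not how Segway itself performs inference.

```python
import math

TRANSITION = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.01, 1: 0.99}}  # values from the slide
EMISSION = {0: (0.0, 1.0), 1: (4.0, 1.0)}  # hypothetical (mu, sigma) per state
INITIAL = {0: 0.5, 1: 0.5}                 # assumed uniform start


def log_gaussian(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(signal):
    """Return the most probable state sequence for a list of signal values."""
    states = list(TRANSITION)
    score = {q: math.log(INITIAL[q]) + log_gaussian(signal[0], *EMISSION[q]) for q in states}
    backpointers = []
    for x in signal[1:]:
        pointers, new_score = {}, {}
        for q in states:
            # Best predecessor of state q at this position.
            prev = max(states, key=lambda p: score[p] + math.log(TRANSITION[p][q]))
            pointers[q] = prev
            new_score[q] = score[prev] + math.log(TRANSITION[prev][q]) + log_gaussian(x, *EMISSION[q])
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 0.0, 0.0]))
print(viterbi([0.0, 0.0, 2.5, 0.0, 0.0]))
```

Because switching states is penalized, a sustained run of high signal is decoded as "present", while a single moderately high value surrounded by background stays "not present".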

Dynamic Bayesian network (DBN)

[Figure: the network repeated across positions, with emission parameters (µ0, σ0) and (µ1, σ1) at each position. Legend: discrete, continuous.]

Dynamic BN for segmentation

[Figure: a segment variable with observed tracks H3K36me3 and DNaseI. Legend: discrete, continuous.]

Heterogeneous missing data

Hoffman MM et al. 2012. Nat Methods 9:473.

Handling missing data

[Figure: network with a segment variable, observed tracks (H3K36me3, DNaseI), and the variables present(CTCF), present(H3K36me3), present(DNaseI). Node legend: discrete, continuous. Edge legend: conditional, switching.]
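One way to read the present(·) variables is as switches: when a track has no data at a position, it simply does not contribute to the likelihood of the segment label. The sketch below illustrates that idea only; the function names, the use of `None` for missing values, and the parameter values are assumptions, not Segway's implementation.

```python
import math


def log_gaussian(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_emission(observation, label_params):
    """Log-likelihood of one position's observations under one segment label.

    observation: track name -> value, or None when the track is missing here.
    label_params: track name -> (mu, sigma) for this label (hypothetical values).
    """
    total = 0.0
    for track, value in observation.items():
        if value is None:  # present(track) = 0: the track is switched off
            continue
        mu, sigma = label_params[track]
        total += log_gaussian(value, mu, sigma)
    return total


label_params = {"CTCF": (0.5, 1.0), "H3K36me3": (2.0, 1.0), "DNaseI": (1.0, 0.5)}
full = {"CTCF": 0.4, "H3K36me3": 2.2, "DNaseI": 1.1}
partial = {"CTCF": 0.4, "H3K36me3": 2.2, "DNaseI": None}

print(log_emission(full, label_params))
print(log_emission(partial, label_params))
```

Under this scheme, positions with partial data can still be scored against every label without imputing the missing values.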

Length distribution

[Figure: the segmentation network with additional variables: segment countdown, segment transition, ruler, frame index.]

• Minimum segment length

• Maximum segment length

• Trained geometric length distribution

• Dirichlet prior on segment length

• Weight of prior versus observed data
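To make the first three controls above concrete, here is a minimal sketch of a geometric length distribution restricted to a minimum and maximum segment length. The function name, the renormalization over the allowed range, and the parameter values are assumptions for illustration; the sketch does not model the countdown mechanism or the Dirichlet prior.

```python
def segment_length_pmf(stay_prob, min_length, max_length):
    """P(length) for lengths in [min_length, max_length].

    Beyond the minimum, each extra position is kept with probability
    stay_prob (geometric decay); the weights are renormalized so that
    they sum to 1 over the allowed range.
    """
    weights = {
        length: stay_prob ** (length - min_length) * (1 - stay_prob)
        for length in range(min_length, max_length + 1)
    }
    total = sum(weights.values())
    return {length: weight / total for length, weight in weights.items()}


pmf = segment_length_pmf(stay_prob=0.99, min_length=3, max_length=500)
expected_length = sum(length * p for length, p in pmf.items())
print(round(expected_length, 1))
```

For comparison, an unrestricted geometric distribution with a self-transition probability of 0.99 has a mean length of 1 / (1 - 0.99) = 100 positions.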

Segway

A way to segment the genome

http://noble.gs.washington.edu/proj/segway/

Overview

1. ENCODE Project

3. Chromatin

4. RNA-seq

[Figure: cell lineage tree. Nodes: embryoblast, mesendoderm, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, lymphoblast, cervix.]

• H1 hESC: embryonic stem cell

• HeLa-S3: cervical carcinoma cell

• HepG2: hepatocellular carcinoma cell

• HUVEC: umbilical vein endothelial cell

• K562: chronic myeloid leukemia cell

• GM12878: lymphoblastoid

49 tracks

• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq

• 8 different labs

Input tracks

25 labels

Picking the number of labels

Emission parameters

Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. Standard deviation is proportional to the length of the black
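The row normalization described above can be sketched as min-max scaling of each track's means across labels. The slide states only that the highest mean for a track maps to red and the lowest to blue, so the linear scaling in between, the function name, and the example values are assumptions.

```python
def row_normalize(means):
    """Rescale each track's per-label means to the range [0, 1].

    means: track name -> list of mean values, one per label.
    0.0 corresponds to the lowest mean for that track (blue) and
    1.0 to the highest (red).
    """
    normalized = {}
    for track, values in means.items():
        low, high = min(values), max(values)
        span = high - low
        # A track with identical means everywhere has no contrast to show.
        normalized[track] = [(v - low) / span if span else 0.0 for v in values]
    return normalized


# Hypothetical means for three labels.
means = {"H3K36me3": [1.0, 2.0, 3.0], "DNaseI": [5.0, 5.0, 9.0]}
print(row_normalize(means))
```

Normalizing per track makes colors comparable across labels within a row, but not between rows, since each track has its own scale.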

TSS transcription start site

GS gene start

GM gene middle

GE gene end

E enhancer

I insulator

R repression

D dead

Transcription start site (TSS)

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Rediscovering genes

Zooming out 10×

TSS segments occur near 5' ends of genes

TSS/G* segments missing in gene deserts

segments occur more in gene deserts

3' gene ends

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Jason Ernst

Lots of genes but very few TSS/GS segments. Why?

Because these genes are not expressed in K562.

Experimental validation

http://switchgeargenomics.com/products/promoter-reporter-collection/

Testing <1000 bp sequences for promoter activity

• predicted + in K562

• predicted – in K562

• predicted + in GM12878

• predicted – in GM12878

Comparison with GWAS catalog

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Bob Harris, Ross Hardison

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:

• A simple annotation with a single label for each part of the genome.

• Visualization reducing multivariate data to a comprehensible representation.

• Interpretation of the context and potential regulatory impact of variants.

Software availability

• Segway: data tracks to segmentation

http://noble.gs.washington.edu/proj/segway/

• Segtools: segmentation plots and summary statistics

Buske OJ et al. 2011. BMC Bioinformatics 12:415.

http://noble.gs.washington.edu/proj/segtools/

• Genomedata: efficient access to numeric data anchored to genome

Hoffman MM et al. 2010. Bioinformatics 26:1458.

http://noble.gs.washington.edu/proj/genomedata/

Acknowledgments

Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen

University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.

University of Massachusetts Medical School: Zhiping Weng.

SwitchGear Genomics: Patrick Collins.

Stanford University: Anshul Kundaje.

Pennsylvania State University: Ross Hardison, Bob Harris.

European Bioinformatics Institute: Ewan Birney, Ian Dunham.

University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.

Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.

CRG: Sarah Djebali.

RIKEN: Timo Lassmann.

ENCODE Project Consortium.

NIH/NHGRI: K99HG006259, U54HG004695.
