processing, integrating and analysing chromatin...

Post on 07-Aug-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Processing, integrating and analysingchromatin immunoprecipitation followed

by sequencing (ChIP-seq) data

DATA SCIENCE AND MACHINE LEARNING FOR BIOINFORMATICS

ALEX ESSEBIER

The regulatory system

Dynamic, sequential gene expression changes required for normal development

Histonemodifications

Chromatinaccessibility Chromatin state

Sequence features

Transcription factorbinding

Cis regulatorymodules

Gene expression

Distal or proximalinteractions

• Chromatin immunoprecipitation (ChIP-seq)

• To determine protein binding sites in vivo

• Transcription factor binding & histone modifications

• DNase I hypersensitivity (DHS) (also ATAC-seq and FAIRE-seq)

• Chromatin accessibility

• Protein binding footprints

High throughput sequencing (HTS) data

• RNA-seq and cap analysis gene expression (CAGE)• Gene expression

• Alternate transcription start site usage

• Changes in expression (temporal or perturbation)

High throughput sequencing (HTS) data

• Chromosome conformation capture • DNA looping

• Long distance enhancer/promoter interactions• 3C, 4C, 5C, Hi-C, CHi-C, ChIA-PET…

High throughput sequencing (HTS) data

ChIP-seq

GENERATING AND INTERPRETING CHIP-SEQ DATA

ChIP-seq experiment

Wet lab

• Extract DNA fragments bound by protein of interest

ChIP-seq sequence analysis

• Sequence depth

– Depends on size of genome and type of protein

• Mammalian transcription factor

20 million reads

• Sequence quality

– Read quality summary

– Unusual base pair patterns

– Adapter sequences

– E.g. FastQC tool

Sequencing & quality control

ChIP-seq alignment

Alignment & quality control

• Alignment summary

– Generated by alignment tools

– Uniquely aligned reads

ChIP-seq alignment

Alignment & quality control

• Alignment summary

– Generated by alignment tools

– Uniquely aligned reads

• Read distribution quality

– ChIP-seq peaks create bimodal pattern

– Strand cross correlation analysis (SCCA)

Peak calling

GENERATING AND EXPLORING CHIP-SEQ PEAKS

Basic principles of peak calling

InputNo antibody exposure

Sample Exposed to antibody

Peak With statistical significance

Compared to

To generate

Peak calling tools

• Everyone seems to have built their own!

• Omic Tools reports 86 ChIP-seq tools• 51 in 2016

• In-house tools

• Find or adapt a tool that fits your needs

• Potentially, develop your own

Be aware!Not all tools are well documented or tested

https://xkcd.com/927/

So, how do I choose a tool?

• Choice of tool depends on problem• Based on experiment itself

• Based on sequence data and read distribution

Combining tools and replicates

Peak caller Total Unique

MACS2 42,536 12%

HOMER 45,044 19%

SPP 19,474 0.7%

• Use multiple peak callers • Combine outcomes (e.g. common peaks)

• Take advantage of strengths of different peak callers• Bolster weaknesses

• Biological replicates can vary significantlyCall peaks for replicates individually

Compare/overlap to achieve

‘golden standard’

• Comparisons are dominated by poor replicate

10,000 peaks

3,000 peaks

Peak quality control• Number of peaks

• Motif enrichment

• Read coverage• Fraction of reads in peaks (FRiP) > 1%• Generally observe > 10%

• E.g. below 6/50 reads in peak -> 12%

5T0U (PDB)Hashimoto et al. 2017

CTCF and DNA

ChIP-seq complications

• ChIP-seq generates peaks for all of these events

Integrating data

COMBINING DATA SETS TO IMPROVE OUTCOMES

Data integration

• Experiments capture dependent regulatory events • ChIP-seq – regulatory elements

• DHS – chromatin accessibility

• RNA-seq – expression patterns

• Consider multiple datasets to:• Improve confidence

• Improve understanding

• Support hypotheses

https://marketoonist.com/2014/01/big-data.html

Supporting histone modifications

• Explore chromatin environment • Layer/overlap histone modifications

• DHS – chromatin accessibility

Supporting transcription factors

• Transcription factors preferentially bind open/active chromatin and regulatory regions

• Alter expression of genes

• RNA-seq on knock-out of transcription factor• Identify genes with significant change in expression

System complexity

• Small number of differentially expressed genes are bound by target transcription factor

• System redundancy

• Indirect changes in expression

Ma et al. 2014

More complicated relationships

• Overlapping is simple

• Next steps: pattern identification, prediction, classification

• Machine learning approaches

• Hidden Markov model to classify epigenetic state

• Bayesian network to predict transcription factor binding events

• Random forest of decision trees to predict long distance enhancer/promoter interactions

Take home messages

• Understand your data and how best to use it

• Quality control!

• Peak calling• Use multiple tools where possible

• Keep up to date with advances

• Data integration• Combine resources and data to gain a more complete picture

Resources

• Data/figures• Klisch, T. J., Xi, Y., Flora, A., Wang, L., Li, W., & Zoghbi, H. Y. (2011). In vivo Atoh1 targetome reveals how a proneural transcription

factor regulates cerebellar development. Proceedings of the National Academy of Sciences,108(8), 3288-3293. • Frank, C. L., Liu, F., Wijayatunge, R., Song, L., Biegler, M. T., Yang, M. G., ... & West, A. E. (2015). Regulation of chromatin accessibility

and Zic binding at enhancers in the developing cerebellum. Nature neuroscience, 18(5), 647-656. • Ma, S., Kemmeren, P., Gresham, D., & Statnikov, A. (2014). De-novo learning of genome-scale regulatory networks in S.

cerevisiae. Plos one, 9(9), e106479.• Hashimoto, H., Wang, D., Horton, J. R., Zhang, X., Corces, V. G., & Cheng, X. (2017). Structural basis for the versatile and methylation-

dependent binding of CTCF to DNA. Molecular cell, 66(5), 711-720.

• Useful papers • Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., ... & Zhang, J. (2013). Practical guidelines for the comprehensive

analysis of ChIP-seq data. PLoS Comput Biol, 9(11), e1003326. • Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., ... & Chen, Y. (2012). ChIP-seq guidelines and

practices of the ENCODE and modENCODE consortia. Genome research, 22(9), 1813-1831. • Farnham, P. J. (2009). Insights from genomic profiling of transcription factors.Nature Reviews Genetics, 10(9), 605-616. • Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., ... & Liu, X. S. (2008). Model-based analysis of ChIPSeq

(MACS).Genome biology, 9(9), 1. • Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., ... & Glass, C. K. (2010). Simple combinations of lineagedetermining

transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular cell, 38(4), 576-589.

ChIP-seq complications

• Possible to observe multiple states at one genomic location

• False negatives• Can’t detect small sub-populations

• False positives• General non-specific chromatin being pulled down

• Bias not removed by input

• Replicates can resolve variation

Transcription Factors

• Confirm in vitro and in silico results• Overlapping peaks with motifs

• Identify consensus motif• For transcription factors which do not have an existing/known motif

• To identify variations in motif

• Differential peak binding• To identify differences in binding patterns

• Compare cell types or time points

Histone Modifications

• Epigenetic analysis• Generate epigenetic profiles

• Identify chromatin states genome wide• E.g. ChromHMM

• Identify regulatory modules • E.g. promoters or enhancers

• Differential peak binding• Identify differences in epigenetic patterns

Long distance regulation

• Chromosome conformation capture reports different types of interactions

• Histone modifications can identify enhancer-promoter interactions

• Filter out structural interactions

top related