Introduction to genome biology

Lisa Stubbs

We’ve found most genes; but what about the rest of the genome?

Most notably:
•  Coding gene number is relatively constant in metazoans, BUT
•  Number of alternative transcripts per gene and gene density are not
   –  Each gene gives rise to many more isoforms: protein sequence diversity
   –  Much more non-coding DNA, including gene regulatory DNA

Genome size*     12 Mb      95 Mb      170 Mb      1500 Mb     2700 Mb      3200 Mb
# coding genes   ~7000      ~20000     ~14000      ~26000      ~23000       ~21000
# transcripts    ~7000      ~50000     ~29000      ~53000      ~93000       ~200000
bp/gene          1714 bp    4750 bp    12143 bp    57692 bp    117381 bp    152381 bp

*Data taken from the ENSEMBL genome browser, www.ensembl.org
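The per-gene figures in the table are simple ratios, and the same arithmetic gives transcripts per gene, which is the quantity the slide says is not constant. A minimal sketch, using only the rounded numbers quoted above; the function name is hypothetical:

```python
def per_gene_stats(genome_size_mb, coding_genes, transcripts):
    """Return (bp per gene, transcripts per gene) from the rounded table values."""
    bp_per_gene = genome_size_mb * 1_000_000 / coding_genes
    transcripts_per_gene = transcripts / coding_genes
    return round(bp_per_gene), round(transcripts_per_gene, 1)

# Smallest and largest genomes from the table above.
smallest = per_gene_stats(12, 7000, 7000)
largest = per_gene_stats(3200, 21000, 200000)
print(smallest, largest)
```

With these inputs, the smallest genome has about one transcript per gene and the largest about 9.5, while the DNA per gene grows roughly 90-fold.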

Most traditional studies have focused on promoters and nearby (proximal) enhancers

•  Promoter regions are most likely to be involved in recruiting RNA polymerase and related proteins
   –  TATA-binding protein (TBP) and TBP-associated factors (TAFs)
   –  General transcription factors (GTFs)
   –  Mediator complexes
•  Some transcription factors (TFs) are also more likely to be found at promoter sites
   –  SP1 and the E2F family are classical examples
•  BUT most other metazoan TFs are found preferentially at distant sites
   –  Introns, intergenic regions
   –  Some may be 100s or 1000s of bp from the target promoter, or even embedded within neighboring genes

Transcription factors and their binding sites

•  Most known TFs have short and variable binding sites, e.g. YY1, SP1, Mzf1
   [Figure: binding-site motifs for YY1, SP1 and Mzf1]
•  BUT the probability of finding a string such as the Yy1 “core” (even as a simple string, rather than a matrix) is (1/4)^4 = 1/256, i.e. about one chance match every 256 bp!
   –  Most TFBS are not much more specific than this!
•  So, how to raise the probability that the site you find is functional?
   1.  Interspecies conservation: sites that are found in similar locations in diverse species are more likely to be functional
   2.  Site clustering: most TFs bind as homo- or heterodimers that significantly stabilize binding and influence function
   3.  Location within regions that are known to be in an “open” state in the cell type and conditions of interest
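The 1/256 figure can be turned into an expected number of chance matches per genome. This is a back-of-the-envelope sketch that assumes equal base frequencies and independent positions, which real genomes violate; the function names are hypothetical.

```python
def chance_match_probability(site_length, alphabet_size=4):
    """Probability that a fixed string of this length matches at one position."""
    return (1 / alphabet_size) ** site_length

def expected_chance_matches(site_length, genome_size_bp, both_strands=False):
    """Expected number of chance matches when scanning every start position.

    Counting both strands doubles the total, which over-counts palindromic
    sites; that simplification is accepted here.
    """
    positions = genome_size_bp - site_length + 1
    strands = 2 if both_strands else 1
    return positions * strands * chance_match_probability(site_length)

# A 4-bp core scanned across a 3200 Mb genome (single strand).
print(expected_chance_matches(4, 3_200_000_000))
```

Under these assumptions a 4-bp core is expected roughly 12.5 million times in a 3200 Mb genome, which is why conservation, clustering, and chromatin state are needed as additional filters.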

How to find the regulatory needles in the haystack?

•  Vertebrate genomes are mostly non-coding
   –  ~2% coding; ~5% noncoding and evolutionarily conserved (at the DNA sequence alignment level)
•  Websites to view pre-aligned sequence conservation levels abound; e.g. the ECR browser http://ecrbrowser.dcode.org/
•  zPicture and Mulan provide “do it yourself” tools for pairwise or multi-sequence alignments of up to 1 Mb; http://zpicture.dcode.org/, http://mulan.dcode.org/
•  All three tools allow detection of conserved TFBS from Transfac, Jaspar, and other databases

Conserved motifs are more likely to be functional…

•  As long as the biology you are interested in is also conserved
   –  Important to consider the appropriate species for comparisons

[Figure: ECR details, Step 2]

[Figure: Summary of conserved TFBS]

[Figure: Spatial display of conserved TFBS]

Focusing on accessible chromatin

•  Even well-conserved motifs cannot be accessed in closed regions of chromatin

[Figure: accessible chromatin (e.g. H3K27ac) versus not accessible chromatin (e.g. H3K9me3, H3K27me3)]

How to find active elements? Chromatin immunoprecipitation with TF and histone-modification antibodies

•  Chromatin and attendant proteins are chemically crosslinked (lightly) using formaldehyde
   –  Crosslinking will also attach proteins to each other, so that detection of secondary chromatin interactions is inevitable
•  Cross-linked chromatin is randomly sheared by sonication (average fragment size 200-500 bp)
•  Sonicated fragments in solution are exposed to a protein-specific antibody
•  Antibody is retrieved with DNA still attached
•  DNA is released with salt and heat (reverses the crosslinks)
   –  Library is created for sequencing: ligation of “tags” and light PCR amplification
   –  Sequenced directly, e.g. Illumina sequencing


Sequence-based ChIP approaches…

•  Harness ChIP, DNase sensitivity, and other assays to Illumina sequencing
   –  ChIP-enriched DNA is ligated to Illumina linkers and sequenced directly
   –  If your experiment works, you’ve enriched a very small fraction of the genome
   –  Requires a lot of input chromatin! Traditional methods need ~10^7 cells per experiment!!
   –  Critical step is an efficient, selective antibody (and very few exist)

ChIP computational issues

•  Sequence is read from the randomly positioned ends of multiple, overlapping, randomly sheared fragments
   –  Reads will be scattered around a distance ~2X shear fragment length
   –  ChIP-seq reads surround but may not contain the DNA binding site
•  Computational tools (like MACS) need to join adjacent sets of read peaks and define a “shift” distance between read peaks to determine a summit

[Figure: binding site, overlapping ChIP fragments, and sequence reads]
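The “shift” idea can be sketched in a few lines. This is a toy illustration, not the algorithm MACS implements: it assumes you already have the 5' start coordinates of forward- and reverse-strand reads around one candidate site, and the function names and the `max_shift` limit are hypothetical.

```python
from collections import Counter

def estimate_shift(forward_starts, reverse_starts, max_shift=500):
    """Find the offset d (in bp) at which forward-strand 5' ends, moved
    downstream by d, coincide most often with reverse-strand 5' ends."""
    reverse_counts = Counter(reverse_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        # Count forward reads that land on a reverse read after shifting by d.
        score = sum(reverse_counts[pos + d] for pos in forward_starts)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift

def estimate_summit(forward_starts, reverse_starts, shift):
    """Move each strand's reads half the shift toward the other strand and
    average the resulting positions to get a summit estimate."""
    half = shift / 2
    centered = [p + half for p in forward_starts] + [p - half for p in reverse_starts]
    return sum(centered) / len(centered)

# Hypothetical read starts around a single site (coordinates are made up).
# Reverse-strand 5' ends sit at the right-hand end of each fragment.
forward = [100, 105, 110]
reverse = [300, 305, 310]
shift = estimate_shift(forward, reverse)
summit = estimate_summit(forward, reverse, shift)
print(shift, summit)
```

The point of the sketch is that neither strand's read pile marks the binding site on its own; the site lies between the two piles, about half the estimated shift from each.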

Analytical considerations

•  Genomic neighborhoods
   –  Shear efficiency is not really “random”
      •  Some genomic regions are fragile and sensitive; some are protected
      •  Chromatin-matched, co-sheared controls are essential
      •  Most peak finders are strongly biased toward comparing control and experimental samples with similar numbers of reads
•  Repeatability is key
   –  Biological, or at least technical, replicates are also essential
   –  Artifactual peaks are very easy to generate!
   –  Other ways to validate:
      •  Known targets
      •  Known motifs
      •  Similar targets in different cell types or tissues
•  Peak width
   –  Transcription factors typically yield sharp peaks; chromatin marks are sometimes broader and more diffuse
•  User-friendly tools
   –  MACS:
      •  “Model-based” peak detection; sensitive to peak enrichment and background
      •  Zhang et al., Genome Biology 2008; Feng et al. 2012, Nat Protocols, PMID: 22936215 (Xiaole Liu lab)
      •  MACS1 is best for sharp peaks (TFs); will break diffuse peaks into smaller regions
      •  MACS2 is designed to allow broad- or sharp-peak detection
   –  HOMER (http://homer.salk.edu/homer)
      •  Can be easily tweaked for more sensitive peak detection
      •  Comes packaged with a rich set of peak annotation tools
      •  Tools for DNase-seq, Hi-C, differential ChIP analysis and many more
   –  Both tools permit generation of “wiggle files” or similar that can be viewed in the UCSC browser
•  Looking at your data is a very important step! Peak finders can miss peaks that you can easily see by eye!
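One simple way to act on the replicate requirement is to keep only peaks that are called in both replicates. The sketch below assumes peaks are available as (chromosome, start, end) tuples with half-open coordinates, as in BED files; the function names are hypothetical, and the all-against-all comparison is meant for small examples rather than genome-scale peak sets.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def reproducible_peaks(rep1, rep2):
    """Return the peaks from rep1 that overlap at least one peak in rep2."""
    return [p for p in rep1 if any(overlaps(p, q) for q in rep2)]

# Hypothetical peak calls from two replicates (coordinates are made up).
rep1 = [("chr15", 100, 300), ("chr15", 1000, 1200), ("chr17", 100, 300)]
rep2 = [("chr15", 250, 400), ("chr17", 500, 700)]
shared = reproducible_peaks(rep1, rep2)
print(shared)
```

Peaks that fail this filter are not necessarily artifacts, so it is still worth viewing them in the browser before discarding them.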

Differential ChIP and connection to differential expression

•  Just like differential sequence analysis

–  comparison requires rigorous normalization

•  Normalization is complicated for ChIP

–  peak height? Peak shape? Summit position? Read density? Local neighborhoods?

–  Not as simple as an intensity score or a yes/no count

•  Chromatin dynamics and expression dynamics

–  *might* or *might not* be temporally coordinated
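As a starting point only, read density in a fixed region can be scaled to library size and compared between conditions. This sketch deliberately addresses just one of the questions listed above (read density) and ignores peak shape, summit position, and local background; the function names and the pseudocount are assumptions, not part of any published method.

```python
import math

def reads_per_million(reads_in_region, total_mapped_reads):
    """Scale a raw read count in a region to reads per million mapped reads."""
    return reads_in_region * 1_000_000 / total_mapped_reads

def log2_fold_change(exp_reads, exp_total, ctl_reads, ctl_total, pseudocount=1.0):
    """log2 ratio of depth-normalized read density, experimental over control.

    The pseudocount avoids division by zero and damps ratios from regions
    with very few reads.
    """
    exp_rpm = reads_per_million(exp_reads, exp_total)
    ctl_rpm = reads_per_million(ctl_reads, ctl_total)
    return math.log2((exp_rpm + pseudocount) / (ctl_rpm + pseudocount))

# Hypothetical counts: twice the raw reads, but from a library twice as deep.
change = log2_fold_change(200, 20_000_000, 100, 10_000_000)
print(change)
```

In this made-up case the raw counts differ twofold but the normalized change is zero, which illustrates why unnormalized comparisons between ChIP libraries are misleading.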

[Figure: UCSC genome browser view, mouse mm9, chr15, coordinates 76,304,000 to 76,313,000 shown (Bop1 and Hsf1 locus; 5 kb scale bar). Custom tracks: frontal cortex ChIP for H3K4me3 (120 min, 1M cells), H3K27ac (30 min, 5M cells), H3K4me1 (120 min, 4M cells), and H3K27me3 (120 min, 5M cells), each as control versus experimental samples 1+2. Reference tracks: UCSC Genes, mouse mRNAs from GenBank, spliced ESTs, and ENCODE/LICR cortex 8w H3K4me3, H3K27ac, and H3K4me1 signal and peaks. One region is marked with a “?”.]

Data from ChIP with TFs, modified histones, and other proteins are available for human (and to some degree, mouse and flies) as tables in the UCSC genome browser (www.genome.ucsc.edu)

From Hoffman et al., Nucl Acids Res 41:827, 2013

Yet another example of why you should “look at your data”

[Figure: UCSC genome browser view, mouse mm9, chr17, coordinates 35,095,000 to 35,105,000 shown (Hspa1b and Hspa1a locus; 5 kb scale bar). Tracks: mouse mRNAs, spliced ESTs, and frontal cortex ChIP for H3K4me3, H3K27ac, H3K4me1, and H3K27me3 in control (CK) versus experimental (EX) samples.]

Transposon-based alternatives

•  These tools address an important issue:
   –  Library preps fail unless you start with significant ChIP input
   –  How to work with samples for which millions of cells are not available?
•  Solution
   –  Library prep without linker ligation
   –  A transposon brings in the essential Illumina (or other) primers
   –  Library prep is completed simply with PCR
   –  The need for substantial input DNA is removed

[Figure: Tn5 transposase insertion (e.g. Illumina library oligos), tagmentation, continued reaction, PCR, ready to sequence]

ChIP tagmentation

•  Regular ChIP prep

•  Treat with transposase and tag oligos while chromatin is still on the beads

•  Release after tagmentation, PCR, size-select and sequence (no library prep!)

Issues related to tagmentation

•  Ratio of DNA:transposase
   –  Has to be adjusted for each cell type and chromatin prep
•  Need even fragmentation to avoid bias, and small enough fragments, in general, for Illumina
•  Need to avoid making fragments too small
•  Bias observed in DNA: controls are complicated
•  Solution in “ChIPmentation”
   –  Tagmentation while DNA is still protected by the antibody and cross-linked chromatin, still on the bead
      •  Protects from over-tagmentation, thus allowing a full digestion without fear of losing the DNA
      •  Allows the protocol to work over a 25X range of DNA:transposon and lessens worries about time

Illumina-owned kit is expensive but…

Genome Res 24:2033–2040

Genome Biology Topic overview

•  Lectures
   –  Ross Hardison
      •  Basics of gene regulation, epigenetics and ENCODE results
   –  David Hawkins
      •  Chromatin states, biological applications
   –  James Taylor
      •  Higher dimension chromatin structure
   –  Lisa Stubbs
      •  Integrating data for biological inference: basics of expression correlation methods
•  Workshops
   –  Bowtie and MACS on Galaxy
   –  Peaks to features in Galaxy
   –  Bowtie and MACS / Tophat->Cuffdiff on the command line
   –  Monday: student’s choice
      •  “How to” for ECR browser and zPicture (sequence alignments and conserved motifs)
      •  Simple methods for expression correlation: Cluster and Cytoscape
      •  ChIP peaks to MEME-ChIP (online connection to the MEME Suite for large peak sets)
      •  DAVID functional clustering analysis (GO and pathway analysis tools online)