Using vertebrate genome comparisons to find gene regulatory regions Ross Hardison and James Taylor Cold Spring Harbor course on Computational Genomics Nov. 10, 2007


Page 1

Using vertebrate genome comparisons to find gene regulatory regions

Ross Hardison and James Taylor
Cold Spring Harbor course on Computational Genomics
Nov. 10, 2007

Page 2

Major goals of comparative genomics

• Identify all DNA sequences in a genome that are functional
  – Selection to preserve function
  – Adaptive selection

• Determine the biological role of each functional sequence

• Elucidate the evolutionary history of each type of sequence

• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research

Page 3

Types of sequences in mammalian genomes

• About 1.5-2% codes for protein
  – Almost all shows a signature of purifying selection since the primate-rodent divergence
  – Does not preclude positive selection acting on smaller regions or in specific lineages

• About 45% is interspersed repeats
  – 22% in ancestral repeats
    • Good model for neutral DNA
  – 23% in lineage-specific repeats

• About 53% is noncoding, nonrepetitive
  – Minimum of 4% of genome is under purifying selection for a function common to mammals, but does NOT code for protein
    • Regulatory sequences
    • Non-protein coding genes
    • Other important sequences
  – About 49% under no obvious selection: no conserved function?

Page 4

Impact of whole-genome alignments

Guide to functional sequences in the human genome.

Better gene predictions

Sequences under purifying selection

Conserved sequences

Sequences that look like elements that regulate gene expression

Page 5

Three modes of evolution

Page 6

Negative and positive selection observed at different phylogenetic distances

Page 7

Genome-wide local alignment chains

H: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.

blastZ: Each segment of human is given the opportunity to align with all mouse sequences.

Run blastZ in parallel for all human segments. Collect all local alignments above threshold.

Organize local alignments into a set of chains based on position in assembly and orientation.

[Figure labels: Human; Mouse; Level 1 chain; Level 2 chain; Net]
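The chaining step can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the procedure used to build the chains shown on the slide: the `Block` record, the `best_chain` function, the example coordinates, and the scores are all hypothetical; no gap penalty is charged between blocks; and minus-strand blocks are assumed to carry coordinates already expressed on the reverse strand, so collinearity is tested the same way for both orientations.

```python
from collections import namedtuple

# One local alignment: half-open intervals in human (h) and mouse (m),
# an orientation, and an alignment score. Field names are illustrative.
Block = namedtuple("Block", "h_start h_end m_start m_end strand score")

def best_chain(blocks, strand="+"):
    """Return the highest-scoring collinear chain of blocks on one strand.

    A block may follow another only if it starts at or after the end of
    the previous block in BOTH genomes (same order, no overlap).
    """
    cand = sorted((b for b in blocks if b.strand == strand),
                  key=lambda b: (b.h_start, b.m_start))
    if not cand:
        return []
    best = [b.score for b in cand]   # best chain score ending at each block
    prev = [None] * len(cand)        # back-pointers for traceback
    for j, b in enumerate(cand):
        for i in range(j):
            a = cand[i]
            collinear = a.h_end <= b.h_start and a.m_end <= b.m_start
            if collinear and best[i] + b.score > best[j]:
                best[j] = best[i] + b.score
                prev[j] = i
    j = max(range(len(cand)), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(cand[j])
        j = prev[j]
    return chain[::-1]

blocks = [
    Block(0, 100, 1000, 1100, "+", 50),
    Block(150, 250, 1200, 1300, "+", 40),
    Block(120, 180, 5000, 5060, "+", 30),   # far away in mouse: breaks collinearity
    Block(300, 400, 1350, 1450, "+", 60),
    Block(260, 290, 200, 230, "-", 20),     # opposite orientation: separate chain
]
chain = best_chain(blocks)
print([(b.h_start, b.h_end) for b in chain])
```

The quadratic loop is adequate for a sketch; a genome-scale implementation would need a faster search structure and a gap cost.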

Page 8

Comparative genomics to find functional sequences

[Figure: genome sizes in million base pairs (Mbp) for the species compared (labels: Human, Mouse, Rat); values shown: 2,900; 2,400; 2,500; 1,200]

Find common sequences: blastZ, multiZ

All mammals: 1000 Mbp

Identify functional sequences: ~145 Mbp

Also birds: 72 Mb

Papers in Nature from mouse and rat and chicken genome consortia, 2002, 2004

Page 9

Regional variation in divergence rates

Page 10

Implications of co-variation in divergence

• Large regions (megabase sized) are changing relatively fast or slow for (almost) all types of divergence
  – Neutral substitution, insertions (except SINEs), deletion, recombination

• This is a consistent property of each region of genomic DNA
  – See similar patterns for orthologous regions on independent lineages to mouse, rat and human

• An aligned segment with a given similarity score in a fast-changing region is MORE significant than an aligned segment with the same similarity score in a slow-changing region.

• Must take the differential rate into account in searching for functional DNA = DNA under selection.

Page 11

Use measures of alignment quality to discriminate functional from nonfunctional DNA

• Compute a conservation score adjusted for the local neutral rate

• Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical
  – Subtract mean for aligned ancestral repeats in the surrounding region
  – Divide by standard deviation

p = fraction of aligned sites in R that are identical between human and mouse

μ = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region

n = number of aligned sites in R

Waterston et al., Nature
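The normalization described above can be written out as a short calculation. This is a minimal sketch, assuming the standard deviation is the binomial standard deviation of a proportion, sqrt(μ(1 − μ)/n), so that S = (p − μ) / sqrt(μ(1 − μ)/n). The function name `conservation_score` and the example numbers are illustrative and do not come from the slides.

```python
import math

def conservation_score(p, mu, n):
    """Conservation score S for one window.

    p  : fraction of aligned sites in the window that are identical
    mu : mean identity in aligned ancestral repeats of the surrounding region
    n  : number of aligned sites in the window
    Assumes a binomial standard deviation for the identity fraction.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    sd = math.sqrt(mu * (1.0 - mu) / n)
    return (p - mu) / sd

# The same 80% identity, scored against two different regional backgrounds.
fast_region = conservation_score(p=0.80, mu=0.60, n=50)  # neutral DNA diverges quickly here
slow_region = conservation_score(p=0.80, mu=0.75, n=50)  # neutral DNA diverges slowly here
print(round(fast_region, 2), round(slow_region, 2))
```

With these made-up inputs the window in the fast-changing region scores higher than the window in the slow-changing region, which is the point made on the preceding slide about regional rate variation.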

Page 12

Decomposition of conservation score into neutral and likely-selected portions

[Figure labels: Neutral DNA (ARs); All DNA; Likely selected DNA, at least 5-6%]

S is the conservation score adjusted for variation in the local substitution rate.
The frequency of the S score for all 50 bp windows in the human genome is shown.
From the distribution of S scores in ancestral repeats (mostly neutral DNA), one can compute a probability that a given alignment could result from the locally adjusted neutral rate.

Waterston et al., Nature

Page 13

Conservation score S in different types of regions

Red: Ancestral repeats (mostly neutral)
Blue: First class in label
Green: Second class in label

Page 14

phastCons: Likelihood of being constrained

Siepel et al. (2005) Genome Research 15:1034-1050

• Phylogenetic Hidden Markov Model

• Posterior probability that a site is among the 10% most highly conserved sites

• Allows for variation in rates along lineages

c is “conserved” (constrained)
n is “nonconserved” (aligns but is not clearly subject to purifying selection)
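The two-state idea (c versus n) can be illustrated with a plain hidden Markov model and the forward-backward algorithm. This is a simplified sketch, not phastCons itself: a phylogenetic HMM computes each column's likelihood from a tree model for each state, whereas here those per-column likelihoods are simply passed in as numbers. The function name `posterior_conserved`, the transition probabilities, the prior, and the example likelihoods are all assumptions chosen for illustration, and every likelihood is assumed to be positive.

```python
def posterior_conserved(lik_c, lik_n, stay_c=0.9, stay_n=0.95, prior_c=0.3):
    """Posterior probability of the conserved state at each alignment column.

    lik_c[i], lik_n[i]: likelihood of column i under the conserved and
    nonconserved models (supplied by the caller in this sketch).
    """
    if len(lik_c) != len(lik_n) or not lik_c:
        raise ValueError("need two equal-length, non-empty likelihood lists")
    emit = list(zip(lik_c, lik_n))        # emit[i] = (P(col | c), P(col | n))
    trans = ((stay_c, 1.0 - stay_c),      # from c to (c, n)
             (1.0 - stay_n, stay_n))      # from n to (c, n)
    size = len(emit)

    def normalize(pair):
        total = pair[0] + pair[1]
        return (pair[0] / total, pair[1] / total)

    # Forward pass, rescaled at every column to avoid underflow.
    fwd = [normalize((prior_c * emit[0][0], (1.0 - prior_c) * emit[0][1]))]
    for i in range(1, size):
        prev = fwd[-1]
        fwd.append(normalize(tuple(
            emit[i][k] * (prev[0] * trans[0][k] + prev[1] * trans[1][k])
            for k in range(2))))

    # Backward pass, also rescaled; the scale cancels in the final ratio.
    bwd = [(1.0, 1.0)] * size
    for i in range(size - 2, -1, -1):
        nxt = bwd[i + 1]
        bwd[i] = normalize(tuple(
            sum(trans[j][k] * emit[i + 1][k] * nxt[k] for k in range(2))
            for j in range(2)))

    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1])
            for f, b in zip(fwd, bwd)]

# Five columns that fit the conserved model, then five that fit the other.
lik_c = [0.9] * 5 + [0.2] * 5
lik_n = [0.3] * 5 + [0.6] * 5
post = posterior_conserved(lik_c, lik_n)
print([round(x, 2) for x in post])
```

Because the hidden states are "sticky", the posterior changes smoothly across the boundary rather than flipping column by column.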

Page 15

Larger genomes have more of the constrained DNA in noncoding regions

Siepel et al. 2005, Genome Research

Expected value if coverage by conserved elements is uniform

Page 16

Some constrained introns are editing complementary regions: GRIA2

Siepel et al. 2005, Genome Research

Page 17

3’ UTRs can be highly constrained over large distances

Siepel et al. 2005, Genome Research

3’ UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints

Page 18

Ultraconserved elements = UCEs

• At least 200 bp with no interspecies differences
  – Bejerano et al. (2004) Science 304:1321-1325
  – 481 UCEs with no changes among human, mouse and rat
  – Also conserved out to dog and chicken
  – More highly conserved than vast majority of coding regions

• Most do not code for protein
  – Only 111 out of 481 overlap with protein-coding exons
  – Some are developmental enhancers.
  – Nonexonic UCEs tend to cluster in introns or in vicinity of genes encoding transcription factors regulating development
  – 88 are more than 100 kb away from an annotated gene; may be distal enhancers
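The defining criterion (at least 200 bp with no differences among the aligned species) can be checked directly on a multiple alignment. This is a minimal sketch, assuming the alignment is given as equal-length strings with "-" for gaps; the function name `ultraconserved_runs` and the toy sequences are hypothetical, and the example uses a much smaller `min_len` than 200 so that it fits on a few lines.

```python
def ultraconserved_runs(aligned_seqs, min_len=200):
    """Return half-open (start, end) column intervals of perfect identity.

    A column counts only if every sequence has the same ungapped base
    (A, C, G or T). Gaps and ambiguous bases end a run, so the column
    length of a run equals its length in base pairs.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    runs, run_start = [], None
    for col in range(length):
        bases = {s[col].upper() for s in aligned_seqs}
        identical = len(bases) == 1 and bases <= set("ACGT")
        if identical:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    if run_start is not None and length - run_start >= min_len:
        runs.append((run_start, length))
    return runs

# Toy alignment: 12 identical columns, one mismatch, then 4 identical columns.
human = "ACGT" * 3 + "A" + "GGCC"
mouse = "ACGT" * 3 + "T" + "GGCC"
rat   = "ACGT" * 3 + "A" + "GGCC"
print(ultraconserved_runs([human, mouse, rat], min_len=10))
```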

Page 19

GO category analysis of UCE-associated genes

• Genes in which a coding exon overlaps a UCE
  – 91 Type I genes
  – RNA binding and modification
  – Transcriptional regulation

• Genes in the vicinity of a UCE (no overlap of coding exons)
  – 211 Type II genes
  – Transcriptional regulation
  – Developmental regulators

Bejerano et al. (2004) Science

Page 20

Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice

Pennacchio et al., http://enhancer.lbl.gov/

[Figure labels: UCEs; Tested UCEs]

Page 21

The most stringently conserved sequences in eukaryotes are mysteries

• Yeast MATa2 locus
  – Most conserved region in 4 species of yeast
  – 100% identity over 357 bp
  – Role is not clear

• Vertebrate UCEs
  – More constrained than exons in vertebrates
  – Noncoding UCEs are not detectable outside chordates, whereas coding regions are
    • Were they fast-evolving prior to vertebrate/invertebrate divergence?
    • Are they chordate innovations? Where did they come from?
  – Role of many is not clear; need for 100% identity over 200 bp is not obvious for any
    • What molecular process requires strict invariance for at least 200 nucleotides?
    • One possibility: Multiple, overlapping functions

Page 22

Going beyond stringent selection in noncoding sequence to find cis-regulatory modules

Page 23

Constraint in noncoding sequences

• Used to predict gene regulatory regions with some success

• Some sequences conserved between human and mouse show no apparent function
  – Is constraint revealing many false positives?

• Sequences regulating gene expression in restricted lineages are not constrained across mammals
  – Is pan-mammalian constraint missing many functional sequences?

Tree from Margulies et al. (2007) Genome Res.

Page 24

phastCons can find some but not all gene regulatory regions

[Figure tracks: LCR (HS1, HS2, HS3, HS4, HS5); phastCons]

The locus control region, or LCR, is the major distal enhancer for HBB and related, linked genes. It has 5 DNase hypersensitive sites covering about 20 kb.

Page 25

Two extremes of constraint in CRMs

CRMs = cis-regulatory modules.
DNA sequences needed in cis for regulation of expression, usually transcription
E.g. promoters, enhancers, silencers

Page 26

Coverage of human by alignments with other vertebrates ranges from 1% to 91%

[Figure: bar chart of the percent of human aligning with a second species, on a 0 to 100 scale, for Chimp, Mouse, Rat, Dog, Cow, Opossum, Platypus, Chicken, Frog, Zebrafish, Tetraodon, and Fugu. The accompanying tree is annotated with divergence times in millions of years (5.4, 91, 92, 173, 220, 310, 360, 450) and a 5% mark.]

Page 27

Distinctive divergence rates for different types of functional DNA sequences

pTRRs: putative transcriptional regulatory regions; likely CRMs

Sites identified as occupied by sequence-specific transcription factors based on high-throughput chromatin immunoprecipitation assayed by hybridization to high-density tiling arrays of genomic DNA (ChIP-chip)

Page 28

cis-Regulatory modules conserved beyond mammals

[Figure: tree annotated with divergence times in millions of years: 91, 173, 310, 450]

• Human-chicken alignments capture about 6% of pTRRs (likely CRMs)

• Human-fish alignments capture about 3% of pTRRs.

• The pan-vertebrate CRMs tend to regulate genes whose products control transcription and development

Page 29

cis-Regulatory modules conserved in eutherian mammals and marsupials

[Figure: tree annotated with divergence times in millions of years: 91, 173, 310, 450]

• Human-marsupial alignments capture about 32% of CRMs (pTRRs)
  – Tend to occur close to genes involved in aminoglycan synthesis, organelle biosynthesis

• Human-mouse alignments capture about 75% of CRMs (pTRRs)
  – Tend to occur close to genes involved in apoptosis, steroid hormone receptors, etc.

• Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.

Page 30

Interferon beta Enhancer-Promoter

Page 31

Expected properties of gene regulatory regions

• Can be almost anywhere
  – 5’ or 3’ to gene
  – Within introns
  – Close or far away

• Conserved between species (sometimes)
  – Examine interspecies alignments, noncoding regions
  – Evaluate likelihood of being under purifying selection, e.g. phastCons score
  – Some regulatory regions are deeply conserved, others are lineage-specific

• Enhancers and promoters: clusters of binding sites for transcription factors (TFBSs)
  – Resources and servers for finding TFBSs
  – TRANSFAC http://www.gene-regulation.com/
  – JASPAR http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl
  – TESS http://www.cbil.upenn.edu/cgi-bin/tess/tess
  – MOTIF (GenomeNet) http://motif.genome.jp/
  – MatInspector http://www.genomatix.de/

Page 32

Finding known motifs in a query sequence

MatInspector at http://www.genomatix.de/
K. Cartharius et al. (2005) MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 21:2933-2942.
Genomatix Software GmbH, Munchen, Germany

Query: a UCE in SOX6, 1356 bp

About 1 in 4 bp is the start of a TFBS match!

Page 33

Conservation of TFBSs between species

• Servers to find conserved matches to factor binding sites
  – Comparative genomics at Lawrence Livermore http://www.dcode.org/
    • zPicture and rVista
    • Mulan and multiTF
    • ECR browser
  – Consite http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite

• Conserved TFBSs are available for some assemblies of human genome at UCSC Genome Browser

Binding site for GATA-1
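Requiring a motif match at the same aligned position in two species is one simple way to cut down the many chance matches found in a single sequence. The sketch below is an illustration only and is not how the servers listed above work: it assumes a pairwise alignment given as two equal-length strings with "-" for gaps, uses the commonly quoted GATA-1 consensus WGATAR (W = A or T, R = A or G) together with its reverse complement, and skips any window that contains a gap. The function name `conserved_sites` and the toy sequences are hypothetical.

```python
import re

# WGATAR on the forward strand, or its reverse complement YTATCW.
GATA_SITE = re.compile(r"[AT]GATA[AG]|[CT]TATC[AT]")
SITE_LEN = 6

def conserved_sites(human_aln, mouse_aln):
    """Return (column, human_site, mouse_site) for windows where BOTH
    aligned sequences match the consensus at the same columns.

    Windows containing a gap never match, because "-" is not part of
    the pattern; sites that span alignment gaps are therefore missed.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for i in range(len(human_aln) - SITE_LEN + 1):
        h = human_aln[i:i + SITE_LEN].upper()
        m = mouse_aln[i:i + SITE_LEN].upper()
        if GATA_SITE.fullmatch(h) and GATA_SITE.fullmatch(m):
            hits.append((i, h, m))
    return hits

# Human has two consensus matches; only the first is also a match in mouse.
human_aln = "CCAGATAAGGTTGATAGCC"
mouse_aln = "CCTGATAAGG-TGATCGCC"
hits = conserved_sites(human_aln, mouse_aln)
print(hits)
```

Note that the conserved site need not be identical in the two species (AGATAA versus TGATAA here); it only has to satisfy the consensus in both.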

Page 34

Clusters of conserved TFBSs: PReMods

Blanchette et al. (2006) Genome Research

http://genomequebec.mcgill.ca/PReMod/

Page 35

ESPERR: Evolutionary and Sequence Pattern Extraction through Reduced Representation

Page 36

ESPERR: a different approach

• Don’t assume a database of known binding motifs

• Don’t assume strict conservation of the important sequence signals

• Instead, use alignments of validated examples to learn sequence and evolutionary patterns that characterize a class of elements

Page 37

Objective of ESPERR

Page 38

ESPERR overview

Page 39

Represent columns with ancestral distributions

Page 40

Group columns using evolutionary similarity and frequency distribution

Page 41

An agglomerative algorithm

Page 42

Searching for encodings

Page 43

Evaluate “merit” of candidate mappings

Page 44

Iterate until convergence

Page 45

Search convergence behavior

Page 46

Regulatory potential (RP) to distinguish functional classes

Page 47

Variable order Markov models for discrimination
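The discrimination idea can be illustrated with a log-odds score between two Markov models, one trained on known regulatory examples and one on a background set. This is a deliberately simplified sketch and not the ESPERR implementation: it uses a fixed-order model rather than a variable-order one, it works on raw nucleotide strings rather than a reduced alphabet of alignment columns, and the function names (`train_markov`, `log_odds`), the pseudocount, and the tiny training sets are all made up for illustration.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, alphabet, pseudocount=1.0):
    """Fit a fixed-order Markov model and return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def log_prob(context, symbol):
        # .get() avoids creating entries for unseen contexts or symbols.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return math.log((ctx.get(symbol, 0.0) + pseudocount) / total)

    return log_prob

def log_odds(seq, pos_model, neg_model, order):
    """Sum over positions of log P(symbol | context) under the positive
    model minus the same quantity under the negative model."""
    score = 0.0
    for i in range(order, len(seq)):
        context, symbol = seq[i - order:i], seq[i]
        score += pos_model(context, symbol) - neg_model(context, symbol)
    return score

ORDER = 2
ALPHABET = "ACGT"
positives = ["GATAGATAGATA", "AGATAAGATAAG"]   # toy "regulatory" training set
negatives = ["ACGTACGTACGT", "CCGGTTAACCGG"]   # toy background training set
pos_model = train_markov(positives, ORDER, ALPHABET)
neg_model = train_markov(negatives, ORDER, ALPHABET)

print(log_odds("GATAGATA", pos_model, neg_model, ORDER))  # resembles the positives
print(log_odds("ACGTACGT", pos_model, neg_model, ORDER))  # resembles the background
```

A positive score means the sequence looks more like the positive training set than the background; the sign and magnitude depend entirely on the training data supplied.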

Page 48

Use ESPERR to compute Regulatory Potential

Page 49

Good performance of ESPERR for gene regulatory regions (RP)

Page 50

Experimental tests of predicted cis-regulatory modules

Page 51

GATA-1 is required for erythroid maturation

[Figure: human hematopoiesis diagram. Aria Rad, 2007 http://commons.wikimedia.org/wiki/Image:Hematopoiesis_(human)_diagram.png. Labeled cell types include hematopoietic stem cell, common myeloid progenitor, common lymphoid progenitor, MEP, myeloblast, basophil, neutrophil, eosinophil, and monocyte/macrophage. GATA-1, G1E cells, and G1E-ER4 cells are marked on the diagram.]

Page 52

Genes Co-expressed in Late Erythroid Maturation

G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. GATA-1 function can be rescued by expressing an estrogen-responsive form of GATA-1.
Rylski et al., Mol Cell Biol. 2003

Page 53

Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes

Page 54

preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids

Page 55

preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome

Page 56

Examples of validated preCRMs

Page 57

Correlation of Enhancer Activity with RP Score

Page 58

Validation status for 99 tested fragments

Page 59

preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated

Page 60

Conclusions

• Particular types of functional DNA sequences are conserved over distinctive evolutionary distances.

• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).

• Patterns in alignments and conservation of some TFBSs can be used to predict some cis-regulatory elements.

• The predictions of cis-regulatory elements for erythroid genes are validated at a good rate.

• Databases and servers such as the UCSC Table Browser, Galaxy, and others provide access to these data.
  – http://genome.ucsc.edu/
  – http://www.bx.psu.edu/

Page 61

Many thanks …

Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King

PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko

Alignments, chains, nets, browsers, ideas, … Webb Miller, Jim Kent, David Haussler

RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski

Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU