Gene Regulatory Elements Discovered by Vertebrate Genome Comparisons
Laboratory Heads
Penn State University: Center for Comparative Genomics and Bioinformatics
Webb Miller, Francesca Chiaromonte
Ross Hardison, Anton Nekrutenko
University of California at Santa Cruz:
David Haussler (HHMI), Jim Kent
Institute for Systems Biology
Arian Smit
University of Pennsylvania, Children’s Hospital of Philadelphia
Mitchell Weiss
Consortia for sequencing and analysis of: Mouse, Rat, Chicken
DNA sequences of mammalian genomes
• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome is under purifying selection since the rodent-primate divergence
• About 1.5% codes for protein
• The 4.5% of the human genome that is under selection but does not code for protein should have:
– Regulatory sequences
– Non-protein coding genes
– Other important sequences
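The percentages on this slide can be turned into approximate base-pair counts with simple arithmetic. The sketch below assumes the 2.9 billion bp genome size quoted above and treats each percentage as a round figure; the function name is illustrative only.

```python
HUMAN_GENOME_BP = 2_900_000_000  # "2.9 billion bp" from the slide

def fraction_to_mbp(percent, genome_bp=HUMAN_GENOME_BP):
    """Convert a percentage of the genome into million base pairs (Mbp)."""
    return genome_bp * percent / 100 / 1_000_000

# Percentages quoted on the slide (5 is the lower bound of "5-6%").
for label, pct in [("aligns with mouse", 40),
                   ("under purifying selection, lower bound", 5),
                   ("codes for protein", 1.5),
                   ("under selection but noncoding", 4.5)]:
    print(f"{label}: about {fraction_to_mbp(pct):.1f} Mbp")
```

The 5% lower bound corresponds to about 145 Mbp of sequence under selection.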
Alignment of vertebrate genomes
• blastZ for pairwise alignments
• multiZ for multiple alignment
– Human, chimp, mouse, rat, chicken, dog
– Organize local alignments: chains and nets
• All against all comparisons
– High sensitivity and specificity
• Computer cluster at UC Santa Cruz
– 1024 CPUs, Pentium III
– Job takes about half a day
• Results available at
– UCSC Genome Browser: http://genome.ucsc.edu
– GALA database: http://www.bx.psu.edu
Scott Schwartz, Webb Miller, David Haussler, Jim Kent
Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research
Genome-wide local alignment chains
[Figure labels: Net, Human]
blastZ: Each segment of human is given the opportunity to align with all mouse sequences.
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.
Run blastZ in parallel for all human segments. Collect all local alignments above threshold.
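The segmentation step can be sketched as follows. This is only an illustration of cutting an assembly into pieces of at most 10 Mb so that each piece can be aligned as a separate job; the function name and the toy chromosome sizes are hypothetical, and no real blastZ invocation is shown.

```python
def split_into_segments(chrom_sizes, segment_len=10_000_000):
    """Yield (chrom, start, end) half-open intervals of at most segment_len bp."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, segment_len):
            yield chrom, start, min(start + segment_len, size)

# Hypothetical toy chromosome sizes, not real assembly values.
toy_sizes = {"chrA": 25_000_000, "chrB": 10_000_000, "chrC": 3_000_000}
segments = list(split_into_segments(toy_sizes))

for seg in segments:
    print(seg)  # each tuple would become one independent alignment job
```

Because every segment is independent, the jobs can be distributed across cluster nodes in any order and the local alignments collected afterwards.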
Organize local alignments into a set of chains based on position in assembly and orientation.
[Figure labels: Level 1 chain, Level 2 chain, Mouse]
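A heavily simplified view of chaining: among local alignments in the same orientation, pick the highest-scoring subset whose coordinates increase in both genomes. The sketch below uses a quadratic dynamic program with no gap penalties, which is an assumption made for brevity; the `Block` record, its field names, and the toy data are hypothetical, and minus-strand mouse coordinates are assumed to be already expressed on the reverse strand.

```python
from collections import namedtuple

# Hypothetical record for one local alignment (half-open coordinates).
Block = namedtuple("Block", "h_start h_end m_start m_end strand score")

def best_chain(blocks):
    """Return (chain, score) for the highest-scoring colinear chain."""
    best, best_score = [], 0
    for strand in sorted({b.strand for b in blocks}):
        same = sorted((b for b in blocks if b.strand == strand),
                      key=lambda b: (b.h_start, b.m_start))
        score = [b.score for b in same]
        prev = [None] * len(same)
        for j, b in enumerate(same):
            for i in range(j):
                a = same[i]
                # a may precede b only if it ends before b in both genomes
                if a.h_end <= b.h_start and a.m_end <= b.m_start:
                    if score[i] + b.score > score[j]:
                        score[j] = score[i] + b.score
                        prev[j] = i
        j = max(range(len(same)), key=score.__getitem__)
        if score[j] > best_score:
            best_score = score[j]
            chain = []
            while j is not None:
                chain.append(same[j])
                j = prev[j]
            best = chain[::-1]
    return best, best_score

toy_blocks = [
    Block(0, 100, 1000, 1100, "+", 50),
    Block(150, 250, 1200, 1300, "+", 40),
    Block(300, 400, 500, 600, "+", 30),    # out of order in mouse
    Block(450, 500, 1400, 1450, "+", 20),
    Block(120, 140, 5000, 5020, "-", 35),  # opposite orientation
]
chain, total = best_chain(toy_blocks)
print(total, [(b.h_start, b.h_end) for b in chain])
```

Blocks left out of the best chain (here the out-of-order and opposite-strand blocks) are the kind of material that would fall into lower-level chains.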
Comparative genomics to find functional sequences
[Figure: genome sizes in million base pairs (Mbp) for Human, Mouse and Rat; values shown: 2,900, 2,400, 2,500, 1,200]
Find common sequences: all mammals, 1000 Mbp; also birds, 72 Mb
Identify functional sequences: ~145 Mbp
Papers in Nature from rat and chicken genome consortia, 2004
Conservation by type of function
For several reference sets of known functional human DNA segments, what fraction aligns?
[Figure: fraction aligning in human-mouse-rat and human-chicken comparisons]
Chicken Genome Sequencing Consortium, 2004, submitted
Score alignments for level of conservation
• Multiple alignment scores (Margulies et al., 2003)
• PhyloHMMcons - PhastCons (Siepel and Haussler, 2003; Siepel et al., 2005)
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the 10% most highly conserved sites
– Allows for variation in rates and autocorrelation in rates
Example of PhastCons output: UCSC Genome Browser
Available at http://genome.ucsc.edu/
Other ways to use alignments to find functional sequences
• Score alignments by frequency of matches to patterns distinctive for CRMs
– Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
• Factor binding sites conserved in human, mouse and rat
– tffind (from M. Weirauch, Schwartz et al., 2003)
Evaluate patterns in alignments to discriminate functional classes of DNA
1. Collapse the alignment to a small alphabet, e.g.
   Match involving G or C = S
   Match involving A or T = W
   Transition = I
   Transversion = V
   Gap = G
   Alignment:
   seq1               G T A C C T A C T A C G C A
   seq2               G T G T C G - - A G C C C A
   Collapsed alphabet S W I I S V G G V I S V S W
2. Is a pattern, e.g., SWIIS followed by V, found more frequently in alignments of known cis-regulatory modules (set of 93) or neutral DNA (200 ancestral repeats)?
3. The regulatory potential for any alignment measures the extent to which its patterns are more like those in regulatory regions than in neutral DNA.
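Step 1 can be illustrated with a short sketch that collapses a pairwise alignment into the five-symbol alphabet defined above (S, W, I, V, G). The function names are illustrative, and the sketch assumes the input contains only A, C, G, T and the gap character "-".

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def collapse_column(a, b):
    """Map one alignment column to S, W, I, V or G."""
    a, b = a.upper(), b.upper()
    if a == "-" or b == "-":
        return "G"                        # gap
    if a == b:
        return "S" if a in "GC" else "W"  # match on G/C vs. match on A/T
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "I" if same_class else "V"     # transition vs. transversion

def collapse_alignment(seq1, seq2):
    """Collapse two equal-length gapped sequences column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(collapse_column(a, b) for a, b in zip(seq1, seq2))

# The example alignment from the slide, with spaces removed.
seq1 = "GTACCTACTACGCA"
seq2 = "GTGTCG--AGCCCA"
print(collapse_alignment(seq1, seq2))  # SWIISVGGVISVSW
```

The output reproduces the collapsed string shown on the slide.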
(5/10) / (1/6) = 3
(1/4) / (2/8) = 1
(1/4) / (3/6) = 0.5
… A A G C C C G - A T A A C G G G C G C G C C C C C T T T A T A T A C C C …
… T A G C C G G A A T A A C G G G G C G C G C C C C T T T A T A T A C A C …
Collapsed symbols s_1 s_2 s_3 s_4 s_5 s_6 s_7 … s_W within a sliding window

$$RP = \sum_{t=1}^{W} \log\left(\frac{p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T})}{p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T})}\right)$$

W: sliding window; T: order; $s \in A$ (collapsed alphabet)

Order T Markov Model on A: transition probabilities estimated on REG training data

$$p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{REG}(s_{t-T} \ldots s_{t-1} s_t)}{f_{REG}(s_{t-T} \ldots s_{t-1})}$$

Order T Markov Model on A: transition probabilities estimated on AR training data

$$p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{AR}(s_{t-T} \ldots s_{t-1} s_t)}{f_{AR}(s_{t-T} \ldots s_{t-1})}$$
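The log-odds score can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the pseudocount smoothing, the use of natural logarithms, and the choice to skip the first T positions of a window (which lack a full context) are all assumptions, and the training strings are hypothetical toy data.

```python
import math
from collections import Counter

ALPHABET = "SWIVG"  # collapsed alphabet from the slides

def train_markov(sequences, order, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate p(s_t | previous `order` symbols) from training strings."""
    context_counts, kmer_counts = Counter(), Counter()
    for seq in sequences:
        for t in range(order, len(seq)):
            context = seq[t - order:t]
            kmer_counts[context + seq[t]] += 1
            context_counts[context] += 1

    def prob(symbol, context):
        # Pseudocounts (an assumption) keep unseen patterns from giving zero.
        num = kmer_counts[context + symbol] + pseudocount
        den = context_counts[context] + pseudocount * len(alphabet)
        return num / den

    return prob

def rp_score(window, p_reg, p_ar, order):
    """Sum of log-odds over window positions that have a full context."""
    total = 0.0
    for t in range(order, len(window)):
        context = window[t - order:t]
        total += math.log(p_reg(window[t], context) / p_ar(window[t], context))
    return total

# Hypothetical toy training data standing in for REG and AR alignments.
p_reg = train_markov(["SWIISV" * 5], order=2)
p_ar = train_markov(["WWVVGG" * 5], order=2)

print(rp_score("SWIISV", p_reg, p_ar, order=2))  # positive: REG-like
print(rp_score("WWVVGG", p_reg, p_ar, order=2))  # negative: AR-like
```

A positive score means the window's patterns are more probable under the model trained on regulatory regions than under the model trained on ancestral repeats.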
Regulatory Potential Score
[Figure panels: Training, Genome-wide computation]
RP scores have good discriminatory power
Kolbe et al., 2004, Genome Research
Alignment-based scores can find some but not all known CRMs in the HBB complex
King et al., submitted
RP has better performance than phastCons or MCS
King et al., submitted
Other CRMs are easier to identify than those in the HBB complex
King et al., submitted
Binding sites conserved between species
• tffind: Identify high-quality matches to a weight matrix in one sequence (e.g. human) that also aligns with other sequences (e.g. mouse and rat)
• Look for matches to the weight matrix in the 2nd and 3rd sequences, within the part of the alignment that aligns to the weight-matrix match in the first species
• GALA records these matches
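The idea of a conserved binding site can be sketched as follows. This is not tffind itself: the weight matrix values are hypothetical, the species rows are toy data, and the sketch only considers gap-free alignment windows, which is a simplification.

```python
# Hypothetical 4-column weight matrix favoring G-A-T-A; values are illustrative.
PWM = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]

def pwm_score(site, pwm=PWM):
    """Score an ungapped site; a gap or unknown base disqualifies it."""
    if len(site) != len(pwm):
        return float("-inf")
    if any(base not in col for base, col in zip(site, pwm)):
        return float("-inf")
    return sum(col[base] for base, col in zip(site, pwm))

def conserved_matches(alignment, reference, threshold, pwm=PWM):
    """Return alignment offsets where every species scores >= threshold."""
    width = len(pwm)
    ref = alignment[reference]
    hits = []
    for i in range(len(ref) - width + 1):
        if pwm_score(ref[i:i + width], pwm) < threshold:
            continue  # no match in the reference species
        if all(pwm_score(seq[i:i + width], pwm) >= threshold
               for seq in alignment.values()):
            hits.append(i)
    return hits

# Toy three-species alignment (rows have equal length, "-" marks a gap).
toy = {
    "human": "CCGATAAGG-TGATACC",
    "mouse": "CCGATAAGGATGCTACC",
    "rat":   "CTGATAGGG-TGATACC",
}
print(conserved_matches(toy, "human", threshold=8.0))  # [2]
```

The second human match at offset 11 is not reported because the aligned mouse sequence there does not reach the threshold.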
[Figure labels: HMR, not]
Matt Weirauch
Genes co-expressed in late erythroid maturation
• Two somatic cell model systems:
– Murine erythroleukemia (MEL) cells: mature into late erythroblasts when induced with small organic compounds.
– G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1. Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1.
• Use microarray analysis of each to find genes that increase or decrease expression upon induction. Many of the genes respond similarly in the two systems.
Walsh et al. (2004) Blood; Image from k-means cluster, GEO:
Genes whose expression increases during maturation, confirmed by RT-PCR
Predicting cis-regulatory modules (preCRMs)
Identify a genomic region with a regulated gene.
Find all intervals whose RP score exceeds an empirical threshold.
Subtract exons.
Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS).
Intervals with RP scores above the threshold and with a cGATA-1_BS within 50 bp are preCRMs.
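The prediction steps amount to a handful of interval operations. The sketch below assumes per-base RP scores in a list, half-open coordinates, and binding sites given as single positions; "within 50 bp" is interpreted as inside the interval or within 50 bp of either edge. The threshold, exon coordinates, and site positions are hypothetical.

```python
def intervals_above(scores, threshold):
    """Return half-open [start, end) runs where scores[i] > threshold."""
    runs, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))
    return runs

def subtract(intervals, exons):
    """Remove exon coverage from each interval."""
    result = []
    for start, end in intervals:
        pieces = [(start, end)]
        for ex_start, ex_end in exons:
            next_pieces = []
            for s, e in pieces:
                if ex_end <= s or ex_start >= e:  # no overlap
                    next_pieces.append((s, e))
                    continue
                if s < ex_start:
                    next_pieces.append((s, ex_start))
                if ex_end < e:
                    next_pieces.append((ex_end, e))
            pieces = next_pieces
        result.extend(pieces)
    return sorted(result)

def near_site(interval, sites, max_dist=50):
    """True if any site lies in the interval or within max_dist of it."""
    start, end = interval
    return any(start - max_dist <= pos < end + max_dist for pos in sites)

def predict_precrms(scores, threshold, exons, sites, max_dist=50):
    candidates = subtract(intervals_above(scores, threshold), exons)
    return [iv for iv in candidates if near_site(iv, sites, max_dist)]

# Hypothetical toy locus: two high-scoring runs, one exon, one conserved site.
toy_scores = [0] * 100 + [3] * 200 + [0] * 100 + [3] * 100 + [0] * 500
precrms = predict_precrms(toy_scores, threshold=2.0,
                          exons=[(150, 200)], sites=[320])
print(precrms)  # [(200, 300)]
```

Only the interval adjacent to the conserved site survives; the other high-scoring pieces are dropped because no conserved site lies within 50 bp.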
Test predicted cis-regulatory modules (preCRMs)
• Amplify the preCRMs and test them by
– (1) Enhancement in transient transfections of erythroid cells
– (2) Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
– (3) Chromatin immunoprecipitation (ChIP) for GATA-1
Transient transfection assay for enhancers
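Dual luciferase assay, K562 cells
[Figure: reporter constructs. Test: test fragment, HBG promoter, FF luciferase, with tk promoter, Ren luciferase. Compare to: HBG promoter, FF luciferase, with tk promoter, Ren luciferase.]
[Figure: bar chart, axis 0 to 14; labels MCS, HS2, Alas2 pCRM1, I, II; positive control 30-fold]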
Negative controls do not enhance transient expression
[Figure: bar chart of fold change (axis 0 to 7) for parentLuc, Fog1N1, Fog1N2, Hipk2N2, Gata2N2, Alas2N1, HS2N1, HS2N2, Alas2N2, Vav2N1, Vav2N2, CdmN1, Coro2aN1, Gata2r.2N1]
Negative controls are segments of mouse DNA that align with rat and human but have low RP scores and do not have a match to a GATA-1 binding site. They have almost no effect on the level of expression of the reporter gene in erythroid cells.
7 of 24 Zfpm1 preCRMs enhance transient expression
Site-directed recombination to stably integrate expression cassettes
Recombinase-mediated cassette exchange, Bouhassira et al.
9 of 24 Zfpm1 preCRMs enhance after stable integration at RL5
PreCRMs in Fog1 bind GATA-1 in vivo
Chromatin immunoprecipitation assay
13 of 24 Zfpm1 preCRMs are validated in at least one assay
[Figure legend: Validated, Not validated]
Validation of preCRM in Alas2
All preCRMs in Gata2 are functional in at least one assay
ChIP data are from publications from E. Bresnick’s lab.
Frequent validation at 3 other loci
Infrequent validation of preCRMs in Hipk2
Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs                           5                 5           100
Transient transfections               64                18            28
Site-directed integrants              54                24            44
All assays                            64                33            52
About half of the preCRMs are validated as functional
Omitting Hipk2, validation rate increases to 67%
Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs                           5                 5           100
Transient transfections               45                17            38
Site-directed integrants              43                23            53
All assays                            45                30            67
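The validation percentages follow directly from the tested and positive counts. A small sketch that recomputes them from the numbers reported above; the dictionary layout and names are illustrative.

```python
# (number tested, number positive), copied from the two tables above
all_loci = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (64, 18),
    "Site-directed integrants": (54, 24),
    "All assays": (64, 33),
}
omitting_hipk2 = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (45, 17),
    "Site-directed integrants": (43, 23),
    "All assays": (45, 30),
}

def percent_validated(tested, positive):
    """Percentage of tested preCRMs that were positive, rounded to an integer."""
    return round(100 * positive / tested)

for title, table in [("All loci", all_loci), ("Omitting Hipk2", omitting_hipk2)]:
    print(title)
    for assay, (tested, positive) in table.items():
        print(f"  {assay}: {percent_validated(tested, positive)}%")
```

The recomputed values agree with the percentages in both tables (52% overall, 67% when Hipk2 is omitted).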
                          N   Mean %G+C   StDev
preCRMs not validated    31       50.53    8.54
preCRMs validated        32       54.87    6.46
Difference = mu (“false positive”) - mu (verified) = -4.35 %G+C
t-Test of difference = 0 (vs not =): T-Value = -2.27, P-Value = 0.027, DF = 55
%G+C is higher in validated preCRMs
                          N   Mean RP score   StDev   Mean phastCons   StDev
preCRMs not validated    31           2.020   0.381            0.511   0.229
preCRMs validated        32           2.232   0.456            0.571   0.210
Difference                           -0.212                   -0.061
t-Test of difference = 0 (vs not =):
  RP score:  t-value = -2.01, p-value = 0.049, df = 59
  phastCons: t-value = -1.10, p-value = 0.277, df = 60
Average scores for RP are significantly higher in validated preCRMs
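The t-values and degrees of freedom can be recomputed from the summary statistics. The reported degrees of freedom (55, 59 and 60, rather than 61) are consistent with an unequal-variance (Welch) t-test, so that is what this sketch assumes; the slides do not name the test. P-values are omitted to keep the sketch free of external libraries.

```python
import math

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, df) for Welch's unequal-variance t-test from summary stats."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics copied from the slides: not validated vs. validated.
gc_t, gc_df = welch_t(31, 50.53, 8.54, 32, 54.87, 6.46)
rp_t, rp_df = welch_t(31, 2.020, 0.381, 32, 2.232, 0.456)
pc_t, pc_df = welch_t(31, 0.511, 0.229, 32, 0.571, 0.210)

for name, t, df in [("%G+C", gc_t, gc_df),
                    ("RP score", rp_t, rp_df),
                    ("phastCons", pc_t, pc_df)]:
    print(f"{name}: t = {t:.2f}, df (truncated) = {int(df)}")
```

Small differences from the slide values are expected because the inputs here are the rounded means and standard deviations rather than the raw data.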
Lab Folks
Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King
GALA: database of Genome ALignments and Annotation http://www.bx.psu.edu/
• Database for human, chimp, mouse, rat, and chicken genomes
• Whole-genome sequence alignments
– 16 million alignments for human-mouse-rat
– Probabilities of sequences being under selection (200 million)
– Goodness of fit to models of alignments in known regulatory regions (RP-scores) (200 million)
• Annotations
– Known and predicted genes (39,000)
– Microarray data from GNF (14,000 genes, multiple tissues)
– Transcription factor binding sites (190 million)
– Conserved factor binding sites (4 million, HMR)
• Integrate information
• Simple or complex queries
Yi Zhang
Cathy Riemer
Belinda Giardine
Galaxy metaserver and data sources
www.bx.psu.edu
Galaxy Portal page
UCSC Bioinformatics Table Browser
Galaxy History Page
Operations: Intersection, Clustering
Output to UCSC Genome Browser
Conclusions
• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).
• Alignments can be used to predict certain functional regions, such as coding exons and some cis-regulatory elements.
• The predictions of cis-regulatory elements for erythroid genes have a good validation rate.
• Databases such as the UCSC Table Browser, GALA and Galaxy provide access to these data.