Intro to Comp Genomics
Lecture 9: Motif finding
Sequence specific transcription factors
• Sequence-specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery
• TFs include a DNA-binding domain that specifically recognizes “regulatory elements” in the genome.
• The TF-DNA duplex is then used to target larger transcriptional structures to the genomic locus.
Sequence specificity is represented using consensus sequences or weight matrices
• The specificity of the TF binding is central to the understanding of the regulatory relations it can form.
• We are therefore interested in defining the DNA motifs that can be recognized by each TF.
• A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing sets of nucleotides, for example W=[A|T], S=[C|G]).
• A more flexible representation is using weight matrices (PWM/PSSM):
• PWMs are frequently plotted using motif logos, in which the height of each character corresponds to its probability, scaled by the information content of the position (which decreases as the position entropy grows)
Aligned sites: ACGCGT, ACGCGA, ACGCAT, TCGCGA, TAGCGT

Position    1      2      3      4      5      6
A           60%    20%    0      0      20%    40%
C           0      80%    0      100%   0      0
G           0      0      100%   0      80%    0
T           40%    0      0      0      0      60%
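The frequency table for the five aligned sites can be reproduced by simple per-position counting. The following is a minimal sketch; the function name `pwm_from_sites` and the optional `pseudocount` argument are illustrative choices, not part of the lecture.

```python
from collections import Counter

ALPHABET = "ACGT"

def pwm_from_sites(sites, pseudocount=0.0):
    """Return one {nucleotide: probability} dict per motif position."""
    k = len(sites[0])
    if any(len(site) != k for site in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(k):
        # Count the nucleotides observed at position i ("voting").
        counts = Counter(site[i] for site in sites)
        pwm.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return pwm

sites = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
pwm = pwm_from_sites(sites)
```

With `pseudocount=0.0` the result matches the table exactly; a small positive pseudocount avoids zero probabilities, which matters once logarithms are taken.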
TF binding energy is approximated by weight matrices
Leu3 data (Liu and Clarke, JMB 2002)
We can interpret weight matrices as energy functions:
\[ w_i[s_i] = \log(p_i[s_i]) \]
\[ E(s) = \sum_i w_i[s_i] \]
This linear approximation is reasonable for most TFs.
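As a rough illustration of the linear energy function, the sketch below sums the log probabilities along a k-mer and scans a sequence for its best-scoring window. The toy PWM, the helper names `pwm_energy` and `best_site`, and the example sequence are all invented for illustration; the probabilities are smoothed so that no logarithm of zero occurs.

```python
import math

def pwm_energy(pwm, kmer):
    """E(s) = sum_i log p_i[s_i] for a k-mer of the same length as the PWM."""
    if len(kmer) != len(pwm):
        raise ValueError("k-mer length must match the PWM length")
    return sum(math.log(col[c]) for col, c in zip(pwm, kmer))

def best_site(pwm, seq):
    """Return (energy, offset) of the highest-scoring window in seq."""
    k = len(pwm)
    return max((pwm_energy(pwm, seq[l:l + k]), l) for l in range(len(seq) - k + 1))

# Hypothetical 3-position PWM with no zero entries.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
energy, offset = best_site(toy_pwm, "TTACGTT")
```

Because the score is a sum over positions, each position contributes independently, which is exactly the linear approximation the slide refers to.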
TF binding affinity is kinetically important, with possible functional implications
Kalir et al. Science 2001
[Figure: Ume6. Axis labels: ChIP ranges; Average PWM energy (ticks at 5.5 and 11.5). Annotations: stronger binding; stronger prediction.]
Tanay. Genome Res 2006
TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications
[Figure panels: Re TSS; Re ATG]
Lee et al. Nat Gen 2007
TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications
Barski et al. Cell 2007
[Figure labels: Active; Inactive]
TFBSs are clustered in promoters or in “sequence modules”
• The distribution of binding sites in the genome is non-uniform
• In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS
• In larger genomes (fly) we observe CRMs (cis-regulatory modules), which are frequently away from the TSS. These represent enhancers.
• A single binding site, without the context of other co-sites, is unlikely to represent a functional locus
Constructing a weight matrix from aligned TFBSs is trivial
• This is done by counting (or “voting”)
• Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from a set of curated and validated binding sites
• Validated site: usually using “promoter bashing”, i.e. testing reporter constructs with and without the putative site
Transfac 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers
However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear
Probabilistic interpretation of weight matrices and a generative model
• One can think of a weight matrix as a probabilistic model for binding sites:
\[ P(m) = \prod_{i=1}^{k} P_i(m[i]) \]
• This is the site-independent model, defining a probability space over k-mers
• Given a set of aligned k-mers, we know that the ML motif model is derived by voting (a set of independent multinomial variables, like the dice case)
• Now assume we are given a set of sequences that are supposed to include binding sites (one for each), but that we don’t know where the binding sites are.
• In other words, the position of the binding site is a hidden variable l.
• We introduce a background model Pb that describes the sequence outside of the binding site (usually a d-order Markov model):
\[ P_b(s) = \prod_{i=1}^{|s|} P_{back}(s[i] \mid s[i-d..i-1]) \]
• Given complete data we can write down the likelihood of a sequence s as:
\[ P(s, l \mid \theta) = P_b(s) \prod_{i=1}^{k} \frac{P_i(s[l+i])}{P_{back}(s[l+i] \mid s[l+i-d..l+i-1])} \]
• Inference of the binding site location posterior:
\[ P(l \mid s, \theta) = \frac{P(s, l \mid \theta)}{\sum_{l'} P(s, l' \mid \theta)} \]
• Note that only k factors should be computed for each location (Pb(s) is constant)
Using EM to discover PWMs de-novo
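• Starting with an initial motif model, we can apply a standard EM (here \theta^{t-1} denotes the model from the previous iteration):
E: compute the binding site location posterior for each sequence s_j
\[ P(l \mid s_j, \theta^{t-1}) = \frac{P(s_j, l \mid \theta^{t-1})}{\sum_{l'} P(s_j, l' \mid \theta^{t-1})} \]
M: re-estimate each PWM position from the expected nucleotide counts
\[ P_i(c) \propto \sum_{j} \sum_{l=0}^{|s_j|} P(l \mid s_j, \theta^{t-1}) \, \mathbf{1}(s_j[l+i] = c) \]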
• As always with the EM, initializing to a reasonable PWM is critical
Following Bailey and Elkan, MEME 1995
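The EM iteration can be sketched in a few lines under simplifying assumptions: exactly one site per sequence, a zero-order uniform background instead of a d-order Markov model, and a small pseudocount in the M-step. The function names, the toy sequences and the seed PWM are all illustrative and are not taken from MEME.

```python
import math

ALPHABET = "ACGT"

def site_posterior(pwm, seq, background):
    """P(l | s, theta): only the k motif/background ratios matter per offset."""
    k = len(pwm)
    ratios = [
        math.prod(pwm[i][seq[l + i]] / background[seq[l + i]] for i in range(k))
        for l in range(len(seq) - k + 1)
    ]
    total = sum(ratios)
    return [r / total for r in ratios]

def em_step(pwm, seqs, background, pseudocount=0.1):
    """One EM iteration of the one-site-per-sequence model."""
    k = len(pwm)
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(k)]
    for seq in seqs:
        post = site_posterior(pwm, seq, background)   # E-step
        for l, p in enumerate(post):
            for i in range(k):
                counts[i][seq[l + i]] += p            # expected counts
    # M-step: normalize the expected counts at each position.
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def run_em(pwm, seqs, background, iterations=20):
    for _ in range(iterations):
        pwm = em_step(pwm, seqs, background)
    return pwm

# Toy data: each sequence carries one copy of "ACGT" (an invented motif).
seqs = ["TTACGTTT", "GACGTGGG", "CCCACGTC"]
background = {c: 0.25 for c in ALPHABET}
seed = [{c: (0.7 if c == m else 0.1) for c in ALPHABET} for m in "ACGT"]
learned = run_em(seed, seqs, background)
consensus = "".join(max(col, key=col.get) for col in learned)
```

The seed here is deliberately close to the planted motif; with a poor initialization the same code can converge to an uninformative local optimum, which is the point of the remark on initialization.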
Allowing false positive sequences
• If we assume some of the sequences may lack a binding site, this should be incorporated into the model:
\[ P(s, l \mid \theta) = P(hit) \cdot P_b(s) \prod_{i=1}^{k} \frac{P_i(s[l+i])}{P_{back}(s[l+i] \mid s[l+i-d..l+i-1])} \]
[Figure: graphical model with nodes hit, l and s]
• This is sometimes called the ZOOPS model (zero or one occurrence per sequence)
• In Bayesian terms:
– Probability of sequence hit: P(hit | S)
– Probability of hit at position l: Pr(l | S)
• We can consider the PWM parameters as variables in the model
• Learning the parameters is then equivalent to inference
Using Gibbs sampling to discover PWMs de-novo
[Figure: graphical models with nodes hit, l and s, repeated for several sequences]
• We can use Gibbs sampling to sample the hidden sites and estimate the PWM
• This is done by estimating the PWM from all locations except for the one we sample, and computing the hit probabilities as shown before
• Note that we are working with the MAP (maximum a posteriori) estimate to do the sampling:
\[ \theta_{MAP} = \arg\max_{\theta} L(l_1, .., l_{j-1}, l_{j+1}, .., l_n \mid \theta, S), \qquad l_j \sim P(l_j \mid \theta_{MAP}) \]
• But this can be shown to approximate:
\[ P(l_j \mid l_1, .., l_{j-1}, l_{j+1}, .., l_n, S) \]
Gibbs: Lawrence et al. Science 1993
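A minimal sketch of such a sampler is given below, assuming one site per sequence, a uniform zero-order background and a pseudocount when voting the PWM. These choices, the function names and the fixed random seed are illustrative assumptions rather than the procedure of Lawrence et al.

```python
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, sites, k, skip, pseudocount=0.5):
    """Vote a PWM from the current sites of all sequences except `skip`."""
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(k)]
    for j, (seq, l) in enumerate(zip(seqs, sites)):
        if j == skip:
            continue
        for i in range(k):
            counts[i][seq[l + i]] += 1
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def site_weights(pwm, seq, background):
    """Unnormalized P(l | s, theta) for every offset l (motif/background ratio)."""
    k = len(pwm)
    weights = []
    for l in range(len(seq) - k + 1):
        w = 1.0
        for i in range(k):
            w *= pwm[i][seq[l + i]] / background[seq[l + i]]
        weights.append(w)
    return weights

def gibbs_sample(seqs, k, sweeps=50, seed=0):
    rng = random.Random(seed)
    background = {c: 0.25 for c in ALPHABET}   # assumed uniform background
    sites = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(sweeps):
        for j, seq in enumerate(seqs):
            # Re-estimate the PWM without sequence j, then resample its site.
            pwm = estimate_pwm(seqs, sites, k, skip=j)
            weights = site_weights(pwm, seq, background)
            sites[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return sites, estimate_pwm(seqs, sites, k, skip=None)

toy_seqs = ["TTACGTTT", "GACGTGGG", "CCCACGTC", "ACGTAAAA"]
sampled_sites, sampled_pwm = gibbs_sample(toy_seqs, k=4)
```

Since the sites are sampled rather than maximized, different seeds can return different site sets; in practice one would run several chains and keep the highest-scoring configuration.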
Generalizing PWMs to allow site dependencies: mixture of PWMs and Trees
Barash et al., RECOMB 2003
We only change the motif component of the likelihood model:
\[ P(s, l \mid \theta) = P_b(s) \, \frac{P(s[l+1..l+k] \mid \theta)}{\prod_{i=1}^{k} P_{back}(s[l+i] \mid s[l+i-d..l+i-1])} \]
[Figure panels: Mixture of PWMs; Tree motif]
Learning the model can become more difficult
This is because computing the ML model parameters from complete data may be challenging
Discriminative scores for motifs
• So far we used a generative probabilistic model to learn PWMs
• The model was designed to generate the data from parameters
• We assumed that TFBSs are distributed differently than some fixed background model
• If our background model is wrong, we will get the wrong motifs.
• A different scoring approach tries to maximize the discriminative power of the motif model.
• We will not go here into the details of discriminative vs. generative models, but we shall exemplify the discriminative approach for PWMs.
[Figure panels: lousy discriminator; high-specificity discriminator; high-sensitivity discriminator]
Hypergeometric scores and thresholding PWMs
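For a discriminative score, we need to decide on both the PWM model and the threshold.
[Figure: number of sequences (y-axis) as a function of the PWM score threshold (x-axis), with labels Positive and True positive]
Hypergeometric probability:
\[ P(|A \cap B| = k) = \frac{\binom{|A|}{k} \binom{n - |A|}{|B| - k}}{\binom{n}{|B|}} \]
Here A and B are the two sets of sequences being compared (for example, the positive set and the set of sequences passing the threshold) and n is the total number of sequences.
(the sum over j >= k is the hg p-value)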
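The hypergeometric probability and its tail sum can be computed directly with exact binomial coefficients. This is a small sketch using only the standard library; the function and argument names are illustrative.

```python
from math import comb

def hypergeom_pmf(k, n, size_a, size_b):
    """P(|A ∩ B| = k) when A and B are subsets of n items."""
    return comb(size_a, k) * comb(n - size_a, size_b - k) / comb(n, size_b)

def hypergeom_pvalue(k, n, size_a, size_b):
    """P(|A ∩ B| >= k): sum of the pmf over j = k .. min(|A|, |B|)."""
    return sum(hypergeom_pmf(j, n, size_a, size_b)
               for j in range(k, min(size_a, size_b) + 1))

# Example: 10 sequences, 5 positives, 5 above threshold, 4 in common.
p = hypergeom_pvalue(4, 10, 5, 5)
```

For genome-scale sets the binomial coefficients become very large integers; Python handles them exactly, although a log-space implementation would be faster.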
Exhaustive k-mer search
• A very common strategy for motif finding is to do an exhaustive k-mer search.
• Given a set of hits and a set of non-hits, we will compute the number of occurrences of each k-mer in the two sets and report all cases that have a discriminative score higher than some threshold
• Since k-mers either match or do not match, there is no issue with the threshold
• For DNA, we will typically scan k=5-8.
• This can be done efficiently using a map/hash:
– Iterate on short sequence windows (of the desired k length)
– For each window, mark the appearance of the k-mer in a table
– Avoid double counting using a second map
• It is easy to generalize such exhaustive approaches to include gaps or other types of degeneracy.
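The map-based scan described above might look as follows. This sketch counts, for every k-mer, the number of sequences that contain it, using a per-sequence set as the "second map" that prevents double counting. The function names and toy sequences are illustrative.

```python
from collections import Counter

def kmer_sequence_counts(seqs, k):
    """Count, for each k-mer, the number of sequences containing it."""
    counts = Counter()
    for seq in seqs:
        seen = set()                      # second map: avoids double counting
        for start in range(len(seq) - k + 1):
            seen.add(seq[start:start + k])
        counts.update(seen)
    return counts

def kmer_table(hits, non_hits, k):
    """Return {kmer: (sequences in hits, sequences in non_hits)}."""
    in_hits = kmer_sequence_counts(hits, k)
    in_rest = kmer_sequence_counts(non_hits, k)
    return {kmer: (in_hits[kmer], in_rest[kmer])
            for kmer in in_hits.keys() | in_rest.keys()}

table = kmer_table(["ACGTACGT", "TTACGT"], ["GGGGGG"], 4)
```

Each entry of the table provides the counts needed for a hypergeometric score of that k-mer; reverse-complement handling and gapped k-mers are left out of this sketch.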
Refining k-mers to PWMs using heuristic “EM”
• A k-mer scan is an excellent initial step for finding refined weight matrices. For example, we can use the k-mers to initialize an EM.
• If we want to find a weight matrix, but want to stick to the discriminative setting, we can heuristically use an “EM-like” algorithm:
– Start with a k-mer seed
– Add a uniform prior to generate a PWM
– Compute the optimal PWM threshold (maximal hyper-geometric score)
– Re-estimate the PWM by voting from all PWM true positives
• Consider additional PWM positions
• Bound the position entropies to avoid over-fitting
– Repeat the last two steps until the score fails to improve
• There are of course no guarantees for improving the scores, but empirically this approach works very well.
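The threshold step of this heuristic can be sketched as a scan over candidate thresholds, keeping the one with the smallest hypergeometric p-value. The sketch assumes each sequence has already been reduced to its best PWM score; the names `hg_pvalue` and `best_threshold` and the toy scores are illustrative.

```python
from math import comb

def hg_pvalue(k, n, n_pos, n_called):
    """P(overlap >= k) between n_pos positives and n_called calls out of n."""
    return sum(comb(n_pos, j) * comb(n - n_pos, n_called - j)
               for j in range(k, min(n_pos, n_called) + 1)) / comb(n, n_called)

def best_threshold(scores, is_positive):
    """Try each observed score as a threshold; return (p-value, threshold)."""
    n, n_pos = len(scores), sum(is_positive)
    best = (1.0, None)
    for t in sorted(set(scores)):
        # Labels of the sequences whose best PWM score passes the threshold.
        called = [pos for s, pos in zip(scores, is_positive) if s >= t]
        p = hg_pvalue(sum(called), n, n_pos, len(called))
        if p < best[0]:
            best = (p, t)
    return best

scores = [5.0, 4.0, 3.5, 1.0, 0.5, 0.2]
labels = [True, True, True, False, False, False]
p_value, threshold = best_threshold(scores, labels)
```

The sequences passing the selected threshold that are also positives are the true positives from which the PWM would be re-estimated in the next round.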
High density arrays quantify TF binding preferences and identify binding sites in high throughput
Harbison et al., Nature 2004
• Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome
• The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them
If only biology was that simple…
Discrete and deterministic “binding sites” in yeast as identified by Young, Fraenkel and colleagues
In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution.
PWM regression exploits variable levels of binding affinity to robustly recover binding preferences.
[Figure: scatter plots of PWM sequence energy (y-axis) against ChIP log(binding ratio) (x-axis) for ABF1, GCN4 and MBP1, each panel annotated with its correlation values]
Correlation between PWM predicted binding and ChIP experiments spans high, medium and low affinity sites
Motif regression optimizes the PWM given the overall correlation of the predicted binding energies and the measured ChIP values v_s:
\[ \arg\max_{\theta} \; \mathrm{spearman}(F(s \mid \theta), v_s) \]
Tanay, GR 2006
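The objective being maximized is a rank correlation, which can be computed without external libraries as the Pearson correlation of the ranks. The sketch below only evaluates the score for given predicted energies and ChIP values; the search over PWM parameters is not shown, and the function names and toy numbers are illustrative.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            result[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented example values: predicted energies and ChIP log ratios.
energies = [-16.0, -14.5, -13.0, -12.5]
chip = [-1.0, 0.5, 2.0, 6.0]
rho = spearman(energies, chip)
```

Because only ranks enter the score, the objective is insensitive to the scale of the ChIP values, which is one reason it can use high, medium and low affinity sites together.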
Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains (here just a library of homeodomains, from Berger et al. 2008)
Your Task
1. Download the promoters of the yeast genome from SGD (1000 bp upstream of annotated TSSs)
2. Get the yeast GO gene annotations
3. Implement the discriminative k-mer scanner described above:
– enumerate over all 6-mers (with one gap of up to 6 characters)
– compute the hyper-geometric p-value for discriminating using the motif
– refine the k-mer into a PWM by:
1) build a PWM from the motif seed using a uniform prior (i.e., position i has 97% probability to be equal to the motif character at position i and 1% probability for each of the other characters).
2) compute the optimal PWM likelihood threshold:
- for each sequence find the position with maximum PWM likelihood
- for each threshold on PWM likelihood divide the genome into two sets
- compute the hg p-value according to the intersection with your annotation set
- select the threshold with minimal p-value
3) retrain your PWM using the “hits” that got a score above the likelihood threshold (just count the number of nucleotides at each position)
4) continue iterating until convergence.
4. Search for motifs in selected annotations: cell cycle, ribosome biogenesis, RNA processing, amino acid metabolism, sulfur metabolism, meiosis, stress response, heat shock.
5. To control for your results, shuffle the promoters between the genes and rerun your motif finder while recording your p-values. Determine an empirical p-value threshold and compare it to the expected p-value given just the multiple testing effect.
6. Report the annotations and motifs and the random p-values/likelihoods you got
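Two small helpers for the task can be sketched as follows: turning an ungapped k-mer seed into a PWM with the 97%/1% prior, and shuffling promoters between genes for the control run. Both function names and the example gene identifiers are hypothetical, and gapped seeds are not handled here.

```python
import random

ALPHABET = "ACGT"

def seed_to_pwm(seed, match_prob=0.97):
    """Give the seed character match_prob and split the rest evenly."""
    other = (1.0 - match_prob) / (len(ALPHABET) - 1)
    return [{c: (match_prob if c == m else other) for c in ALPHABET}
            for m in seed]

def shuffle_promoters(promoters_by_gene, seed=0):
    """Reassign promoters to genes at random (control run)."""
    rng = random.Random(seed)
    genes = list(promoters_by_gene)
    promoters = [promoters_by_gene[g] for g in genes]
    rng.shuffle(promoters)
    return dict(zip(genes, promoters))

seed_pwm = seed_to_pwm("ACGCGT")
# Hypothetical gene names and promoter sequences, for illustration only.
toy = {"geneA": "ACGCGTAA", "geneB": "TTTTTTTT", "geneC": "GGGACGCG"}
shuffled = shuffle_promoters(toy)
```

Shuffling keeps the annotation sets and the promoter sequences unchanged and breaks only their pairing, so the p-values from the shuffled runs give an empirical null distribution for the real ones.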