intro to comp genomics lecture 9: motif finding. sequence specific transcription factors sequence...

Intro to Comp Genomics

Lecture 9: Motif finding

Sequence specific transcription factors

• Sequence specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery

• TFs include a DNA binding domain that specifically recognizes “regulatory elements” in the genome.

• The TF-DNA complex is then used to target larger transcriptional structures to the genomic locus.

Sequence specificity is represented using consensus sequences or weight matrices

• The specificity of the TF binding is central to the understanding of the regulatory relations it can form.

• We are therefore interested in defining the DNA motifs that can be recognized by each TF.

• A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing subsets of nucleotides, for example W=[A|T], S=[C|G]).

• A more flexible representation is using weight matrices (PWM/PSSM):

• PWMs are frequently plotted using motif logos, in which the height of each character corresponds to its probability, scaled by the position entropy

Aligned sites: ACGCGT, ACGCGA, ACGCAT, TCGCGA, TAGCGT

Position    1      2      3      4      5      6
A          60%    20%     0      0     20%    40%
C           0     80%     0    100%     0      0
G           0      0    100%     0     80%     0
T          40%     0      0      0      0     60%
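As a rough illustration, the Python sketch below recomputes the per-position frequencies from the five aligned example sites and adds the information content that a logo would use to scale letter heights. The function names are invented for this example, and the 2-bit maximum assumes a uniform background with no small-sample correction.

```python
import math

SITES = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
ALPHABET = "ACGT"

def column_frequencies(sites):
    """Return one dict per position mapping nucleotide -> observed frequency."""
    k = len(sites[0])
    freqs = []
    for i in range(k):
        column = [site[i] for site in sites]
        freqs.append({c: column.count(c) / len(sites) for c in ALPHABET})
    return freqs

def information_content(freq):
    """Bits at one position: 2 minus the entropy (assumes uniform background)."""
    entropy = -sum(p * math.log2(p) for p in freq.values() if p > 0)
    return 2.0 - entropy

freqs = column_frequencies(SITES)
for pos, f in enumerate(freqs, start=1):
    print(pos, f, round(information_content(f), 2))
```

In a logo, the letter for nucleotide c at a position would then be drawn with height proportional to its frequency times that position's information content.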

TF binding energy is approximated by weight matrices

Leu3 data (Liu and Clarke, JMB 2002)

We can interpret weight matrices as energy functions:

$$w_i[s_i] = \log(p_i[s_i])$$

$$E(s) = \sum_i w_i[s_i]$$

This linear approximation is reasonable for most TFs.
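A minimal sketch of this linear energy model in Python, assuming a made-up 3-position PWM (the probabilities are illustrative, not measured values) and the sign convention used here, where a higher sum of log probabilities means a better match:

```python
import math

# Hypothetical PWM for illustration only: one dict of probabilities per position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def energy(site, pwm):
    """E(s) = sum_i w_i[s_i] with w_i[c] = log p_i[c]."""
    return sum(math.log(pwm[i][c]) for i, c in enumerate(site))

def best_window(seq, pwm):
    """Scan all windows of the motif length; return (energy, start) of the best one."""
    k = len(pwm)
    scored = [(energy(seq[l:l + k], pwm), l) for l in range(len(seq) - k + 1)]
    return max(scored)

print(best_window("TTACGTT", PWM))
```

Because the score is a sum over positions, each position contributes independently, which is exactly the linearity assumption stated above.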


TF binding affinity is kinetically important, with possible functional implications

Kalir et al. Science 2001

[Figure: Ume6. Average PWM energy (axis from 5.5 to 11.5) across ChIP ranges; annotations: stronger binding, stronger prediction]

Tanay. Genome Res 2006

TFs are present at only a fraction of their optimal sequence targets. Binding is combinatorially regulated by co-factors, nucleosomes and histone modifications

[Figure: panels with positions relative to the TSS and relative to the ATG]

Lee et al. Nat Gen 2007

[Figure: panels labeled active and inactive]

Barski et al. Cell 2007

TFBSs are clustered in promoters or in “sequence modules”

• The distribution of binding sites in the genome is non-uniform

• In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS

• In larger genomes (fly) we observe CRMs (cis-regulatory modules), which are frequently away from the TSS. These represent enhancers.

• A single binding site, without the context of other co-sites, is unlikely to represent a functional locus

Constructing a weight matrix from aligned TFBSs is trivial

• This is done by counting (or “voting”)

• Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites

• Validated site: usually using “promoter bashing”, i.e. testing reporter constructs with and without the putative site

TRANSFAC 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers

However, there are not really 830 different matrices out there; the real binding repertoire in nature is still somewhat unclear

Probabilistic interpretation of weight matrices and a generative model

• One can think of a weight matrix as a probabilistic model for binding sites:

$$P(m) = \prod_{i=1}^{k} P_i(m[i])$$

• This is the site independent model, defining a probability space over k-mers

• Given a set of aligned k-mers, we know that the ML motif model is derived by voting (a set of independent multinomial variables, like the dice case)

• Now assume we are given a set of sequences that are supposed to include binding sites (one for each), but that we don’t know where the binding sites are.

• In other words the position of the binding site is a hidden variable l.

• We introduce a background model P_back that describes the sequence outside of the binding site (usually a d-order Markov model):

$$P_{back}(s) = \prod_{i=1}^{|s|} P_{back}(s[i] \mid s[i-d..i-1])$$

• Given complete data we can write down the likelihood of a sequence s as:

$$P(s, l \mid \theta) = P_{back}(s) \prod_{i=1}^{k} \frac{P_i(s[l+i])}{P_{back}(s[l+i] \mid s[l+i-d..l+i-1])}$$


Using EM to discover PWMs de-novo

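• Inference of the binding site location posterior:

$$P(l \mid s, \theta) = \frac{P(s, l \mid \theta)}{\sum_{l'} P(s, l' \mid \theta)}$$

• Note that only k factors need to be computed for each location (P_back(s) is constant and cancels)

• Starting with an initial motif model, we can apply a standard EM:

E: $$P(l \mid s_j, \theta) = \frac{P(s_j, l \mid \theta)}{\sum_{l'} P(s_j, l' \mid \theta)}$$

M: $$P_i(c) \propto \sum_{j} \sum_{l=0}^{|s_j|} P(l \mid s_j, \theta) \, \mathbf{1}(s_j[l+i] = c)$$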

• As always with EM, initializing with a reasonable PWM is critical

Following Bailey and Elkan, MEME 1995
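The E and M steps can be sketched in Python as follows. This is a simplified illustration, not the MEME implementation: it assumes exactly one site per sequence, an order-0 background passed in as `bg`, and a small pseudocount `pseudo` in the M step to keep probabilities non-zero. All function names and the toy sequences are invented for the example.

```python
ALPHABET = "ACGT"

def site_posterior(seq, pwm, bg):
    """E step: P(l | s, theta) for each start l, using only k ratio factors per l."""
    k = len(pwm)
    ratios = []
    for l in range(len(seq) - k + 1):
        r = 1.0
        for i in range(k):
            c = seq[l + i]
            r *= pwm[i][c] / bg[c]  # P_back(s) cancels in the posterior
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def em_step(seqs, pwm, bg, pseudo=0.01):
    """M step: expected nucleotide counts weighted by the site posteriors."""
    k = len(pwm)
    counts = [{c: pseudo for c in ALPHABET} for _ in range(k)]
    for seq in seqs:
        for l, weight in enumerate(site_posterior(seq, pwm, bg)):
            for i in range(k):
                counts[i][seq[l + i]] += weight
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def run_em(seqs, pwm, bg, n_iter=50):
    for _ in range(n_iter):
        pwm = em_step(seqs, pwm, bg)
    return pwm

# Toy data: each sequence contains ACG once; the seed PWM is a soft version of ACG.
seqs = ["TTACGTT", "GACGAAT", "CCTACGC"]
bg = {c: 0.25 for c in ALPHABET}
seed = [{c: (0.7 if c == m else 0.1) for c in ALPHABET} for m in "ACG"]
learned = run_em(seqs, seed, bg)
print("".join(max(col, key=col.get) for col in learned))
```

A fixed iteration count is used for brevity; in practice one would more likely stop when the likelihood stops improving and restart from several seeds.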

Allowing false positive sequences

• If we assume some of the sequences may lack a binding site, this should be incorporated into the model:

$$P(s, l \mid \theta) = P(hit) \cdot P_{back}(s) \prod_{i=1}^{k} \frac{P_i(s[l+i])}{P_{back}(s[l+i] \mid s[l+i-d..l+i-1])}$$

[Graphical model: hidden variables hit and l, observed sequence s]

• This is sometimes called the ZOOPS model (zero or one occurrence per sequence)

• In Bayesian terms:

– Probability of sequence hit: P(hit | S)

– Probability of hit at position l: P(l | S)

• We can consider the PWM parameters as variables in the model

• Learning the parameters is then equivalent to inference
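To make the two posteriors concrete, here is a hedged Python sketch. It adds two assumptions that the slide does not state: the prior over site positions is uniform, and a sequence without a site is generated entirely by an order-0 background, so its weight is (1 - P(hit)) P_back(s). The function name `hit_posterior` and the toy PWM are hypothetical.

```python
ALPHABET = "ACGT"

def hit_posterior(seq, pwm, bg, p_hit=0.5):
    """Return (P(hit | s), [P(hit at l | s) for each start l])."""
    k = len(pwm)
    ratios = []
    for l in range(len(seq) - k + 1):
        r = 1.0
        for i in range(k):
            c = seq[l + i]
            r *= pwm[i][c] / bg[c]
        ratios.append(r)
    n_pos = len(ratios)
    # All weights are relative to P_back(s), which cancels in every posterior.
    hit_weights = [p_hit * r / n_pos for r in ratios]  # assumed uniform prior on l
    no_hit_weight = 1.0 - p_hit
    total = sum(hit_weights) + no_hit_weight
    return sum(hit_weights) / total, [w / total for w in hit_weights]

# Toy PWM (soft ACG) and uniform background, for illustration only.
pwm = [{c: (0.7 if c == m else 0.1) for c in ALPHABET} for m in "ACG"]
bg = {c: 0.25 for c in ALPHABET}
print(hit_posterior("TTACGTT", pwm, bg)[0])
print(hit_posterior("TTTTTTT", pwm, bg)[0])
```

In an EM or Gibbs setting, P(hit) itself would typically be re-estimated from the data rather than fixed at 0.5.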

Using Gibbs sampling to discover PWMs de-novo

[Graphical model, repeated per sequence: hidden variables hit and l, observed sequence s]

• We can use Gibbs sampling to sample the hidden sites and estimate the PWM

• This is done by estimating the PWM from all locations except for the one we sample, and computing the hit probabilities as shown before

• Note that we are working with the MAP (maximum a-posteriori) estimate to do the sampling:

$$\theta_{MAP} = \arg\max_{\theta} L(l_1, .., l_{i-1}, l_{i+1}, .., l_n \mid \theta, S)$$

$$P(l_i \mid \theta_{MAP})$$

• But this can be shown to approximate:

$$P(l_i \mid l_1, .., l_{i-1}, l_{i+1}, .., l_n, S)$$

Gibbs: Lawrence et al. Science 1993
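A compact Python sketch of this sampling scheme, under simplifying assumptions that are mine rather than the slide's: every sequence has exactly one site (no hit variable), the background is uniform, and the PWM is estimated with a pseudocount `pseudo`. The names `estimate_pwm` and `gibbs` and the toy sequences are illustrative.

```python
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, positions, k, skip, pseudo=0.5):
    """Vote a PWM from the current sites of all sequences except index `skip`."""
    counts = [{c: pseudo for c in ALPHABET} for _ in range(k)]
    for j, (seq, l) in enumerate(zip(seqs, positions)):
        if j == skip:
            continue
        for i in range(k):
            counts[i][seq[l + i]] += 1
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]

def gibbs(seqs, k, n_sweeps=200, bg=0.25, seed=0):
    rng = random.Random(seed)
    # Start from random site positions.
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_sweeps):
        for j, seq in enumerate(seqs):
            # PWM from all other sequences, then resample this sequence's site.
            pwm = estimate_pwm(seqs, positions, k, skip=j)
            weights = []
            for l in range(len(seq) - k + 1):
                w = 1.0
                for i in range(k):
                    w *= pwm[i][seq[l + i]] / bg
                weights.append(w)
            positions[j] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions, estimate_pwm(seqs, positions, k, skip=None)

seqs = ["TTACGTT", "GACGAAT", "CCTACGC", "ACGTTTT"]
positions, pwm = gibbs(seqs, k=3)
print(positions)
```

Since the sampler is stochastic, the final positions depend on the seed; one would usually keep the highest-scoring configuration seen across sweeps and restarts.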

Generalizing PWMs to allow site dependencies: mixture of PWMs and Trees

Barash et al., RECOMB 2003

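$$P(s, l \mid \theta) = P_{back}(s) \, \frac{P(s[l+1..l+k] \mid \theta)}{\prod_{i=1}^{k} P_{back}(s[l+i] \mid s[l+i-d..l+i-1])}$$

[Figure: example models, a mixture of PWMs and a tree motif]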

We only change the motif component of the likelihood model

Learning the model can become more difficult

This is because computing the ML model parameter from complete data may be challenging

Discriminative scores for motifs

• So far we used a generative probabilistic model to learn PWMs

• The model was designed to generate the data from parameters

• We assumed that TFBSs are distributed differently than some fixed background model

• If our background model is wrong, we will get the wrong motifs.

• A different scoring approach tries to maximize the discriminative power of the motif model.

• We will not go here into the details of discriminative vs. generative models, but we shall exemplify the discriminative approach for PWMs.

[Figure panels: lousy discriminator; high specificity discriminator; high sensitivity discriminator]

Hypergeometric scores and thresholding PWMs

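Hypergeometric probability (the sum over j >= k is the hg p-value), where n is the total number of sequences and A, B are the two sequence sets being compared:

$$P(|A \cap B| = k) = \frac{\binom{|A|}{k} \binom{n - |A|}{|B| - k}}{\binom{n}{|B|}}$$

[Figure: number of sequences as a function of the PWM score threshold, marking positives and true positives]

For a discriminative score, we need to decide on both the PWM model and the threshold.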
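The hypergeometric probability and its tail sum can be computed directly with exact binomial coefficients. The sketch below uses Python's standard `math.comb`; the function names and argument order are choices made for this example.

```python
from math import comb

def hypergeom_pmf(k, n, size_a, size_b):
    """P(|A ∩ B| = k) for sets of sizes size_a and size_b drawn from n items."""
    return comb(size_a, k) * comb(n - size_a, size_b - k) / comb(n, size_b)

def hypergeom_pvalue(k, n, size_a, size_b):
    """Probability of an overlap of at least k (sum of the pmf over j >= k)."""
    upper = min(size_a, size_b)
    return sum(hypergeom_pmf(j, n, size_a, size_b) for j in range(k, upper + 1))

# Example: 10 sequences, two sets of 5, overlapping in 4.
print(hypergeom_pvalue(4, 10, 5, 5))
```

For genome-scale n the integers inside `comb` get large but stay exact; a log-space implementation would be the usual choice if speed matters.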

Exhaustive k-mer search

• A very common strategy for motif finding is to do an exhaustive k-mer search.

• Given a set of hits and a set of non-hits, we will compute the number of occurrences of each k-mer in the two sets and report all cases that have a discriminative score higher than some threshold

• Since k-mers either match or do not match, there is no issue with the threshold

• For DNA, we will typically scan k=5-8.

• This can be done efficiently using a map/hash:

– Iterate on short sequence windows (of the desired k length)

– For each window, mark the appearance of the k-mer in a table

– Avoid double counting using a second map

• It is easy to generalize such exhaustive approaches to include gaps or other types of degeneracy.
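The hash-based scan described above might look like this in Python. It is a sketch: `kmer_sequence_counts` is an invented name, the per-sequence `seen` set plays the role of the second map that prevents double counting, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def kmer_sequence_counts(seqs, k):
    """For each k-mer, count the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seen = set()  # k-mers already counted for this sequence
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if kmer not in seen:
                seen.add(kmer)
                counts[kmer] += 1
    return counts

hits = ["ACGTACGT", "TTACGT"]
non_hits = ["GGGGGGGG", "TTTTACGA"]
hit_counts = kmer_sequence_counts(hits, 4)
non_hit_counts = kmer_sequence_counts(non_hits, 4)
print(hit_counts["ACGT"], non_hit_counts["ACGT"])
```

The two count tables give, for every k-mer, the overlap needed for a hypergeometric score between the set of sequences containing the k-mer and the hit set.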

Refining k-mers to PWMs using heuristic “EM”

• K-mer scan is an excellent initial step for finding refined weight matrices. For example, we can use k-mers to initialize an EM.

• If we want to find a weight matrix, but want to stick to the discriminative setting, we can heuristically use an “EM-like” algorithm:

– Start with a k-mer seed

– Add uniform prior to generate a PWM

– Compute the optimal PWM threshold (maximal hyper-geometric score)

– Re-estimate the PWM by voting from all PWM true positives

  • Consider additional PWM positions

  • Bound the position entropies to avoid over-fitting

– Repeat the last two steps until the score fails to improve

• There are of course no guarantees for improving the scores, but empirically this approach works very well.
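The threshold step of this heuristic can be sketched as below. The example assumes each sequence has already been reduced to a single PWM score (for instance its best window) and that a boolean label marks the positive set; both inputs, and the function names, are hypothetical. Every observed score is tried as a threshold and the one with the smallest hypergeometric p-value is kept.

```python
from math import comb

def hg_pvalue(k, n, size_a, size_b):
    """Probability of an overlap of at least k between sets of size_a and size_b."""
    upper = min(size_a, size_b)
    return sum(
        comb(size_a, j) * comb(n - size_a, size_b - j) / comb(n, size_b)
        for j in range(k, upper + 1)
    )

def best_threshold(scores, is_positive):
    """Return (p_value, threshold) minimizing the hg p-value over observed scores."""
    n = len(scores)
    n_pos = sum(is_positive)
    best = (1.0, None)
    for t in sorted(set(scores)):
        above = [pos for s, pos in zip(scores, is_positive) if s >= t]
        k = sum(above)  # true positives at this threshold
        p = hg_pvalue(k, n, len(above), n_pos)
        if p < best[0]:
            best = (p, t)
    return best

scores = [5.0, 4.5, 4.0, 1.0, 0.5, 0.2, 0.1, 0.0]
labels = [True, True, True, False, False, False, False, False]
print(best_threshold(scores, labels))
```

The sequences scoring above the selected threshold that are also in the positive set are the true positives used to re-estimate the PWM by voting in the next iteration.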

High density arrays quantify TF binding preferences and identify binding sites in high throughput

Harbison et al., Nature 2004

• Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome

• The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them

If only biology was that simple…

Discrete and deterministic “binding sites” in yeast as identified by Young, Fraenkel and colleagues

In fact, binding is rarely deterministic and discrete, and simple wiring is something you should treat with extreme caution.

PWM regression exploits variable levels of binding affinity to robustly recover binding preferences.

[Figure: scatter plots of PWM sequence energy against ChIP log(binding ratio) for ABF1, GCN4 and MBP1, each annotated with correlation values]

Correlation between PWM predicted binding and ChIP experiments spans high, medium and low affinity sites

Motif regression optimizes the PWM given the overall correlation of the predicted binding energies F(s | θ) and the measured ChIP values v_s:

$$\theta^{*} = \arg\max_{\theta} \; \mathrm{spearman}(F(s \mid \theta), v_s)$$

Tanay, GR 2006
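The objective being maximized can be illustrated with a small Python sketch. It only evaluates the Spearman score for a given PWM and does not perform the optimization. The choice of F as the best-window log-likelihood is an assumption made for this example (the slide does not define F), and the toy PWM, sequences and ChIP values are invented.

```python
import math

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = avg
        pos = end + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined if either input is constant

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def max_energy(seq, pwm):
    """Assumed F(s | theta): the best-scoring window under the PWM."""
    k = len(pwm)
    return max(
        sum(math.log(pwm[i][seq[l + i]]) for i in range(k))
        for l in range(len(seq) - k + 1)
    )

def regression_score(seqs, chip_values, pwm):
    return spearman([max_energy(s, pwm) for s in seqs], chip_values)

pwm = [{c: (0.7 if c == m else 0.1) for c in "ACGT"} for m in "ACG"]
print(regression_score(["TTACGTT", "TTACTTT", "TTTTTTT"], [3.0, 2.0, 1.0], pwm))
```

Because the score depends only on ranks, it rewards a PWM that orders sequences correctly across high, medium and low affinity sites, without requiring a binding threshold.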

Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains (here just a library of homeodomains, from Berger et al. 2008)

Your Task

1. Download the promoters of the yeast genome from SGD (1000 bp upstream of annotated TSSs)

2. Get the yeast GO gene annotations

3. Implement the discriminative k-mer scanner described above:

– enumerate over all 6-mers (with one gap of up to 6 characters)

– compute the hyper-geometric p-value for discriminating using the motif

– refine the k-mer into a PWM by:

1) build a PWM from the motif seed using a uniform prior (i.e., position i has 97% probability to be equal to the motif character at position i and 1% probability for each of the other nucleotides).

2) compute the optimal PWM likelihood threshold:

- for each sequence find the position with maximum PWM likelihood

- for each threshold on PWM likelihood divide the genome into two sets

- compute the hg p-value according to the intersection with your annotation set

- select the threshold with minimal p-value

3) retrain your PWM using the “hits” that got a score above the likelihood threshold (just count the number of nucleotides at each position)

4) continue iterating until convergence.

4. Search for motifs in selected annotations: cell cycle, ribosome biogenesis, RNA processing, amino acid metabolism, sulfur metabolism, meiosis, stress response, heat shock.

5. To control for your results, shuffle the promoters between the genes and rerun your motif finder while recording your p-values. Determine an empirical p-value threshold, and compare it to the expected p-value given just the multiple testing effect.

6. Report the annotations and motifs and the random p-values/likelihoods you got
