CS262 Lecture 9, Win07, Batzoglou
Gene Regulation and Microarrays
Finding Regulatory Motifs
Regulatory Motif Discovery
[Diagram: DNA → group of co-regulated genes → common subsequence]
Find motifs within groups of co-regulated genes
slide credits: M. Kellis
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant size (because a constant-size transcription factor binds)
• Often repeated
• Low-complexity-ish
Sequence Logos
Height of each letter proportional to its frequency at that position
Total height of the stack proportional to the information content at that position
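As a concrete illustration (not from the lecture): for a 4-letter alphabet, the information content of one logo column is 2 bits minus the Shannon entropy of the observed letter frequencies. A minimal Python sketch with made-up counts:

```python
import math

def column_info(counts):
    """Information content (bits) of one motif column.

    counts: dict mapping letter -> count at this position.
    For a 4-letter alphabet, info = 2 - H, where H is the
    Shannon entropy of the letter frequencies."""
    total = sum(counts.values())
    freqs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(f * math.log2(f) for f in freqs)
    return 2.0 - entropy

# A perfectly conserved column carries the full 2 bits:
print(column_info({"A": 10}))                          # 2.0
# A uniform column carries no information:
print(column_info({"A": 5, "C": 5, "G": 5, "T": 5}))   # 0.0
```

In a logo, each letter is then drawn with height = (letter frequency) × (column information).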
Problem Definition

Given a collection of promoter sequences s1,…, sN of genes with common expression:

Probabilistic
  Motif: Mij;  1 ≤ i ≤ W,  1 ≤ j ≤ 4
  Mij = Prob[ letter j, pos i ]
  Find best M, and positions p1,…, pN in sequences

Combinatorial
  Motif M: m1…mW
  Some of the mi’s blank
  Find M that occurs in all si with ≤ k differences
Discrete Approaches to Motif Finding
Discrete Formulations
Given sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = minimum Hamming distance between W and any K-long word in xi
d(W, S) = Σi d(W, xi)
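The two distance functions translate directly into code; a short sketch (the sequences are toy examples):

```python
def d_word(W, x):
    """d(W, x): minimum Hamming distance between W and any
    |W|-long word (substring) of x."""
    K = len(W)
    return min(
        sum(a != b for a, b in zip(W, x[s:s + K]))
        for s in range(len(x) - K + 1)
    )

def d_total(W, S):
    """d(W, S) = sum over all sequences of the best-match distance."""
    return sum(d_word(W, x) for x in S)

S = ["TTACGTAA", "ACGTCCCC", "GGGACGTT"]
print(d_total("ACGT", S))  # 0: ACGT occurs exactly in every sequence
```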
Exhaustive Searches
1. Pattern-driven algorithm:
   For W = AA…A to TT…T (4^K possibilities)
     Find d(W, S)
   Report W* = argmin d(W, S)
   Running time: O(K N 4^K)
   (where N = Σi |xi|)
Advantage: finds provably “best” motif W
Disadvantage: time
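The pattern-driven loop can be written in a few lines; the sketch below enumerates all 4^K consensus strings, so it is only feasible for small K (the example sequences are invented):

```python
from itertools import product

def best_motif(S, K):
    """Pattern-driven exhaustive search: try every consensus
    string W in {A,C,G,T}^K and return one minimizing
    d(W, S) = sum of min-Hamming distances.  O(K N 4^K)."""
    def dist(W, x):
        return min(sum(a != b for a, b in zip(W, x[s:s + K]))
                   for s in range(len(x) - K + 1))
    return min(
        ("".join(w) for w in product("ACGT", repeat=K)),
        key=lambda W: sum(dist(W, x) for x in S),
    )

S = ["TTACGAA", "CACGAGG", "GGTACGA"]
print(best_motif(S, 4))  # ACGA: the only 4-mer present in all three
```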
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
Find d( W, S )
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N²)
Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data,
then a random motif may score better than any instance of the true motif
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W, W’) ≤ α
Idea:
Assume W is an occurrence of the true motif W*
Use Nα(W) to correct “errors” in W
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is an L-long pattern with blanks that differs from W in its
L non-blank positions (L is smaller than the word length K)
Example:
K = 7; L = 3
W = ACGTTGA;  G = --A--CG
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax:
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G:
       a. Modify W by the wordlet G → W’
       b. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above is a smaller motif-finding problem; use exhaustive search
Expectation Maximization in Motif Finding
[Diagram: all K-long words partitioned into motif and background]
Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest:
   Motif model
   Background model
   classification of words into Motif or Background
Expectation Maximization
Given sequences x1, …, xN,
• Find all k-long words X1,…, Xn
• Define motif model:
M = (M1,…, MK)
Mi = (Mi1,…, Mi4)(assume {A, C, G, T})
where Mij = Prob[ letter j occurs in motif position i ]
• Define background model:
B = (B1, …, B4)
where Bj = Prob[ letter j in background sequence ]
Expectation Maximization
• Define
  Zi1 = 1 if Xi is motif, 0 otherwise
  Zi2 = 1 − Zi1
• Given a word Xi = x[s]…x[s+K−1],
  P[ Xi, Zi1 = 1 ] = λ · M1,x[s] … MK,x[s+K−1]
  P[ Xi, Zi2 = 1 ] = (1 − λ) · Bx[s] … Bx[s+K−1]
Let λ1 = λ;  λ2 = 1 − λ
Expectation Maximization
Define:
Parameter space: θ = (M, B);  λ
j = 1: Motif;  j = 2: Background
Objective:
Maximize log likelihood of model:

  log P(X1,…, Xn, Z | θ, λ) = Σi=1..n Σj=1,2 Zij log P(Xi | θj)  +  Σi=1..n Σj=1,2 Zij log λj
Expectation Maximization
• Maximize expected likelihood, by iterating two steps:
Expectation:
  Find expected value of the log likelihood:  E[ log P(X1,…, Xn, Z | θ, λ) ]
Maximization:
  Maximize that expected value over θ, λ
Expectation Maximization: E-step
Expectation:
Find expected value of log likelihood:

  E[ log P(X1,…, Xn, Z | θ, λ) ] = Σi=1..n Σj=1,2 E[Zij] log P(Xi | θj)  +  Σi=1..n Σj=1,2 E[Zij] log λj

where the expected values of Z can be computed as follows:

  Z*ij = E[Zij] = Prob[ Zij = 1 ] = λj P(Xi | θj) / ( λ P(Xi | θ1) + (1 − λ) P(Xi | θ2) )
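In code, the E-step is just a per-word posterior computation; a minimal sketch (the profile values and λ below are made-up):

```python
def posteriors(words, M, B, lam):
    """E-step: Z*_i1 = P[word i is a motif occurrence | params].

    M: list of dicts, M[j][letter] = Prob[letter at motif position j]
    B: dict, B[letter] = background frequency
    lam: prior probability that a word is a motif occurrence."""
    Z = []
    for w in words:
        p_motif = lam
        p_back = 1.0 - lam
        for j, c in enumerate(w):
            p_motif *= M[j][c]
            p_back *= B[c]
        Z.append(p_motif / (p_motif + p_back))
    return Z

# Toy 3-position profile heavily favoring "AAA", uniform background:
M = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(posteriors(["AAA", "ACG"], M, B, lam=0.1))
```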
Expectation Maximization: M-step
Maximization:
Maximize expected value over λ and θ independently
For λ, this has the following solution (we won’t prove it):

  λNEW = argmaxλ Σi=1..n ( Z*i1 log λ + Z*i2 log(1 − λ) ) = (1/n) Σi=1..n Z*i1

Effectively, λNEW is the expected # of motifs per position, given our current parameters
• For θ = (M, B), define
  cjk = E[ # times letter k appears in motif position j ]
  c0k = E[ # times letter k appears in background ]
• cjk values are calculated easily from the Z* values
It then follows:

  MjkNEW = cjk / Σk'=1..4 cjk'
  BkNEW = c0k / Σk'=1..4 c0k'

To not allow any 0’s, add pseudocounts
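The M-step then reduces to weighted counting; a sketch (the pseudocount value is an arbitrary choice):

```python
def m_step(words, Z, pseudo=1.0):
    """M-step: re-estimate lambda, M, B from expected memberships Z.

    c_jk = expected count of letter k at motif position j (weight Z*_i1);
    c_0k = expected background count of letter k (weight Z*_i2 = 1 - Z*_i1).
    Pseudocounts keep every probability nonzero."""
    K = len(words[0])
    letters = "ACGT"
    c_motif = [{a: pseudo for a in letters} for _ in range(K)]
    c_back = {a: pseudo for a in letters}
    for w, z in zip(words, Z):
        for j, ch in enumerate(w):
            c_motif[j][ch] += z        # motif-weighted count
            c_back[ch] += 1.0 - z      # background-weighted count
    lam_new = sum(Z) / len(Z)          # expected fraction of motif words
    M = [{a: c[a] / sum(c.values()) for a in letters} for c in c_motif]
    B = {a: c_back[a] / sum(c_back.values()) for a in letters}
    return lam_new, M, B

lam, M, B = m_step(["AAA", "CCC", "AAC"], [0.9, 0.1, 0.5])
print(round(lam, 2), round(M[0]["A"], 2))
```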
Initial Parameters Matter!
Consider the following artificial example:
6-mers X1, …, Xn  (n = 2000):
  990 words “AAAAAA”,  990 words “CCCCCC”,  20 words “ACACAC”
Some local maxima:
  λ = 49.5%;  B = 100/101 C, 1/101 A;  M = 100% AAAAAA
  λ = 1%;  B = 50% C, 50% A;  M = 100% ACACAC
Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ:
   Try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
   a. Expectation
   b. Maximization
3. Until change in θ = (M, B), λ falls below ε
4. Report results for several “good” λ
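The overview above, assembled into one self-contained toy EM motif finder (a sketch only: the initialization bias toward the first word, the iteration count, and the pseudocounts are all arbitrary choices, and real finders add many refinements):

```python
def em_motif(words, K, lam=0.05, iters=50, pseudo=0.5):
    """Toy EM motif finder over K-long words: alternate the E-step
    (posterior P[word is motif]) and M-step (re-estimate M, B, lambda)."""
    letters = "ACGT"
    # Initialization: near-uniform profile, nudged toward the first word.
    M = [{a: 0.25 for a in letters} for _ in range(K)]
    for j, ch in enumerate(words[0]):
        M[j][ch] = 0.4
        s = sum(M[j].values())
        M[j] = {a: v / s for a, v in M[j].items()}
    B = {a: 0.25 for a in letters}
    for _ in range(iters):
        # E-step: expected memberships.
        Z = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, ch in enumerate(w):
                pm *= M[j][ch]
                pb *= B[ch]
            Z.append(pm / (pm + pb))
        # M-step: weighted counts with pseudocounts.
        cm = [{a: pseudo for a in letters} for _ in range(K)]
        cb = {a: pseudo for a in letters}
        for w, z in zip(words, Z):
            for j, ch in enumerate(w):
                cm[j][ch] += z
                cb[ch] += 1.0 - z
        lam = sum(Z) / len(Z)
        M = [{a: c[a] / sum(c.values()) for a in letters} for c in cm]
        B = {a: cb[a] / sum(cb.values()) for a in letters}
    return M, B, lam

# Planted toy data: 10 copies of the "motif" ACG among varied background words.
words = ["ACG"] * 10 + ["TTT", "GGA", "CAT", "TGC", "ATA"] * 6
M, B, lam = em_motif(words, K=3)
print("".join(max(col, key=col.get) for col in M), round(lam, 2))
```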
Gibbs Sampling in Motif Finding
Gibbs Sampling
• Given: x1, …, xN; motif length K; background model B
• Find: model M
        locations a1,…, aN in x1, …, xN
maximizing the log-odds likelihood ratio:

  Σi=1..N Σk=1..K log ( M(k, xi[ai+k]) / B(xi[ai+k]) )
Gibbs Sampling
• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence xi
   b. Recalculate model
   c. Pick a new location of motif in xi according to the probability that
      the location is a motif occurrence
Gibbs Sampling
Initialization:
• Select random locations a1,…, aN in x1, …, xN
• For these locations, compute M:

  Mkj = (1 / (N + B)) ( βj + Σi=1..N [ xi[ai + k] = j ] )

where the βj are pseudocounts to avoid 0s, and B = Σj βj
• That is, Mkj is the number of occurrences of letter j in motif position k, over the total count (plus pseudocounts)
Gibbs Sampling
Predictive update:
• Select a sequence x = xi
• Remove xi; recompute the model:

  Mkj = (1 / (N − 1 + B)) ( βj + Σs=1..N, s≠i [ xs[as + k] = j ] )

where the βj are pseudocounts to avoid 0s, and B = Σj βj
Gibbs Sampling
Sampling:
For every K-long word x[j]…x[j+K−1] in x:
  Qj = Prob[ word | motif ] = M(1, x[j]) × … × M(K, x[j+K−1])
  Pj = Prob[ word | background ] = B(x[j]) × … × B(x[j+K−1])
Let
  Aj = (Qj / Pj) / Σj'=1..|x|−K+1 (Qj' / Pj')
Sample a random new position ai according to the probabilities A1,…, A|x|−K+1
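The sampling step can be sketched directly with `random.choices` (the profile and sequence below are toy values):

```python
import random

def sample_position(x, M, B, K, rng=random):
    """Pick a new motif start in x with probability A_j proportional
    to Q_j / P_j (motif likelihood over background likelihood)."""
    ratios = []
    for j in range(len(x) - K + 1):
        q = p = 1.0
        for k in range(K):
            q *= M[k][x[j + k]]   # Q_j = P[word | motif]
            p *= B[x[j + k]]      # P_j = P[word | background]
        ratios.append(q / p)
    # random.choices normalizes the weights, giving the A_j distribution
    return rng.choices(range(len(ratios)), weights=ratios)[0]

# Toy profile strongly favoring ACG; uniform background:
M = [{a: (0.97 if a == t else 0.01) for a in "ACGT"} for t in "ACG"]
B = {a: 0.25 for a in "ACGT"}
rng = random.Random(0)
draws = [sample_position("TTTACGTTT", M, B, K=3, rng=rng) for _ in range(100)]
print(draws.count(3))  # the true occurrence at position 3 dominates
```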
Gibbs Sampling
Running Gibbs Sampling:
1. Initialize
2. Run until convergence
3. Repeat 1,2 several times, report common motifs
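The whole loop, in a self-contained toy sampler (a sketch only: the planted sequences, pseudocounts, and iteration count are invented, and a real implementation like AlignACE adds convergence detection and multiple restarts):

```python
import random

def gibbs_motif(seqs, K, iters=200, pseudo=0.5, seed=0):
    """Toy Gibbs motif sampler: random starts, then repeatedly hold one
    sequence out, rebuild the profile from the rest (predictive update),
    and resample the held-out sequence's motif start."""
    rng = random.Random(seed)
    letters = "ACGT"
    # 0th-order background from overall letter frequencies (+1 pseudocounts)
    total = sum(len(s) for s in seqs)
    B = {a: (sum(s.count(a) for s in seqs) + 1) / (total + 4) for a in letters}
    starts = [rng.randrange(len(s) - K + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))
        # Profile from all sequences except i.
        M = [{a: pseudo for a in letters} for _ in range(K)]
        for i2, s in enumerate(seqs):
            if i2 != i:
                for k in range(K):
                    M[k][s[starts[i2] + k]] += 1.0
        M = [{a: c[a] / sum(c.values()) for a in letters} for c in M]
        # Resample the start in sequence i with weights Q_j / P_j.
        x = seqs[i]
        w = []
        for j in range(len(x) - K + 1):
            q = p = 1.0
            for k in range(K):
                q *= M[k][x[j + k]]
                p *= B[x[j + k]]
            w.append(q / p)
        starts[i] = rng.choices(range(len(w)), weights=w)[0]
    return starts

seqs = ["TTTTGACGTT", "AGACGTTTTT", "TTTTTGACGT", "TGACGTTTTT"]
print(gibbs_motif(seqs, K=5))  # ideally the starts of the planted GACGT copies
```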
Advantages / Disadvantages
• Very similar to EM
Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics
Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space
Repeats, and a Better Background Model
• Repeat DNA can be confused with motifs, especially low-complexity repeats (CACACA…, AAAAA, etc.)
Solution:
more elaborate background model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK);  X, bi ∈ {A, C, G, T} }
Has been applied to EM and Gibbs (up to 3rd order)
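A higher-order background is just a table of conditional letter probabilities; a sketch of a general k-th order estimator with +1 pseudocounts (the toy repeat sequence is invented):

```python
from collections import defaultdict

def markov_background(seqs, order):
    """Estimate a k-th order background model:
    P[x | b1...bk] for every observed k-long context,
    with +1 pseudocounts over the 4-letter alphabet."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0
    model = {}
    for ctx, c in counts.items():
        tot = sum(c.values()) + 4.0
        model[ctx] = {a: (c[a] + 1.0) / tot for a in "ACGT"}
    return model

B1 = markov_background(["CACACACACA"], order=1)
# After a C, an A is far more likely than under a 0th-order model,
# so CA-repeats score as expected background rather than as motif:
print(B1["C"]["A"], B1["C"]["T"])
```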
Limits of Motif Finders
• Given upstream regions of co-regulated genes:
  Increasing length makes motif finding harder – random motifs clutter the true ones
  Decreasing length makes motif finding harder – true motif missing in some sequences
Motif Challenge problem:
  Find a (15,4)-motif in N sequences of length 600
Example Application: Motifs in Yeast
Group:
Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles
1. Clustering genes according to common expression
   • K-means clustering → 30 clusters, 50–190 genes/cluster
   • Clusters correlate well with known function
2. AlignACE motif finding
   • 600-bp upstream regions
Motifs in Periodic Clusters
Motifs in Non-periodic Clusters
Motifs are preferentially conserved across evolution
[Four-species alignment of the GAL1–GAL10 promoter region: S. cerevisiae (Scer), S. paradoxus (Spar), S. mikatae (Smik), S. bayanus (Sbay), with asterisks marking conserved positions. The GAL4, MIG1, and TBP factor footprints coincide with conservation islands between GAL10 and GAL1.]
slide credits: M. Kellis
Is this enough to discover motifs? No.
Comparison-based Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
slide credits: M. Kellis
Known motifs are frequently conserved
• Across the human promoter regions, the Erra motif:
  appears 434 times
  is conserved 162 times (in alignments of human, dog, mouse, and rat)
  Conservation rate: 37%
• Compare to random control motifs:
  – Conservation rate of control motifs: 6.8%
  – Erra enrichment: 5.4-fold
  – Erra p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
slide credits: M. Kellis
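The quoted significance can be sanity-checked with a normal approximation to the binomial, using the slide's numbers:

```python
import math

def conservation_z(total, conserved, control_rate):
    """z-score of the observed conserved count under a binomial null
    with the control motifs' conservation rate."""
    mean = total * control_rate
    sd = math.sqrt(total * control_rate * (1.0 - control_rate))
    return (conserved - mean) / sd

# Erra: 434 occurrences, 162 conserved; control conservation rate 6.8%
z = conservation_z(434, 162, 0.068)
print(round(z, 1))  # about 25 standard deviations, matching the slide
```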
Finding conserved motifs in whole genomes
(M. Kellis PhD thesis on yeasts; X. Xie & M. Kellis on mammals)
1. Define seed “mini-motifs”
2. Filter and isolate mini-motifs that are more conserved than average
3. Extend mini-motifs to full motifs
4. Validate against known databases of motifs & annotations
5. Report novel motifs
slide credits: M. Kellis
Test 1: Intergenic conservation
[Scatter plot: conserved count vs. total count for each mini-motif; CGG-11-CCG stands out as unusually conserved]
slide credits: M. Kellis
Test 2: Intergenic vs. Coding
[Scatter plot: intergenic conservation vs. coding conservation. Most patterns show higher conservation in genes; CGG-11-CCG stands out as preferentially conserved in intergenic regions]
slide credits: M. Kellis
Test 3: Upstream vs. Downstream
[Scatter plot: upstream conservation vs. downstream conservation. Most patterns are preferentially conserved upstream; a few outliers suggest downstream motifs. CGG-11-CCG is upstream-conserved]
slide credits: M. Kellis
Constructing full motifs
[Pipeline diagram: ~2,000 mini-motifs passing Tests 1–3 → Extend → Collapse → Merge → 72 full motifs]
slide credits: M. Kellis
Summary for promoter motifs
Rank  Discovered motif  Known TF motif  Tissue enrichment  Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR New Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR New Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1(*) Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG New Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY New Yes Yes
22 TGCGCAnK New Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
• 174 promoter motifs:
  70 match known TF motifs
  115 show expression enrichment
  60 show positional bias
  75% have some evidence
• Control sequences:
  < 2% match known TF motifs
  < 5% expression enrichment
  < 3% show positional bias
  → < 7% false positives
Most discovered motifs are likely to be functional
slide credits: M. Kellis