CS262 Lecture 9, Win07, Batzoglou: Gene Regulation and Microarrays

Post on 19-Dec-2015

TRANSCRIPT

Page 1: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

CS262 Lecture 9, Win07, Batzoglou

Gene Regulation and Microarrays

Page 2: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Finding Regulatory Motifs

Page 3: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Regulatory Motif Discovery

DNA

Group of co-regulated genes

Common subsequence

Find motifs within groups of co-regulated genes

slide credits: M. Kellis

Page 4: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Characteristics of Regulatory Motifs

• Tiny

• Highly Variable

• ~Constant size, because a constant-size transcription factor binds

• Often repeated

• Low-complexity-ish

Page 5: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Sequence Logos

Height of each letter proportional to its frequency

Height of all letters proportional to information content at that position
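
The two height rules can be sketched in a few lines of Python. This is only an illustration: it assumes a uniform background over {A, C, G, T} (so the maximum information is 2 bits) and ignores small-sample corrections; the function names are hypothetical.

```python
import math

def column_information(freqs):
    """Information content, in bits, of one motif column over {A, C, G, T}.

    Assumes a uniform background, so the maximum is log2(4) = 2 bits.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy

def letter_heights(freqs):
    """Height of each letter: its frequency times the column's information."""
    total = column_information(freqs)
    return {letter: p * total for letter, p in freqs.items()}

# A column that is half A and half C carries 1 bit of information,
# split equally between the two letters.
column = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
heights = letter_heights(column)
```

A fully conserved column reaches the full 2 bits, while a uniform column has zero height.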

Page 6: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Problem Definition

Given a collection of promoter sequences s1,…, sN of genes with common expression

Probabilistic

Motif: Mij; 1 ≤ i ≤ W, 1 ≤ j ≤ 4

Mij = Prob[ letter j, pos i ]

Find best M, and positions p1,…, pN in sequences

Combinatorial

Motif M: m1…mW

Some of the mi’s blank

Find M that occurs in all si with ≤ k differences

Page 7: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Discrete Approaches to Motif Finding

Page 8: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Discrete Formulations

Given sequences S = {x1, …, xn}

• A motif W is a consensus string w1…wK

• Find motif W* with “best” match to x1, …, xn

Definition of “best”:

d(W, xi) = min Hamming distance between W and any word in xi

d(W, S) = Σi d(W, xi)
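
The two distances defined above translate directly into code. The following is a minimal sketch; the function names and the toy sequences are made up for illustration, and every sequence is assumed to be at least as long as W.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word_seq(W, x):
    """d(W, x): smallest Hamming distance between W and any |W|-long word of x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_word_set(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_word_seq(W, x) for x in S)

S = ["ACGTACGT", "TTACGATT", "GGGACTTG"]
total = d_word_set("ACGT", S)  # exact match, then one mismatch in each of the other two
```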

Page 9: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T (4^K possibilities)

Find d( W, S )

Report W* = argmin( d(W, S) )

Running time: O( K N 4^K )

(where N = Σi |xi|)

Advantage: Finds provably “best” motif W

Disadvantage: Time
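
A pattern-driven search can be sketched as below. It enumerates all 4^K candidate words, so it is only practical for small K; the helper names and the toy input are assumptions made for this example.

```python
from itertools import product

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word_set(W, S):
    """d(W, S): for each sequence take the best-matching window, then sum."""
    K = len(W)
    return sum(
        min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))
        for x in S
    )

def pattern_driven_search(S, K, alphabet="ACGT"):
    """Try every K-long word over the alphabet and keep the one with minimal d(W, S)."""
    best_word, best_score = None, float("inf")
    for letters in product(alphabet, repeat=K):
        W = "".join(letters)
        score = d_word_set(W, S)
        if score < best_score:
            best_word, best_score = W, score
    return best_word, best_score

S = ["TTACGTAA", "CCACGTCC", "GGACGTGG"]
best_word, best_score = pattern_driven_search(S, 4)
```

In this toy input ACGT is the only 4-mer shared by all three sequences, so it is the unique word with distance 0.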

Page 10: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some xi

Find d( W, S )

Report W* = argmin( d( W, S ) )

or, Report a local improvement of W*

Running time: O( K N^2 )

Advantage: Time

Disadvantage: If the true motif is weak and does not occur in the data, then a random motif may score better than any instance of the true motif
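
The sample-driven variant only scores words that actually occur in the input. The sketch below uses hypothetical helper names and a tiny made-up dataset in which the consensus ACGT never occurs exactly, which illustrates the stated disadvantage.

```python
def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_word_set(W, S):
    K = len(W)
    return sum(
        min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))
        for x in S
    )

def sample_driven_search(S, K):
    """Score only the K-long words that occur in some sequence of S."""
    candidates = sorted({x[i:i + K] for x in S for i in range(len(x) - K + 1)})
    best = min(candidates, key=lambda W: d_word_set(W, S))
    return best, d_word_set(best, S)

# Each sequence is one mutation away from ACGT, but ACGT itself is absent.
weak = ["ACGA", "ACCT", "TCGT"]
found_word, found_score = sample_driven_search(weak, 4)
consensus_score = d_word_set("ACGT", weak)
```

Here every occurring word scores 4, while the absent consensus ACGT would score 3, so the sample-driven search cannot report the best motif.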

Page 11: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

MULTIPROFILER

• Extended sample-driven approach

Given a K-long word W, define:

Nα(W) = words W’ in S s.t. d(W,W’) ≤ α

Idea:

Assume W is occurrence of true motif W*

Will use Nα(W) to correct “errors” in W

Page 12: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

MULTIPROFILER

Assume W differs from true motif W* in at most L positions

Define:

A wordlet G of W is an L-long pattern with blanks, differing from W

L is smaller than the word length K

Example:

K = 7; L = 3

W = ACGTTGA

G = --A--CG

Page 13: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

MULTIPROFILER

Algorithm:

For each W in S:
  For L = 1 to Lmax
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G,
       1. Modify W by the wordlet G → W’
       2. Compute d(W’, S)

Report W* = argmin d(W’, S)

Step 2 above: Smaller motif-finding problem; Use exhaustive search
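
Two of the building blocks, collecting the α-neighbors of W and modifying W by a wordlet, can be sketched as follows. This is not the MULTIPROFILER implementation: the search for "strong" wordlets (step 2) is left out, and the function names and the "-" blank symbol are assumptions for illustration.

```python
def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def neighbors(W, S, alpha):
    """All K-long words in S within Hamming distance alpha of W."""
    K = len(W)
    return [
        x[i:i + K]
        for x in S
        for i in range(len(x) - K + 1)
        if hamming(W, x[i:i + K]) <= alpha
    ]

def apply_wordlet(W, G, blank="-"):
    """Modify W by wordlet G: keep W where G is blank, otherwise take G's letter."""
    return "".join(w if g == blank else g for w, g in zip(W, G))

W = "ACGTTGA"
G = "--A--CG"
W_prime = apply_wordlet(W, G)  # positions 3, 6 and 7 are overwritten
close_words = neighbors(W, ["ACGTTGAAA", "ACATTCGTT"], alpha=3)
```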

Page 14: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization in Motif Finding

Page 15: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

[Figure: all K-long words, each classified as motif or background]

Expectation Maximization

Algorithm (sketch):

1. Given genomic sequences find all k-long words

2. Assume each word is motif or background

3. Find likeliest
   • Motif Model
   • Background Model
   • classification of words into either Motif or Background

Page 16: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization

Given sequences x1, …, xN,

• Find all k-long words X1,…, Xn

• Define motif model:

M = (M1,…, MK)

Mi = (Mi1,…, Mi4) (assume {A, C, G, T})

where Mij = Prob[ letter j occurs in motif position i ]

• Define background model:

B = B1, …, B4

Bj = Prob[ letter j in background sequence ]

[Figure: motif model with columns M1 … MK next to the background model B, over {A, C, G, T}]

Page 17: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization

• Define

Zi1 = { 1, if Xi is motif;

0, otherwise }

Zi2 = { 0, if Xi is motif;

1, otherwise }

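• Given a word Xi = x[s]…x[s+k-1],

P[ Xi, Zi1=1 ] = λ M1x[s]…Mkx[s+k-1]

P[ Xi, Zi2=1 ] = (1 – λ) Bx[s]…Bx[s+k-1]

Let λ1 = λ; λ2 = (1 – λ)

[Figure: mixture model; a word is drawn from the motif M1 … MK with probability λ and from the background B with probability 1 – λ]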

Page 18: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization

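Define:

Parameter space θ = (M, B)

θ1: Motif; θ2: Background

Objective:

Maximize log likelihood of model:

$$
\log P(X_1,\dots,X_n, Z \mid \theta, \lambda)
= \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log\bigl(\lambda_j P(X_i \mid \theta_j)\bigr)
= \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log P(X_i \mid \theta_j)
+ \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log \lambda_j
$$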

Page 19: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization

• Maximize expected likelihood, iterating two steps:

Expectation:

Find expected value of log likelihood:

$$
E\bigl[\log P(X_1,\dots,X_n, Z \mid \theta, \lambda)\bigr]
$$

Maximization:

Maximize expected value over θ, λ

Page 20: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

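Expectation Maximization: E-step

Expectation:

Find expected value of log likelihood:

$$
E\bigl[\log P(X_1,\dots,X_n, Z \mid \theta, \lambda)\bigr]
= \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log P(X_i \mid \theta_j)
+ \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log \lambda_j
$$

where expected values of Z can be computed as follows:

$$
Z^*_{ij} = E[Z_{ij}] = \mathrm{Prob}[Z_{ij} = 1]
= \frac{\lambda_j P(X_i \mid \theta_j)}{\lambda P(X_i \mid \theta_1) + (1-\lambda) P(X_i \mid \theta_2)}
$$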
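
As a rough sketch of the E-step, the code below computes, for each word, the posterior probability that it came from the motif: the motif term weighted by λ, divided by the sum of the motif and background terms. The data layout (M as a list of per-position dictionaries, B as a dictionary) and all names are assumptions for this example, and every probability is assumed to be nonzero.

```python
import math

def word_given_motif(X, M):
    """P(X | motif): product of M[i][letter] over the positions of X."""
    return math.prod(M[i][letter] for i, letter in enumerate(X))

def word_given_background(X, B):
    """P(X | background): product of B[letter] over the letters of X."""
    return math.prod(B[letter] for letter in X)

def e_step(words, M, B, lam):
    """Return Z*_i1 = E[Z_i1] for each word (Z*_i2 is 1 minus this value)."""
    z_motif = []
    for X in words:
        p_motif = lam * word_given_motif(X, M)
        p_background = (1.0 - lam) * word_given_background(X, B)
        z_motif.append(p_motif / (p_motif + p_background))
    return z_motif

# Toy 2-long motif that prefers "AC"; uniform background; lambda = 0.5.
M = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
z = e_step(["AC", "GT"], M, B, lam=0.5)
```

The word AC gets a posterior well above one half, and GT well below it.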

Page 21: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization: M-step

Maximization:

Maximize expected value over θ and λ independently

For λ, this has the following solution: (we won’t prove it)

$$
\lambda_{NEW}
= \arg\max_{\lambda} \sum_{i=1}^{n} \bigl( Z^*_{i1} \log \lambda + Z^*_{i2} \log(1-\lambda) \bigr)
= \frac{1}{n} \sum_{i=1}^{n} Z^*_{i1}
$$

Effectively, λNEW is the expected # of motifs per position, given our current parameters

Page 22: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Expectation Maximization: M-step

• For θ = (M, B), define

cjk = E[ # times letter k appears in motif position j ]

c0k = E[ # times letter k appears in background ]

• cjk values are calculated easily from Z* values

It then follows:

$$
M^{NEW}_{jk} = \frac{c_{jk}}{\sum_{k=1}^{4} c_{jk}}
\qquad
B^{NEW}_{k} = \frac{c_{0k}}{\sum_{k=1}^{4} c_{0k}}
$$

To not allow any 0’s, add pseudocounts
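
A minimal sketch of the M-step under the word-mixture model: each word contributes its posterior weight to the motif counts and the remaining weight to the background counts. The function name, the data layout and the default pseudocount are assumptions for illustration only.

```python
def m_step(words, z_motif, K, alphabet="ACGT", pseudo=1.0):
    """Re-estimate lambda, M and B from the expected assignments Z*.

    z_motif[i] is the posterior probability that word i is a motif occurrence.
    """
    n = len(words)
    lam_new = sum(z_motif) / n  # expected fraction of motif words

    # Expected letter counts, initialised with pseudocounts.
    c = [{a: pseudo for a in alphabet} for _ in range(K)]  # motif, per position
    c0 = {a: pseudo for a in alphabet}                     # background
    for X, zi in zip(words, z_motif):
        for j, letter in enumerate(X):
            c[j][letter] += zi
            c0[letter] += 1.0 - zi

    # Normalise the counts into probabilities.
    M_new = [{a: col[a] / sum(col.values()) for a in alphabet} for col in c]
    B_new = {a: c0[a] / sum(c0.values()) for a in alphabet}
    return lam_new, M_new, B_new

words = ["AC", "AC", "GT"]
lam_new, M_new, B_new = m_step(words, [1.0, 1.0, 0.0], K=2, pseudo=0.0)
```

With hard assignments and no pseudocounts, as in this toy call, the motif becomes exactly AC and the background is split between G and T; a positive pseudocount keeps every entry above zero.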

Page 23: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Initial Parameters Matter!

Consider the following artificial example:

6-mers X1, …, Xn: (n = 2000)

• 990 words “AAAAAA”
• 990 words “CCCCCC”
• 20 words “ACACAC”

Some local maxima:

• λ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA

• λ = 1%; B = 50% C, 50% A; M = 100% ACACAC

Page 24: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Overview of EM Algorithm

1. Initialize parameters θ = (M, B), λ: Try different values of λ from N^(-1/2) up to 1/(2K)

2. Repeat:

a. Expectation

b. Maximization

3. Until change in θ = (M, B), λ falls below ε

4. Report results for several “good” λ

Page 25: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling in Motif Finding

Page 26: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

• Given: x1, …, xN, motif length K, background B

• Find: Model M; locations a1,…, aN in x1, …, xN

Maximizing log-odds likelihood ratio:

$$
\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k,\, x^{i}_{a_i+k})}{B(x^{i}_{a_i+k})}
$$
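
The log-odds objective can be evaluated for a given set of locations as sketched below. The code uses 0-based start positions and natural logarithms, stores M as a list of per-position dictionaries, and uses made-up names and values; all of these are assumptions for this example.

```python
import math

def log_odds_score(seqs, starts, M, B):
    """Sum over sequences and motif positions of log( M(k, letter) / B(letter) )."""
    total = 0.0
    for x, a in zip(seqs, starts):
        for k in range(len(M)):
            letter = x[a + k]
            total += math.log(M[k][letter] / B[letter])
    return total

M = [
    {"A": 0.5, "C": 0.3, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.5, "G": 0.3, "T": 0.1},
]
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

good = log_odds_score(["TTAC"], [2], M, B)  # window "AC" matches the motif
poor = log_odds_score(["TTAC"], [0], M, B)  # window "TT" does not
```

Windows that look like the motif score above zero and windows that look like background score below zero, so maximizing the sum favours placing each location on a motif-like word.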

Page 27: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE

Algorithm (sketch):

1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations

2. Sampling Iterations:
   a. Remove one sequence xi
   b. Recalculate model
   c. Pick a new location of motif in xi according to probability the location is a motif occurrence

Page 28: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

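Initialization:

• Select random locations a1,…, aN in x1, …, xN

• For these locations, compute M:

$$
M_{kj} = \frac{1}{N + B} \Bigl( \sum_{i=1}^{N} \chi\bigl(x^{i}_{a_i+k} = j\bigr) + \beta_j \Bigr)
$$

where χ(·) is 1 when its argument holds and 0 otherwise, βj are pseudocounts to avoid 0s, and B = Σj βj

• That is, Mkj is the number of occurrences of letter j in motif position k, over the total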
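
The initialization can be sketched as follows: pick one random start per sequence, then count letters per motif position and add pseudocounts. Positions are 0-based here, and the function names and the single pseudocount value beta used for every letter are assumptions for this example.

```python
import random

def random_starts(seqs, K, rng):
    """One random motif start per sequence, chosen so the K-long window fits."""
    return [rng.randrange(len(x) - K + 1) for x in seqs]

def build_model(seqs, starts, K, alphabet="ACGT", beta=0.25):
    """M[k][j]: (count of letter j at motif position k + beta) / (N + B),
    where B is the sum of the pseudocounts over the alphabet."""
    N = len(seqs)
    B_total = beta * len(alphabet)
    M = []
    for k in range(K):
        counts = {a: beta for a in alphabet}
        for x, a in zip(seqs, starts):
            counts[x[a + k]] += 1
        M.append({a: counts[a] / (N + B_total) for a in alphabet})
    return M

seqs = ["ACGTT", "ACTGG"]
M_fixed = build_model(seqs, [0, 0], K=2)  # both windows are "AC"

rng = random.Random(0)
starts = random_starts(seqs, 2, rng)
M_random = build_model(seqs, starts, K=2)
```

With two sequences and pseudocounts summing to 1, a letter seen in both windows gets probability (2 + 0.25) / 3 = 0.75.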

Page 29: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

Predictive Update:

• Select a sequence x = xi

• Remove xi, recompute model:

$$
M_{kj} = \frac{1}{(N-1) + B} \Bigl( \sum_{s=1,\, s \ne i}^{N} \chi\bigl(x^{s}_{a_s+k} = j\bigr) + \beta_j \Bigr)
$$

where βj are pseudocounts to avoid 0s, and B = Σj βj

Page 30: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

Sampling:

For every K-long word xj,…,xj+K-1 in x:

Qj = Prob[ word | motif ] = M(1,xj)…M(K,xj+K-1)

Pj = Prob[ word | background ] = B(xj)…B(xj+K-1)

Let

$$
A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-K+1} Q_{j'} / P_{j'}}
$$

Sample a random new position ai according to the probabilities A1,…, A|x|-K+1.

[Figure: probability Aj plotted along the sequence, from position 0 to |x|]
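
One full sampling iteration (predictive update followed by sampling) might look like the sketch below. It is not AlignACE or BioProspector code: positions are 0-based, a single pseudocount beta is used for every letter, and all names and the toy sequences are assumptions.

```python
import random

def build_model(seqs, starts, K, skip, alphabet="ACGT", beta=0.25):
    """Motif model from all sequences except index `skip` (predictive update)."""
    n_used = len(seqs) - 1
    B_total = beta * len(alphabet)
    M = []
    for k in range(K):
        counts = {a: beta for a in alphabet}
        for s, (x, a) in enumerate(zip(seqs, starts)):
            if s != skip:
                counts[x[a + k]] += 1
        M.append({a: counts[a] / (n_used + B_total) for a in alphabet})
    return M

def position_probabilities(x, M, B, K):
    """A_j: normalised ratio Q_j / P_j for every K-long window of x."""
    ratios = []
    for j in range(len(x) - K + 1):
        q = p = 1.0
        for k in range(K):
            q *= M[k][x[j + k]]  # Prob[word | motif]
            p *= B[x[j + k]]     # Prob[word | background]
        ratios.append(q / p)
    total = sum(ratios)
    return [r / total for r in ratios]

def gibbs_step(seqs, starts, i, B, K, rng):
    """Resample the motif location in sequence i; updates `starts` in place."""
    M = build_model(seqs, starts, K, skip=i)
    A = position_probabilities(seqs[i], M, B, K)
    starts[i] = rng.choices(range(len(A)), weights=A, k=1)[0]
    return A

seqs = ["TTACGTTT", "ACGTTTTT", "TTTTACGT"]
starts = [0, 0, 4]  # sequences 2 and 3 already sit on ACGT; sequence 1 does not
B = {a: 0.25 for a in "ACGT"}
A = gibbs_step(seqs, starts, 0, B, 4, random.Random(0))
```

In this toy run almost all of the probability mass falls on the window ACGT at position 2 of the first sequence, so the sampler will very likely move there, but because the position is sampled rather than maximized it can still occasionally pick another window.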

Page 31: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Gibbs Sampling

Running Gibbs Sampling:

1. Initialize

2. Run until convergence

3. Repeat 1,2 several times, report common motifs

Page 32: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Advantages / Disadvantages

• Very similar to EM

Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics

Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space

Page 33: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Repeats, and a Better Background Model

• Repeat DNA can be confused as motif
  Especially low-complexity: CACACA…, AAAAA, etc.

Solution:

more elaborate background model

0th order: B = { pA, pC, pG, pT }

1st order: B = { P(A|A), P(A|C), …, P(T|T) }

Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }

Has been applied to EM and Gibbs (up to 3rd order)
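
A higher-order background can be estimated by counting which letter follows each context of preceding letters. The sketch below is one simple way to do this; the function name, the dictionary layout and the add-one pseudocount are assumptions, and contexts never seen in the training data are simply absent from the result.

```python
from collections import defaultdict

def train_background(seqs, order, alphabet="ACGT", pseudo=1.0):
    """Estimate P(next letter | previous `order` letters) from sequences.

    Returns {context: {letter: probability}}; for order 0 the only
    context is the empty string.
    """
    counts = defaultdict(lambda: {a: pseudo for a in alphabet})
    for x in seqs:
        for i in range(order, len(x)):
            context = x[i - order:i]
            counts[context][x[i]] += 1
    return {
        context: {a: c[a] / sum(c.values()) for a in alphabet}
        for context, c in counts.items()
    }

repeat = ["CACACACA"]
b0 = train_background(repeat, order=0)
b1 = train_background(repeat, order=1)
```

On this low-complexity repeat the 0th-order model gives A only probability 5/12 at each position, while the 1st-order model gives A probability 5/8 right after a C, so the repeat is better explained by the background and is less likely to be reported as a motif.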

Page 34: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Limits of Motif Finders

• Given upstream regions of coregulated genes:
  – Increasing length makes motif finding harder: random motifs clutter the true ones
  – Decreasing length makes motif finding harder: true motif missing in some sequences

Motif Challenge problem:

Find a (15,4) motif in N sequences of length

[Figure: upstream region ending at position 0 next to the gene; the region length is marked “???”]

Page 35: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Example Application: Motifs in Yeast

Group:

Tavazoie et al. 1999, G. Church’s lab, Harvard

Data:

• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles

1. Clustering genes according to common expression

• K-means clustering -> 30 clusters, 50-190 genes/cluster
• Clusters correlate well with known function

2. AlignACE motif finding

• 600-long upstream regions

Page 36: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Motifs in Periodic Clusters

Page 37: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Motifs in Non-periodic Clusters

Page 38: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Motifs are preferentially conserved across evolution

Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA

Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG

Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT

Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT

Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG

Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA

Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT

Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA

[Figure annotations: GAL10 and GAL1 genes; GAL4, MIG1, and TBP binding sites; legend: factor footprint, conservation island]

slide credits: M. Kellis

Is this enough to discover motifs? No.

Page 39: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Comparison-based Regulatory Motif Discovery

Study known motifs

Derive conservation rules

Discover novel motifs

slide credits: M. Kellis

Page 40: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Known motifs are frequently conserved

• Across the human promoter regions, the Erra motif:
  – appears 434 times
  – is conserved 162 times

[Figure: Erra motif instances in aligned promoters of human, dog, mouse, and rat]

Conservation rate: 37%

• Compare to random control motifs
  – Conservation rate of control motifs: 6.8%
  – Erra enrichment: 5.4-fold
  – Erra p-value < 10^-50 (25 standard deviations under binomial)

Motif Conservation Score (MCS)
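
The numbers on this slide can be reproduced with a short calculation. The sketch below treats each of the 434 occurrences as an independent trial that is conserved with the control rate of 6.8% and measures how many binomial standard deviations the observed 162 lies above expectation; the function name is hypothetical and this may not be the exact definition of the MCS.

```python
import math

def conservation_stats(total, conserved, control_rate):
    """Conservation rate, fold enrichment, and binomial z-score versus control."""
    rate = conserved / total
    enrichment = rate / control_rate
    expected = total * control_rate
    std_dev = math.sqrt(total * control_rate * (1.0 - control_rate))
    z_score = (conserved - expected) / std_dev
    return rate, enrichment, z_score

rate, enrichment, z_score = conservation_stats(434, 162, 0.068)
```

This gives a rate of about 37%, an enrichment of roughly 5.4 to 5.5-fold depending on rounding, and a z-score of about 25.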

slide credits: M. Kellis

Page 41: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Finding conserved motifs in whole genomes

M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals

1. Define seed “mini-motifs”

2. Filter and isolate mini-motifs that are more conserved than average

3. Extend mini-motifs to full motifs

4. Validate against known databases of motifs & annotations

5. Report novel motifs

[Figure: example seed mini-motif]

slide credits: M. Kellis

Page 42: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Test 1: Intergenic conservation

[Figure: conserved count vs. total count for mini-motifs; CGG-11-CCG is highlighted]

slide credits: M. Kellis

Page 43: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Test 2: Intergenic vs. Coding

[Figure: intergenic conservation vs. coding conservation; CGG-11-CCG is highlighted; one region is labeled “Higher Conservation in Genes”]

slide credits: M. Kellis

Page 44: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Test 3: Upstream vs. Downstream

[Figure: upstream conservation vs. downstream conservation; CGG-11-CCG is highlighted; labels: “Most Patterns”, “Downstream motifs?”]

slide credits: M. Kellis

Page 45: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Constructing full motifs

[Figure: pipeline from 2,000 mini-motifs to 72 full motifs. Mini-motifs passing Test 1, Test 2, and Test 3 are extended and collapsed, then merged into full motifs.]

slide credits: M. Kellis

Page 46: CS262 Lecture 9, Win07, Batzoglou Gene Regulation and Microarrays

Summary for promoter motifs

Columns: Rank, Discovered Motif, Known TF motif, Tissue Enrichment, Distance bias

1 RCGCAnGCGY NRF-1 Yes Yes

2 CACGTG MYC Yes Yes

3 SCGGAAGY ELK-1 Yes Yes

4 ACTAYRnnnCCCR Yes Yes

5 GATTGGY NF-Y Yes Yes

6 GGGCGGR SP1 Yes Yes

7 TGAnTCA AP-1 Yes

8 TMTCGCGAnR Yes Yes

9 TGAYRTCA ATF3 Yes Yes

10 GCCATnTTG YY1 Yes

11 MGGAAGTG GABP Yes Yes

12 CAGGTG E12 Yes

13 CTTTGT LEF1 Yes

14 TGACGTCA ATF3 Yes Yes

15 CAGCTG AP-4 Yes

16 RYTTCCTG C-ETS-2 Yes Yes

17 AACTTT IRF1(*) Yes

18 TCAnnTGAY SREBP-1 Yes Yes

19 GKCGCn(7)TGAYG Yes Yes

20 GTGACGY E4F1 Yes Yes

21 GGAAnCGGAAnY Yes Yes

22 TGCGCAnK Yes Yes

23 TAATTA CHX10 Yes

24 GGGAGGRR MAZ Yes

25 TGACCTY ERRA Yes

• 174 promoter motifs
  – 70 match known TF motifs
  – 115 expression enrichment
  – 60 show positional bias
  – 75% have evidence

• Control sequences
  – < 2% match known TF motifs
  – < 5% expression enrichment
  – < 3% show positional bias
  – < 7% false positives

Most discovered motifs are likely to be functional

[Slide labels “New” mark the motifs with no known TF match.]

slide credits: M. Kellis