CS262 Lecture 9, Win07, Batzoglou
Gene Regulation and Microarrays
Finding Regulatory Motifs
Regulatory Motif Discovery
[Diagram: DNA → group of co-regulated genes → common subsequence]
Find motifs within groups of co-regulated genes
slide credits: M. Kellis
Characteristics of Regulatory Motifs
• Tiny
• Highly Variable
• ~Constant size (because a constant-size transcription factor binds)
• Often repeated
• Low-complexity-ish
Sequence Logos
Height of each letter proportional to its frequency at that position
Total height of the stack proportional to the information content at that position
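As a concrete illustration (not from the lecture): for a 4-letter alphabet, the information content of one logo column is 2 bits minus the Shannon entropy of the observed letter frequencies. A minimal Python sketch with made-up counts:

```python
import math

def column_info(counts):
    """Information content (bits) of one motif column.

    counts: dict mapping letter -> count at this position.
    For a 4-letter alphabet, info = 2 - H, where H is the
    Shannon entropy of the letter frequencies."""
    total = sum(counts.values())
    freqs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(f * math.log2(f) for f in freqs)
    return 2.0 - entropy

# A perfectly conserved column carries the full 2 bits:
print(column_info({"A": 10}))                          # 2.0
# A uniform column carries no information:
print(column_info({"A": 5, "C": 5, "G": 5, "T": 5}))   # 0.0
```

In a logo, each letter is then drawn with height = (letter frequency) × (column information).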
Problem Definition

Given a collection of promoter sequences s1,…, sN of genes with common expression:

Probabilistic
  Motif: Mij;  1 ≤ i ≤ W,  1 ≤ j ≤ 4
  Mij = Prob[ letter j, pos i ]
  Find best M, and positions p1,…, pN in sequences

Combinatorial
  Motif M: m1…mW
  Some of the mi’s blank
  Find M that occurs in all si with ≤ k differences
Discrete Approaches to Motif Finding
Discrete Formulations
Given sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = minimum Hamming distance between W and any K-long word in xi
d(W, S) = Σi d(W, xi)
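The two distance functions translate directly into code; a short sketch (the sequences are toy examples):

```python
def d_word(W, x):
    """d(W, x): minimum Hamming distance between W and any
    |W|-long word (substring) of x."""
    K = len(W)
    return min(
        sum(a != b for a, b in zip(W, x[s:s + K]))
        for s in range(len(x) - K + 1)
    )

def d_total(W, S):
    """d(W, S) = sum over all sequences of the best-match distance."""
    return sum(d_word(W, x) for x in S)

S = ["TTACGTAA", "ACGTCCCC", "GGGACGTT"]
print(d_total("ACGT", S))  # 0: ACGT occurs exactly in every sequence
```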
Exhaustive Searches
1. Pattern-driven algorithm:
   For W = AA…A to TT…T (4^K possibilities)
     Find d(W, S)
   Report W* = argmin d(W, S)
   Running time: O(K N 4^K)
   (where N = Σi |xi|)
Advantage: finds provably “best” motif W
Disadvantage: time
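The pattern-driven loop can be written in a few lines; the sketch below enumerates all 4^K consensus strings, so it is only feasible for small K (the example sequences are invented):

```python
from itertools import product

def best_motif(S, K):
    """Pattern-driven exhaustive search: try every consensus
    string W in {A,C,G,T}^K and return one minimizing
    d(W, S) = sum of min-Hamming distances.  O(K N 4^K)."""
    def dist(W, x):
        return min(sum(a != b for a, b in zip(W, x[s:s + K]))
                   for s in range(len(x) - K + 1))
    return min(
        ("".join(w) for w in product("ACGT", repeat=K)),
        key=lambda W: sum(dist(W, x) for x in S),
    )

S = ["TTACGAA", "CACGAGG", "GGTACGA"]
print(best_motif(S, 4))  # ACGA: the only 4-mer present in all three
```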
Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi
Find d( W, S )
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N²)
Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data,
then a random motif may score better than any instance of the true motif
MULTIPROFILER
• Extended sample-driven approach
Given a K-long word W, define:
Nα(W) = words W’ in S s.t. d(W, W’) ≤ α
Idea:
Assume W is an occurrence of the true motif W*
Use Nα(W) to correct “errors” in W
MULTIPROFILER
Assume W differs from true motif W* in at most L positions
Define:
A wordlet G of W is an L-long pattern with blanks that differs from W in its
L non-blank positions (L is smaller than the word length K)
Example:
K = 7; L = 3
W = ACGTTGA;  G = --A--CG
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax:
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all “strong” L-long wordlets G in Nα(W)
    3. For each wordlet G:
       a. Modify W by the wordlet G → W’
       b. Compute d(W’, S)
Report W* = argmin d(W’, S)
Step 2 above is a smaller motif-finding problem; use exhaustive search
Expectation Maximization in Motif Finding
[Diagram: all K-long words partitioned into motif and background]
Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest:
   Motif model
   Background model
   classification of words into Motif or Background
Expectation Maximization
Given sequences x1, …, xN,
• Find all k-long words X1,…, Xn
• Define motif model:
M = (M1,…, MK)
Mi = (Mi1,…, Mi4)(assume {A, C, G, T})
where Mij = Prob[ letter j occurs in motif position i ]
• Define background model:
B = (B1, …, B4)
where Bj = Prob[ letter j in background sequence ]
Expectation Maximization
• Define
  Zi1 = 1 if Xi is motif, 0 otherwise
  Zi2 = 1 − Zi1
• Given a word Xi = x[s]…x[s+K−1],
  P[ Xi, Zi1 = 1 ] = λ · M1,x[s] … MK,x[s+K−1]
  P[ Xi, Zi2 = 1 ] = (1 − λ) · Bx[s] … Bx[s+K−1]
Let λ1 = λ;  λ2 = 1 − λ
Expectation Maximization
Define:
Parameter space: θ = (M, B);  λ
j = 1: Motif;  j = 2: Background
Objective:
Maximize log likelihood of model:

  log P(X1,…, Xn, Z | θ, λ) = Σi=1..n Σj=1,2 Zij log P(Xi | θj)  +  Σi=1..n Σj=1,2 Zij log λj
Expectation Maximization
• Maximize expected likelihood, by iterating two steps:
Expectation:
  Find expected value of the log likelihood:  E[ log P(X1,…, Xn, Z | θ, λ) ]
Maximization:
  Maximize that expected value over θ, λ
Expectation Maximization: E-step
Expectation:
Find expected value of log likelihood:

  E[ log P(X1,…, Xn, Z | θ, λ) ] = Σi=1..n Σj=1,2 E[Zij] log P(Xi | θj)  +  Σi=1..n Σj=1,2 E[Zij] log λj

where the expected values of Z can be computed as follows:

  Z*ij = E[Zij] = Prob[ Zij = 1 ] = λj P(Xi | θj) / ( λ P(Xi | θ1) + (1 − λ) P(Xi | θ2) )
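In code, the E-step is just a per-word posterior computation; a minimal sketch (the profile values and λ below are made-up):

```python
def posteriors(words, M, B, lam):
    """E-step: Z*_i1 = P[word i is a motif occurrence | params].

    M: list of dicts, M[j][letter] = Prob[letter at motif position j]
    B: dict, B[letter] = background frequency
    lam: prior probability that a word is a motif occurrence."""
    Z = []
    for w in words:
        p_motif = lam
        p_back = 1.0 - lam
        for j, c in enumerate(w):
            p_motif *= M[j][c]
            p_back *= B[c]
        Z.append(p_motif / (p_motif + p_back))
    return Z

# Toy 3-position profile heavily favoring "AAA", uniform background:
M = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}] * 3
B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(posteriors(["AAA", "ACG"], M, B, lam=0.1))
```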
Expectation Maximization: M-step
Maximization:
Maximize expected value over λ and θ independently
For λ, this has the following solution (we won’t prove it):

  λNEW = argmaxλ Σi=1..n ( Z*i1 log λ + Z*i2 log(1 − λ) ) = (1/n) Σi=1..n Z*i1

Effectively, λNEW is the expected # of motifs per position, given our current parameters
• For θ = (M, B), define
  cjk = E[ # times letter k appears in motif position j ]
  c0k = E[ # times letter k appears in background ]
• cjk values are calculated easily from the Z* values
It then follows:

  MjkNEW = cjk / Σk'=1..4 cjk'
  BkNEW = c0k / Σk'=1..4 c0k'

To not allow any 0’s, add pseudocounts
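The M-step then reduces to weighted counting; a sketch (the pseudocount value is an arbitrary choice):

```python
def m_step(words, Z, pseudo=1.0):
    """M-step: re-estimate lambda, M, B from expected memberships Z.

    c_jk = expected count of letter k at motif position j (weight Z*_i1);
    c_0k = expected background count of letter k (weight Z*_i2 = 1 - Z*_i1).
    Pseudocounts keep every probability nonzero."""
    K = len(words[0])
    letters = "ACGT"
    c_motif = [{a: pseudo for a in letters} for _ in range(K)]
    c_back = {a: pseudo for a in letters}
    for w, z in zip(words, Z):
        for j, ch in enumerate(w):
            c_motif[j][ch] += z        # motif-weighted count
            c_back[ch] += 1.0 - z      # background-weighted count
    lam_new = sum(Z) / len(Z)          # expected fraction of motif words
    M = [{a: c[a] / sum(c.values()) for a in letters} for c in c_motif]
    B = {a: c_back[a] / sum(c_back.values()) for a in letters}
    return lam_new, M, B

lam, M, B = m_step(["AAA", "CCC", "AAC"], [0.9, 0.1, 0.5])
print(round(lam, 2), round(M[0]["A"], 2))
```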
Initial Parameters Matter!
Consider the following artificial example:
6-mers X1, …, Xn  (n = 2000):
  990 words “AAAAAA”,  990 words “CCCCCC”,  20 words “ACACAC”
Some local maxima:
  λ = 49.5%;  B = 100/101 C, 1/101 A;  M = 100% AAAAAA
  λ = 1%;  B = 50% C, 50% A;  M = 100% ACACAC
Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ:
   Try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
   a. Expectation
   b. Maximization
3. Until change in θ = (M, B), λ falls below ε
4. Report results for several “good” λ
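The overview above, assembled into one self-contained toy EM motif finder (a sketch only: the initialization bias toward the first word, the iteration count, and the pseudocounts are all arbitrary choices, and real finders add many refinements):

```python
def em_motif(words, K, lam=0.05, iters=50, pseudo=0.5):
    """Toy EM motif finder over K-long words: alternate the E-step
    (posterior P[word is motif]) and M-step (re-estimate M, B, lambda)."""
    letters = "ACGT"
    # Initialization: near-uniform profile, nudged toward the first word.
    M = [{a: 0.25 for a in letters} for _ in range(K)]
    for j, ch in enumerate(words[0]):
        M[j][ch] = 0.4
        s = sum(M[j].values())
        M[j] = {a: v / s for a, v in M[j].items()}
    B = {a: 0.25 for a in letters}
    for _ in range(iters):
        # E-step: expected memberships.
        Z = []
        for w in words:
            pm, pb = lam, 1.0 - lam
            for j, ch in enumerate(w):
                pm *= M[j][ch]
                pb *= B[ch]
            Z.append(pm / (pm + pb))
        # M-step: weighted counts with pseudocounts.
        cm = [{a: pseudo for a in letters} for _ in range(K)]
        cb = {a: pseudo for a in letters}
        for w, z in zip(words, Z):
            for j, ch in enumerate(w):
                cm[j][ch] += z
                cb[ch] += 1.0 - z
        lam = sum(Z) / len(Z)
        M = [{a: c[a] / sum(c.values()) for a in letters} for c in cm]
        B = {a: cb[a] / sum(cb.values()) for a in letters}
    return M, B, lam

# Planted toy data: 10 copies of the "motif" ACG among varied background words.
words = ["ACG"] * 10 + ["TTT", "GGA", "CAT", "TGC", "ATA"] * 6
M, B, lam = em_motif(words, K=3)
print("".join(max(col, key=col.get) for col in M), round(lam, 2))
```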
Gibbs Sampling in Motif Finding
Gibbs Sampling
• Given: x1, …, xN; motif length K; background model B
• Find: model M
        locations a1,…, aN in x1, …, xN
maximizing the log-odds likelihood ratio:

  Σi=1..N Σk=1..K log ( M(k, xi[ai+k]) / B(xi[ai+k]) )
Gibbs Sampling
• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE
Algorithm (sketch):
1. Initialization:
   a. Select random locations in sequences x1, …, xN
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence xi
   b. Recalculate model
   c. Pick a new location of motif in xi according to the probability that
      the location is a motif occurrence
Gibbs Sampling
Initialization:
• Select random locations a1,…, aN in x1, …, xN
• For these locations, compute M:

  Mkj = (1 / (N + B)) ( βj + Σi=1..N [ xi[ai + k] = j ] )

where the βj are pseudocounts to avoid 0s, and B = Σj βj
• That is, Mkj is the number of occurrences of letter j in motif position k, over the total count (plus pseudocounts)
Gibbs Sampling
Predictive update:
• Select a sequence x = xi
• Remove xi; recompute the model:

  Mkj = (1 / (N − 1 + B)) ( βj + Σs=1..N, s≠i [ xs[as + k] = j ] )

where the βj are pseudocounts to avoid 0s, and B = Σj βj
Gibbs Sampling
Sampling:
For every K-long word x[j]…x[j+K−1] in x:
  Qj = Prob[ word | motif ] = M(1, x[j]) × … × M(K, x[j+K−1])
  Pj = Prob[ word | background ] = B(x[j]) × … × B(x[j+K−1])
Let
  Aj = (Qj / Pj) / Σj'=1..|x|−K+1 (Qj' / Pj')
Sample a random new position ai according to the probabilities A1,…, A|x|−K+1
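The sampling step can be sketched directly with `random.choices` (the profile and sequence below are toy values):

```python
import random

def sample_position(x, M, B, K, rng=random):
    """Pick a new motif start in x with probability A_j proportional
    to Q_j / P_j (motif likelihood over background likelihood)."""
    ratios = []
    for j in range(len(x) - K + 1):
        q = p = 1.0
        for k in range(K):
            q *= M[k][x[j + k]]   # Q_j = P[word | motif]
            p *= B[x[j + k]]      # P_j = P[word | background]
        ratios.append(q / p)
    # random.choices normalizes the weights, giving the A_j distribution
    return rng.choices(range(len(ratios)), weights=ratios)[0]

# Toy profile strongly favoring ACG; uniform background:
M = [{a: (0.97 if a == t else 0.01) for a in "ACGT"} for t in "ACG"]
B = {a: 0.25 for a in "ACGT"}
rng = random.Random(0)
draws = [sample_position("TTTACGTTT", M, B, K=3, rng=rng) for _ in range(100)]
print(draws.count(3))  # the true occurrence at position 3 dominates
```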
Gibbs Sampling
Running Gibbs Sampling:
1. Initialize
2. Run until convergence
3. Repeat 1,2 several times, report common motifs
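The whole loop, in a self-contained toy sampler (a sketch only: the planted sequences, pseudocounts, and iteration count are invented, and a real implementation like AlignACE adds convergence detection and multiple restarts):

```python
import random

def gibbs_motif(seqs, K, iters=200, pseudo=0.5, seed=0):
    """Toy Gibbs motif sampler: random starts, then repeatedly hold one
    sequence out, rebuild the profile from the rest (predictive update),
    and resample the held-out sequence's motif start."""
    rng = random.Random(seed)
    letters = "ACGT"
    # 0th-order background from overall letter frequencies (+1 pseudocounts)
    total = sum(len(s) for s in seqs)
    B = {a: (sum(s.count(a) for s in seqs) + 1) / (total + 4) for a in letters}
    starts = [rng.randrange(len(s) - K + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))
        # Profile from all sequences except i.
        M = [{a: pseudo for a in letters} for _ in range(K)]
        for i2, s in enumerate(seqs):
            if i2 != i:
                for k in range(K):
                    M[k][s[starts[i2] + k]] += 1.0
        M = [{a: c[a] / sum(c.values()) for a in letters} for c in M]
        # Resample the start in sequence i with weights Q_j / P_j.
        x = seqs[i]
        w = []
        for j in range(len(x) - K + 1):
            q = p = 1.0
            for k in range(K):
                q *= M[k][x[j + k]]
                p *= B[x[j + k]]
            w.append(q / p)
        starts[i] = rng.choices(range(len(w)), weights=w)[0]
    return starts

seqs = ["TTTTGACGTT", "AGACGTTTTT", "TTTTTGACGT", "TGACGTTTTT"]
print(gibbs_motif(seqs, K=5))  # ideally the starts of the planted GACGT copies
```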
Advantages / Disadvantages
• Very similar to EM
Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics
Disadvantages:
• More dependent on all sequences to exhibit the motif
• Less systematic search of initial parameter space
Repeats, and a Better Background Model
• Repeat DNA can be confused with motifs, especially low-complexity repeats (CACACA…, AAAAA, etc.)
Solution:
more elaborate background model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK);  X, bi ∈ {A, C, G, T} }
Has been applied to EM and Gibbs (up to 3rd order)
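A higher-order background is just a table of conditional letter probabilities; a sketch of a general k-th order estimator with +1 pseudocounts (the toy repeat sequence is invented):

```python
from collections import defaultdict

def markov_background(seqs, order):
    """Estimate a k-th order background model:
    P[x | b1...bk] for every observed k-long context,
    with +1 pseudocounts over the 4-letter alphabet."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0
    model = {}
    for ctx, c in counts.items():
        tot = sum(c.values()) + 4.0
        model[ctx] = {a: (c[a] + 1.0) / tot for a in "ACGT"}
    return model

B1 = markov_background(["CACACACACA"], order=1)
# After a C, an A is far more likely than under a 0th-order model,
# so CA-repeats score as expected background rather than as motif:
print(B1["C"]["A"], B1["C"]["T"])
```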
Limits of Motif Finders
• Given upstream regions of co-regulated genes:
  Increasing length makes motif finding harder – random motifs clutter the true ones
  Decreasing length makes motif finding harder – true motif missing in some sequences
Motif Challenge problem:
  Find a (15,4)-motif in N sequences of length 600
Example Application: Motifs in Yeast
Group:
Tavazoie et al. 1999, G. Church’s lab, Harvard
Data:
• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)
• 15 time points across two cell cycles
1. Clustering genes according to common expression
   • K-means clustering → 30 clusters, 50–190 genes/cluster
   • Clusters correlate well with known function
2. AlignACE motif finding
   • 600-bp upstream regions
Motifs in Periodic Clusters
Motifs in Non-periodic Clusters
Motifs are preferentially conserved across evolution
[Four-species alignment of the GAL1–GAL10 promoter region: S. cerevisiae (Scer), S. paradoxus (Spar), S. mikatae (Smik), S. bayanus (Sbay), with asterisks marking conserved positions. The GAL4, MIG1, and TBP factor footprints coincide with conservation islands between GAL10 and GAL1.]
slide credits: M. Kellis
Is this enough to discover motifs? No.
Comparison-based Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
slide credits: M. Kellis
Known motifs are frequently conserved
• Across the human promoter regions, the Erra motif:
  appears 434 times
  is conserved 162 times (in alignments of human, dog, mouse, and rat)
  Conservation rate: 37%
• Compare to random control motifs:
  – Conservation rate of control motifs: 6.8%
  – Erra enrichment: 5.4-fold
  – Erra p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
slide credits: M. Kellis
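The quoted significance can be sanity-checked with a normal approximation to the binomial, using the slide's numbers:

```python
import math

def conservation_z(total, conserved, control_rate):
    """z-score of the observed conserved count under a binomial null
    with the control motifs' conservation rate."""
    mean = total * control_rate
    sd = math.sqrt(total * control_rate * (1.0 - control_rate))
    return (conserved - mean) / sd

# Erra: 434 occurrences, 162 conserved; control conservation rate 6.8%
z = conservation_z(434, 162, 0.068)
print(round(z, 1))  # about 25 standard deviations, matching the slide
```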
Finding conserved motifs in whole genomes
(M. Kellis PhD thesis on yeasts; X. Xie & M. Kellis on mammals)
1. Define seed “mini-motifs”
2. Filter and isolate mini-motifs that are more conserved than average
3. Extend mini-motifs to full motifs
4. Validate against known databases of motifs & annotations
5. Report novel motifs
slide credits: M. Kellis
Test 1: Intergenic conservation
[Scatter plot: conserved count vs. total count for each mini-motif; CGG-11-CCG stands out as unusually conserved]
slide credits: M. Kellis
Test 2: Intergenic vs. Coding
[Scatter plot: intergenic conservation vs. coding conservation. Most patterns show higher conservation in genes; CGG-11-CCG stands out as preferentially conserved in intergenic regions]
slide credits: M. Kellis
Test 3: Upstream vs. Downstream
[Scatter plot: upstream conservation vs. downstream conservation. Most patterns are preferentially conserved upstream; a few outliers suggest downstream motifs. CGG-11-CCG is upstream-conserved]
slide credits: M. Kellis
Constructing full motifs
[Pipeline diagram: ~2,000 mini-motifs passing Tests 1–3 → Extend → Collapse → Merge → 72 full motifs]
slide credits: M. Kellis
Summary for promoter motifs
Rank  Discovered motif  Known TF motif  Tissue enrichment  Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR New Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR New Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1(*) Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG New Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY New Yes Yes
22 TGCGCAnK New Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
• 174 promoter motifs:
  70 match known TF motifs
  115 show expression enrichment
  60 show positional bias
  75% have some evidence
• Control sequences:
  < 2% match known TF motifs
  < 5% expression enrichment
  < 3% show positional bias
  → < 7% false positives
Most discovered motifs are likely to be functional
slide credits: M. Kellis