Page 1: Prediction of Regulatory Elements Controlling  Gene Expression

1

Prediction of Regulatory Elements Controlling

Gene Expression

Martin Tompa

Computer Science & Engineering

Genome Sciences

University of Washington

Seattle, Washington, U.S.A.

Page 2: Prediction of Regulatory Elements Controlling  Gene Expression

2

Outline

• Regulation of genes

• Motif discovery by overrepresentation

– MEME

– Gibbs sampling

• Motif discovery by phylogenetic footprinting

– FootPrinter

– MicroFootPrinter

Page 4: Prediction of Regulatory Elements Controlling  Gene Expression

4

DNA, Genes, and Proteins

DNA: program for cell processes

Proteins: execute cell processes

[Figure: DNA sequence fragments TCCAA, CGGTGC, TGAGGT, GCAC, with labels Gene, Protein, and DNA]

Page 5: Prediction of Regulatory Elements Controlling  Gene Expression

5

Regulation of Genes

• What turns genes on (producing a protein) and off?

• When is a gene turned on or off?

• Where (in which cells) is a gene turned on?

• At what rate is the gene product produced?

Page 6: Prediction of Regulatory Elements Controlling  Gene Expression

6

Regulation of Genes

[Figure: DNA with labels Gene, Regulatory Element, Transcription Factor (Protein), and RNA polymerase (Protein)]

Page 7: Prediction of Regulatory Elements Controlling  Gene Expression

7

Regulation of Genes

[Figure: DNA with labels Regulatory Element, Gene, Transcription Factor (Protein), and RNA polymerase (Protein)]

Page 8: Prediction of Regulatory Elements Controlling  Gene Expression

8

Regulation of Genes

[Figure: DNA with labels Regulatory Element, Gene, Transcription Factor (Protein), RNA polymerase (Protein), and New protein]

Page 9: Prediction of Regulatory Elements Controlling  Gene Expression

9

Goal

Identify regulatory elements in DNA sequences. These are:

• Binding sites for proteins

• Short sequences (5-25 nucleotides)

• Up to 1000 nucleotides (or farther) from gene

• Inexactly repeating patterns (“motifs”)

Page 11: Prediction of Regulatory Elements Controlling  Gene Expression

11

2 Types of Motif Discovery

1. Motif discovery by overrepresentation

• One species

• Multiple (co-regulated) genes

2. Motif discovery by phylogenetic footprinting

• Multiple species

• One gene

Page 12: Prediction of Regulatory Elements Controlling  Gene Expression

12

Overrepresentation: Daf-19 Binding Sites in C. elegans

[Figure: upstream regions (positions -150 to -1) of the genes che-2, daf-19, osm-1, osm-6, and F02D8.3, with the binding sites marked]

GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC

Page 13: Prediction of Regulatory Elements Controlling  Gene Expression

13

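Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene

[Figure: upstream regions (positions -200 to -1) of the growth hormone gene in Chicken, Rat, Human, Dog, and Sheep, with the conserved element marked]

AGGGGATA
AGGGTATA
AGGGTATA
AGGGTATA
AGGGTATA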

Page 15: Prediction of Regulatory Elements Controlling  Gene Expression

15

MEME

• (Multiple EM for Motif Elicitation)

Bailey & Elkan, 1995

• Very general iterative method based on Expectation Maximization

• Available at meme.sdsc.edu/meme/website/intro.html

Page 16: Prediction of Regulatory Elements Controlling  Gene Expression

16

Overrepresented Motifs

• Given sequences $X = \{X_1, X_2, \dots, X_n\}$, find statistically overrepresented motifs of length $k$

• For simplicity, assume

– Exactly one motif instance per sequence

– Sequences over DNA alphabet

Page 17: Prediction of Regulatory Elements Controlling  Gene Expression

17

Hidden Information

• $Z = \{Z_{ij}\}$, where

\[ Z_{ij} = \begin{cases} 1, & \text{if a motif instance starts at position } j \text{ of } X_i \\ 0, & \text{otherwise} \end{cases} \]

• Iterate over probabilistic models that could generate $X$ and $Z$, trying to converge on this solution

Page 18: Prediction of Regulatory Elements Controlling  Gene Expression

18

Model Parameters

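• Motif profile: $4 \times k$ matrix $\theta = (\theta_{rp})$, with $r \in \{A, C, G, T\}$ and $1 \le p \le k$

$\theta_{rp} = \Pr(\text{residue } r \text{ in position } p \text{ of motif})$

• Background distribution:

$\theta_{r0} = \Pr(\text{residue } r \text{ in random nonmotif position})$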

Page 19: Prediction of Regulatory Elements Controlling  Gene Expression

19

Profile Example

Motif instances:

GTTGTC
GTTTCC
GCTACC
GTTACC
GTTTCC

Profile θ (rows: residue r; columns: motif position p):

     1    2    3    4    5    6
A    0    0    0   .4    0    0
C    0   .2    0    0   .8    1
G    1    0    0   .2    0    0
T    0   .8    1   .4   .2    0
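As a minimal sketch of how such a profile can be computed, the following counts residues column by column over the five instances on this slide. The function name `build_profile` is illustrative only and is not taken from MEME or any other tool.

```python
from collections import Counter

def build_profile(instances, alphabet="ACGT"):
    """Return {residue: [frequency at each motif position]} for equal-length instances."""
    k = len(instances[0])
    n = len(instances)
    profile = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        # Count which residue each instance has at motif position p.
        counts = Counter(seq[p] for seq in instances)
        for r in alphabet:
            profile[r][p] = counts[r] / n
    return profile

instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
profile = build_profile(instances)
for r in "ACGT":
    print(r, " ".join(f"{x:.1f}" for x in profile[r]))
```

Each printed row should agree with the corresponding row of the profile θ on this slide.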

Page 20: Prediction of Regulatory Elements Controlling  Gene Expression

20

Overview: Expectation Maximization

• Goal: Find profile θ and motif positions Z that have maximum likelihood

• At each iteration:

– E-step: From θ predict likely motif positions Z

– M-step: From sequences at positions Z compute new profile θ

Page 21: Prediction of Regulatory Elements Controlling  Gene Expression

21

Expectation Maximization

• Goal: Find $\theta$, $Z$ that maximize $\Pr(X, Z \mid \theta)$

• At iteration $t$:

– E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$

– M-step: Find $\theta^{(t+1)}$ that maximizes $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$

Page 22: Prediction of Regulatory Elements Controlling  Gene Expression

22

E-step Details

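\[ Z_{ij}^{(t)} = \frac{\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'} = 1, \theta^{(t)})} \]

[Figure: sequence $X_i$ with a motif window starting at position $j$; positions inside the window use $\theta_1^{(t)}, \theta_2^{(t)}, \dots, \theta_k^{(t)}$, and positions outside the window use $\theta_0^{(t)}$]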
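The E-step above can be sketched for a single sequence as follows. This is an illustration under stated assumptions, not MEME's implementation: the function name `e_step_one_sequence`, the dictionary layout of the profile, and the toy parameters are all hypothetical, and every probability is assumed to be strictly positive (for example through pseudocounts) so that the logarithms are defined.

```python
import math

def e_step_one_sequence(x, motif, background):
    """Posterior Z_ij over motif start positions j of one sequence x.

    motif: {residue: [probability at motif positions 1..k]}
    background: {residue: probability at a nonmotif position}
    """
    k = len(next(iter(motif.values())))
    # Log-likelihood of x if every position were a background position.
    log_bg_all = sum(math.log(background[r]) for r in x)
    log_lik = []
    for j in range(len(x) - k + 1):
        ll = log_bg_all
        for p in range(k):
            r = x[j + p]
            # Inside the window, swap the background term for the motif term.
            ll += math.log(motif[r][p]) - math.log(background[r])
        log_lik.append(ll)
    # Normalize over all start positions (the denominator of the formula).
    m = max(log_lik)
    weights = [math.exp(ll - m) for ll in log_lik]
    total = sum(weights)
    return [w / total for w in weights]

# Toy parameters (assumed for illustration): a length-2 motif favoring "AC".
motif = {"A": [0.7, 0.1], "C": [0.1, 0.7], "G": [0.1, 0.1], "T": [0.1, 0.1]}
background = {r: 0.25 for r in "ACGT"}
z = e_step_one_sequence("GGACGG", motif, background)
print([round(v, 3) for v in z])
```

Working in log space avoids numerical underflow on long sequences; the returned weights sum to 1 because exactly one motif instance per sequence is assumed.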

Page 23: Prediction of Regulatory Elements Controlling  Gene Expression

23

M-step Details

• If $Z_{ij}^{(t)} \in \{0, 1\}$ it would be straightforward: calculate profile $\theta_1, \theta_2, \dots, \theta_k$ from motif instances and $\theta_{r0}$ from frequency of $r$ outside of motif instances.

• But $Z_{ij}^{(t)} \in [0, 1]$, so weight these frequencies by the appropriate values of $Z_{ij}^{(t)}$.
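A minimal sketch of this weighted re-estimation is shown below. The function name `m_step` and the `pseudocount` parameter are assumptions added for illustration (the slide does not say how zero counts are handled), and each row of `Z` is assumed to sum to 1, reflecting one motif instance per sequence.

```python
def m_step(seqs, Z, k, alphabet="ACGT", pseudocount=0.01):
    """Re-estimate the motif profile and background from weights Z.

    Z[i][j] is the weight that the motif starts at position j of seqs[i].
    Returns (motif, background) with motif = {residue: [prob per position]}.
    """
    motif_counts = {r: [pseudocount] * k for r in alphabet}
    bg_counts = {r: pseudocount for r in alphabet}
    for x, z in zip(seqs, Z):
        # Start by treating every residue as background ...
        for r in x:
            bg_counts[r] += 1.0
        for j, w in enumerate(z):
            for p in range(k):
                r = x[j + p]
                motif_counts[r][p] += w
                # ... then move the expected motif residues out of the background.
                bg_counts[r] -= w
    motif = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        col_total = sum(motif_counts[r][p] for r in alphabet)
        for r in alphabet:
            motif[r][p] = motif_counts[r][p] / col_total
    bg_total = sum(bg_counts.values())
    background = {r: bg_counts[r] / bg_total for r in alphabet}
    return motif, background

# Hard 0/1 weights reduce to plain counting, as in the first bullet.
seqs = ["GGACGG", "TTACTT"]
Z = [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0]]
motif, background = m_step(seqs, Z, k=2, pseudocount=0.0)
print(motif["A"], motif["C"], background)
```

With fractional weights the same code applies unchanged: each window contributes to the profile in proportion to its weight.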

Page 25: Prediction of Regulatory Elements Controlling  Gene Expression

25

Gibbs Sampler

• Lawrence et al., 1993

• Very general iterative method, related to Markov Chain Monte Carlo (MCMC)

• Available at bayesweb.wadsworth.org/gibbs/gibbs.html

Page 28: Prediction of Regulatory Elements Controlling  Gene Expression

28

One Iteration of Gibbs Sampler

• n motif instances each of length k

• Remove one at random

• Form profile of remaining n-1

• Let $p_i$ be the probability with which g[i .. i+k-1] fits profile

• Choose to start replacement at i with probability proportional to $p_i$

GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG

CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG

GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG

GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG

GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG

[Figure: candidate start position i marked on one of the sequences]
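The iteration described on this slide can be sketched as follows. This is an illustration rather than the published Gibbs sampler: the function names, the pseudocount used when forming the profile, the motif length k = 8, and the number of iterations are all assumptions, and g is taken to be the sequence whose instance was removed.

```python
import math
import random

def profile_from(instances, k, alphabet="ACGT", pseudocount=1.0):
    """Column-wise residue frequencies of the instances, smoothed by a pseudocount."""
    n = len(instances)
    denom = n + pseudocount * len(alphabet)
    return {
        r: [(sum(s[p] == r for s in instances) + pseudocount) / denom for p in range(k)]
        for r in alphabet
    }

def gibbs_iteration(seqs, starts, k, rng):
    """Resample the motif start in one randomly chosen sequence (updates starts in place)."""
    h = rng.randrange(len(seqs))  # remove one instance at random
    others = [seqs[i][starts[i]:starts[i] + k] for i in range(len(seqs)) if i != h]
    profile = profile_from(others, k)  # profile of the remaining n-1 instances
    g = seqs[h]
    # p_i: probability with which g[i .. i+k-1] fits the profile.
    weights = [
        math.prod(profile[g[i + p]][p] for p in range(k))
        for i in range(len(g) - k + 1)
    ]
    # Choose the replacement start with probability proportional to p_i.
    starts[h] = rng.choices(range(len(weights)), weights=weights)[0]
    return h

seqs = [
    "GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAG",
    "CACGGGGGAGCCTGGAGGGGATCCGGAGGGGTG",
    "GGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAG",
    "GGAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGG",
    "GGCTGGGGTGGCGGTGGGAGCCCAGGACGTTG",
]
k = 8  # assumed motif length; the slide does not state it
rng = random.Random(0)
starts = [rng.randrange(len(s) - k + 1) for s in seqs]
for _ in range(200):
    gibbs_iteration(seqs, starts, k, rng)
print([s[i:i + k] for s, i in zip(seqs, starts)])
```

Sampling the new start, rather than always taking the best-fitting one, is what lets the procedure escape poor initial choices.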

Page 30: Prediction of Regulatory Elements Controlling  Gene Expression

30

FootPrinter

• Blanchette & Tompa, 2002

• First algorithm explicitly designed for phylogenetic footprinting

• Available at bio.cs.washington.edu/software.html

Page 32: Prediction of Regulatory Elements Controlling  Gene Expression

32

Phylogenetic Footprinting (Tagle et al. 1988)

Functional regions of DNA evolve more slowly than nonfunctional ones.

• Consider a set of orthologous (i.e., corresponding) sequences from different species

• Identify unusually well conserved substrings (i.e., ones that have not changed much over the course of evolution)

Page 33: Prediction of Regulatory Elements Controlling  Gene Expression

33

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Page 34: Prediction of Regulatory Elements Controlling  Gene Expression

34

FootPrinter

• Inputs:

– evolutionary tree T

– corresponding regulatory regions at leaves

• Output: motifs well conserved w.r.t. T.

Page 35: Prediction of Regulatory Elements Controlling  Gene Expression

35

Finding Short Motifs

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

Page 36: Prediction of Regulatory Elements Controlling  Gene Expression

36

Most Parsimonious Solution

“Parsimony score”: 1 mutation

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...

[Figure: phylogenetic tree over the five sequences, with internal nodes labeled ACGG, ACGT, ACGT, and ACGT]

Page 37: Prediction of Regulatory Elements Controlling  Gene Expression

37

Substring Parsimony Problem

Given:

• phylogenetic tree T,

• set of orthologous sequences at leaves of T,

• length k of motif,

• threshold d

Problem:

• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.

This problem is NP-hard.

Page 38: Prediction of Regulatory Elements Controlling  Gene Expression

38

FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)

$W_u[s]$ = best parsimony score for subtree rooted at node $u$, if $u$ is labeled with string $s$.

AGTCGTACGTG

ACGGGACGTGC

ACGTGAGATAC

GAACGGAGTAC

TCGTGACGGTG

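[Figure: phylogenetic tree with these sequences at the leaves. Each node carries a table of $4^k$ entries, one per string $s$. Internal-node tables shown: (… ACGG: 2, ACGT: 1 …), (… ACGG: 0, ACGT: 2 …), (… ACGG: 1, ACGT: 1 …), (… ACGG: 1, ACGT: 0 …). Leaf tables hold 0 or $+\infty$, for example (… ACGG: $+\infty$, ACGT: 0 …) and (… ACGG: 0, ACGT: $+\infty$ …).]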

Page 39: Prediction of Regulatory Elements Controlling  Gene Expression

39

\[ W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr) \]

Running Time

• $n$: number of species

• $l$: average sequence length

• $k$: motif length

Total time $O(n k (4^k + l))$
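The recurrence can be sketched directly on the five sequences from the "Finding Short Motifs" slide. This is a naive evaluation (it compares every pair of strings s and t, so it does not achieve the running time quoted above), and it is not FootPrinter's code. The tree topology, the function names, and the use of Hamming distance for d(s, t) are assumptions made for illustration; the slides do not give the tree shape in text form.

```python
from itertools import product

def hamming(s, t):
    """Number of positions at which equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_table(node, k, all_kmers):
    """Return W_node as {s: best parsimony score of the subtree if node is labeled s}.

    A node is either a leaf sequence (str) or a tuple of child nodes.
    """
    if isinstance(node, str):
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        # A leaf can only be labeled with a k-mer it actually contains.
        return {s: (0 if s in present else float("inf")) for s in all_kmers}
    W = dict.fromkeys(all_kmers, 0)
    for child in node:
        Wc = parsimony_table(child, k, all_kmers)
        finite = [(t, w) for t, w in Wc.items() if w != float("inf")]
        for s in all_kmers:
            # Sum over children of the cheapest child label t plus mutations d(s, t).
            W[s] += min(w + hamming(s, t) for t, w in finite)
    return W

k = 4
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit, mouse, rat = "ACGTGAGATACGT", "GAACGGAGTACGT", "TCGTGACGGTGAT"
# Assumed topology, chosen only for illustration.
tree = ((human, chimp), (rabbit, (mouse, rat)))
W_root = parsimony_table(tree, k, all_kmers)
best = min(W_root.values())
print(best, sorted(s for s, w in W_root.items() if w == best))
```

On these truncated sequences the minimum at the root is 1, which agrees with the "1 mutation" parsimony score shown on the "Most Parsimonious Solution" slide.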

Page 40: Prediction of Regulatory Elements Controlling  Gene Expression

40

Improvements

• Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$

• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)

• Amenable to many useful extensions (e.g., allow insertions and deletions)

Page 41: Prediction of Regulatory Elements Controlling  Gene Expression

41

Application to β-actin Gene

Gilthead sea bream (678 bp)

Medaka fish (1016 bp)

Common carp (696 bp)

Grass carp (917 bp)

Chicken (871 bp)

Human (646 bp)

Rabbit (636 bp)

Rat (966 bp)

Mouse (684 bp)

Hamster (1107 bp)

Page 42: Prediction of Regulatory Elements Controlling  Gene Expression

42

Common carp

ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG

AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA

GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

Chicken

ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG

TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA

GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA

TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT

Human

GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT

TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT

GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC

CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG

Parsimony score over 10 vertebrates: 0 1 2

Page 43: Prediction of Regulatory Elements Controlling  Gene Expression

43

Motifs Absent from Some Species

• Find motifs

– with small parsimony score

– that span a large part of the tree

• Example: in tree of 10 species spanning 760 Myrs, find all motifs with

– score 0 spanning at least 250 Myrs

– score 1 spanning at least 350 Myrs

– score 2 spanning at least 450 Myrs

– score 3 spanning at least 550 Myrs

Page 44: Prediction of Regulatory Elements Controlling  Gene Expression

44

Application to c-fos Gene

Asked for motifs of length 10, with

• 0 mutations over tree of size 6

• 1 mutation over tree of size 11

• 2 mutations over tree of size 16

• 3 mutations over tree of size 21

• 4 mutations over tree of size 26

[Figure: phylogenetic tree of Puffer fish, Chicken, Pig, Mouse, Hamster, and Human, with branch lengths 10, 2, 7, 2, 2, 21, 0, 1, 1]

Found:

• 0 mutations over tree of size 8

• 1 mutation over tree of size 16

• 3 mutations over tree of size 21

• 4 mutations over tree of size 28

Page 45: Prediction of Regulatory Elements Controlling  Gene Expression

45

Application to c-fos Gene

Motif                 Score   Conserved in           Known?
CAGGTGCGAATGTTC       0       4 mammals
TTCCCGCCTCCCCTCCCC    0       4 mammals              yes
GAGTTGGCTGcagcc       3       puffer + 4 mammals
GTTCCCGTCAATCcct      1       chicken + 4 mammals    yes
CACAGGATGTcc          4       all 6                  yes
AGGACATCTG            1       chicken + 4 mammals    yes
GTCAGCAGGTTTCCACG     0       4 mammals              yes
TACTCCAACCGC          0       4 mammals

metK in B. subtilis

Page 47: Prediction of Regulatory Elements Controlling  Gene Expression

47

MicroFootPrinter

• Neph & Tompa, 2006

• Designed specifically for phylogenetic footprinting in prokaryotic genomes

• Front end to FootPrinter

• Available at bio.cs.washington.edu/software.html

Page 48: Prediction of Regulatory Elements Controlling  Gene Expression

48

Microbial Footprinting

• 1454 prokaryotes with genomes completely sequenced (as of 2/17/2011)

– For any prokaryotic gene of interest, plenty of close genes in other species available

– Relatively simple genomes

• MicroFootPrinter

– undergraduate Computational Biology Capstone project

– Goal: simple interface for microbiologists

– User specifies species and gene of interest

– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

Page 49: Prediction of Regulatory Elements Controlling  Gene Expression

49

Demo

• MicroFootPrinter home

• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)

– chvI (two component response regulator)

– ropB (outer membrane protein)

Page 50: Prediction of Regulatory Elements Controlling  Gene Expression

50

Sample chvI motif

Parsimony score: 2
Span: 41.10
Significance score: 4.22

B. henselae        -151   GCTACAATTT
R. etli             -90   GCCACAATTT
R. leguminosarum   -106   GCCACAATTT
S. meliloti        -119   GCCACAATTT
S. medicae         -118   GCCACAATTT
A. tumefaciens     -105   GCCACAATTT
M. loti             -80   GCCACATTTT
M. sp.              -87   GCCACATTTT
O. anthropi        -158   GCCACATTTT
B. suis             -38   GCCACATTTT
B. melitensis      -156   GCCACATTTT
B. abortus         -156   GCCACATTTT
B. ovis            -156   GCCACATTTT
B. canis            -38   GCCACATTTT

Page 51: Prediction of Regulatory Elements Controlling  Gene Expression

51

Sample ropB motif

Parsimony score: 1
Span: 20.70
Significance score: 1.34

Jannaschia sp.     -151   CACATTTTGG
R. etli            -134   CACAATTTGG
R. leguminosarum   -135   CACAATTTGG
A. tumefaciens     -131   CACATTTTGG
S. meliloti        -128   CACATTTTGG
S. medicae         -128   CACATTTTGG

Page 52: Prediction of Regulatory Elements Controlling  Gene Expression

52

Combined ChvI Motif

ropB:         CACATTTTGG
chvI:       GCCACAATTT
Atu1221:  TTGTCACAAT

ultimate:   GYCACAWTTTGG

Y={C,T}

W={A,T}
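As a small illustration of how the degenerate symbols Y and W arise, the sketch below merges the three sites column by column using the standard IUPAC nucleotide codes. The alignment offsets (0 for Atu1221, 2 for chvI, 4 for ropB) are an assumption inferred from the shared CACA core, and the function name `consensus` is hypothetical.

```python
# Standard IUPAC nucleotide codes, keyed by the set of residues they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def consensus(sites_with_offsets):
    """Merge (offset, site) pairs into one IUPAC string; every column must be covered."""
    length = max(offset + len(site) for offset, site in sites_with_offsets)
    out = []
    for col in range(length):
        # Collect the residues that the aligned sites place in this column.
        residues = frozenset(
            site[col - offset]
            for offset, site in sites_with_offsets
            if 0 <= col - offset < len(site)
        )
        out.append(IUPAC[residues])
    return "".join(out)

# Offsets are assumed for illustration.
sites = [(0, "TTGTCACAAT"), (2, "GCCACAATTT"), (4, "CACATTTTGG")]
merged = consensus(sites)
print(merged)
```

Under these offsets the merged string is the motif on this slide preceded by the two extra leading bases contributed only by the Atu1221 site.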