co-modelling and conditional modelling observable unobservable goldman, thorne jones, 96 u c g a c...

26
Co-Modelling and Conditional Modelling Observabl e Observabl e Unobservable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 Pedersen, Meyer, Forsberg…, Simmonds 2004a,b McCauley …. Firth & Brown i . P ( SequenceStructure ) ii . P ( Structure ) ) ( ) ( ) ( ) ( Sequence P Sequence Structure P Structure P Structure Sequence P Conditional Modelling Needs: Footprinting -Signals (Blanchette) AGGTATATAATGCG..... P coding {ATG-->GTG} or AGCCATTTAGTGCG..... P non-coding {ATG-->GTG}

Upload: tyler-stephens

Post on 18-Jan-2018

215 views

Category:

Documents


0 download

DESCRIPTION

HMM Examples Simple Eukaryotic Simple Prokaryotic Gene Finding: Burge and Karlin, 1996 Intron length > 50 bp required for splicing Length distribution is not geometric

TRANSCRIPT

Page 1: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Co-Modelling and Conditional Modelling

Observable

Observable Unobservable

Unobservable

Goldman, Thorne & Jones, 96

UC G

AC

AU

AC

Knudsen.., 99

Eddy & co.

Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03Pedersen, Meyer, Forsberg…, Simmonds 2004a,b

McCauley ….

Firth & Brown

i. P(Sequence Structure)

ii. P(Structure)

)()(

)()(

SequencePSequenceStructureP

StructurePStructureSequenceP

• Conditional Modelling

Needs:Footprinting -Signals (Blanchette)

AGGTATATAATGCG..... Pcoding{ATG-->GTG} orAGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}

Page 2: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Ab Initio Gene predictionAb initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence. ....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg....

Input data

5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCGCTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 3'

5' 3'IntronExon UTR and intergenic sequence

Output:

Page 3: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

HMM ExamplesSimple Eukaryotic

Simple Prokaryotic

Gene Finding:Burge and Karlin, 1996

•Intron length > 50 bp required for splicing

•Length distribution is not geometric

Page 4: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Hidden Markov Models

• The marginal distribution of the Hi’s are described by a Homogenous Markov Chain: pi,j = P(Hk=i,Hk+1=j)• Let i =P{H1=i) - often i is the equilibrium distribution of the Markov Chain. • Conditional on Hk (all k), the Ok are independent.

e(i, j) P{Ok i Hk j)•The distribution of Ok only depends on the value of Hi and is called the emit function

O1 O2 O3 O4 O5 O6 O7 O8

O9 O10H1

H2

H3

• (O1,H1), (O2,H2),……. (On,Hn) is a sequence of stochastic variables with 2 components - one that is observed (Oi) and one that is hidden (Hi).

Page 5: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

What is the probability of the data?

PO5 iH5 2 P(O5 i H5 2) PO4

H4 j

H4 j p j,i

O1 O2 O3 O4 O5 O6 O7 O8 O9 O10

H1H2H3

P(O ) P(

H

O

H )P(

H )The probability of the observed is

, which could be hard to calculate. However, these calculations can be considerably accelerated. Let the probability of the observations (O1,..Ok) conditional on Hk=j. Following recursion will be obeyed:

POk iH k j

i. POk iH k j P(Ok i Hk j) POk 1

H k 1 r

H k 1 r pr,i

ii. PO1 iH1 j P(O1 i H1 j) j (initial condition)

iii. P(O) POn iHn j

Hn j

Page 6: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

What is the most probable ”hidden” configuration?

H61 max j{H6

1 * p j,1 *e(O6,1)}O1 O2 O3 O4 O5 O6 O7 O8 O9 O10H1

H2H3

Let be the sequences of hidden states in the most probably hidden path ie ArgMaxH[ ]. Let be the probability of the most probable path up to k ending in hidden state j.

P{O H}

Hkj

H*

ii. Hkj maxi{Hk 1

i pi, j}e(Ok, j)

i. H1j je(O1,1)

iii. Hk 1* {i Hk 1

i pi, je(Ok, j) HkH k

*

}

The actual sequence of hidden states can be found recursively by

Hk*

H5* {i H5

i * pi,1 *e(O6,1) H61}

Again recursions can be found:

Page 7: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Qkj P(Ok1 Hk1 i)p j,iQk1

i

H k1 i

What is the probability of specific ”hidden” state?

O1 O2 O3 O4 O5 O6 O7 O8 O9 O10

H1H2H3

P{H5 2) P52Q5

2 /P(O)

Let be the probability of the observations from k+1 to n given Hk=j. These will also obey recursions:

Qkj

P{O,Hk j) PkjQk

jThe probability of the observations and a specific hidden state can found as:

P{Hk j) PkjQk

j /P(O)And of a specific hidden state can found as:

Page 8: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

AGGTATATAATGCG..... Pcoding{ATG-->GTG} orAGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}

Comparative Gene Annotation

Page 9: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

ARs

WholeGenome

(Whole Genome – ARs)

Due to this work, people often say5% of the genome is constrained

~5% of the Human genome is under conservation(Chiaromonte et al.)

From Caleb Webber & Gerton Lunter

Page 10: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Consider lengths of inter-gap segments! Do they follow a geometric distribution?

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Percentage of Genome under Purifying Selection

Inter-gap distance (nucleotides)

Weighted regression: R2 > 0.9995

Log 1

0 counts

At most, only 0.09% of all ARs are under selection.

Log 1

0 counts

Inter-gap distance (nucleotides)

Overrepresentation of long inter-gap distances:Reduced indel rate due toindel-purifying selection

From Caleb Webber & Gerton Lunter

Page 11: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Finding Regulatory Signals in Genomes

Searching for unknown signal common to set of unrelated sequences

Searching for known signal in 1 sequence

Searching for conserved segments in homologous

Combining homologous and non-homologous analysis

Challenges

Merging Annotations

Predicting signal-regulatory protein relationships

mouse

pig human

Page 12: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Weight Matrices & Sequence Logos

Wasserman and Sandelin (2004) ‘Applied Bioinformatics for the Identification of Regulatory

Elements” Nature Review Genetics 5.4.276

1 2 3 4 5 6 7 8 9 10 11 12 13 141 G A C C A A A T A A G G C A2 G A C C A A A T A A G G C A3 T G A C T A T A A A A G G A4 T G A C T A T A A A A G G A5 T G C C A A A A G T G G T C6 C A A C T A T C T T G G G C7 C A A C T A T C T T G G G C8 C T C C T T A C A T G G G C

Set of signal sequences:

B R M C W A W H R W G G B MConsensus sequence:

A 0 4 4 0 3 7 4 3 5 4 2 0 0 4C 3 0 4 8 0 0 0 3 0 0 0 0 0 4G 2 3 0 0 0 0 0 0 1 0 6 8 5 0T 3 1 0 0 5 1 4 2 2 4 0 0 1 0

Position Frequency Matrix - PFM

corrected probability : p(b,i) fb,i s(b)

N s(b')b 'nucleo

fb,i b's in position i, s(b) pseudo count.

A -1.93 .79 .79 -1.93 .45 1.50 .79 .45 1.07 .79 .0 -1.93 -1.93 .79C .45 -1.93 .79 1.68 -1.93 -1.93 -1.93 .45 -1.93 -1.93 -1.93 -1.93 .0 .79G .0 .45 -1.93 -1.93 -1.93 -1.93 -1.93 -1.93 .66 -1.93 1.3 1.68 1.07 -1.93T .15 .66 -1.93 -1.93 1.07 .66 .79 .0 .79 -1.93 -1.93 -1.93 .66 -1.93 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Position Weight Matrix - PWM

PWM :Wb,i log2p(b,i)p(b)

T T G C A T A A G T A G T C.45 -.66 .79 1.66 .45 -.66 .79 .45 -.66 .79 .0 1.68 -.66 .79

Score for New Sequence

S Wb,il1

w

Sequence Logo & Information content

Di 2 pb,i log2 pb,ib

Page 13: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Motifs in Biological Sequences1990 Lawrence & Reilly “An Expectation Maximisation (EM) Algorithm for the identification and Characterization of Common Sites in Unaligned Biopolymer Sequences Proteins 7.41-51.

1992 Cardon and Stormo Expectation Maximisation Algorithm for Identifying Protein-binding sites with variable lengths from Unaligned DNA Fragments L.Mol.Biol. 223.159-170

1993 Lawrence… Liu “Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment” Science 262, 208-214.

(R,l)1

K

A=(a1,..,aK) – positions of the windows

w

j

RhjRh

w

j

Rhj

RhjA

jAcAARp1

)(

0

)(0

1

)()(00

1

1}{),,|(

Priors A has uniform prior

j has Dirichlet(N0) prior – base frequency in genome. N0 is pseudocounts 0.0 1.0

(,)

(,)

(,)

(,)

=(1,A,…,w,T) probability of different bases in the window

0=(A,..,T) – background frequencies of nucleotides.

Page 14: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Natural Extensions to Basic Model IMultiple Pattern Occurances in the same sequences:Liu, J. `The collapsed Gibbs sampler with applications to a gene regulation problem," Journal of the American Statistical Association 89 958-966.

width = w

length nLak

),,( 1 kaaA s)constraint pingnonoverlap(with )1()( 00kNk ppAP

Prior: any position i has a small probability p to start a binding site:

Modified from Liu

Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery Bioinformatics

Page 15: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Natural Extensions to Basic Model IICorrelated in Nucleotide Occurrence in Motif: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 6, 909-916.

Regulatory Modules:De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84

M1

M2

M3

Stop

Start12p

21p

Gene AGene B

Insertion-Deletion BALSA: Bayesian algorithm for local sequence alignment Nucl. Acids Res., 30 1268-77.

1

K

w1

w2

w3

w4

Page 16: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Combining Signals and other Data

Modified from Liu

1.Rank genes by E=log2(expression fold change)2.Find “many” (hundreds) candidate motifs3.For each motif pattern m, compute the vector Sm of matching scores

for genes with the pattern 4.Regress E on Sm

Yg mSmg g

Expresssion and Motif Regression: Integrating Motif Discovery and Expression Analysis Proc.Natl.Acad.Sci. 100.3339-44

Motifs Coding regions

ChIP-on-chip - 1-2 kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments Nature Biotechnology, 20, 835-39

Protein binding in neighborhood

Coding regions

Page 17: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Phylogenetic Footprinting (homologous detection)Blanchette and Tompa (2003) “FootPrinter: a program designed for phylogenetic footprinting” NAR 31.13.3840-

Dibegin min{Di,

begin d(i,)}

Disignal,1 min{Di,

begin d(i,)}

Disignal, j min{Di,

signal, j 1 d(i,)}...

Diend min{Di,

end d(i,)}

Term originated in 1988 in Tagle et al. Blanchette et al.: For unaligned sequences related by phylogenetic tree, find all segments of length k with a history costing less than d. Motif loss an option.

begin

signal

end

Page 18: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

The Basics of Footprinting

•Many aligned sequences related by a known phylogeny:

n1positions

sequenc

esk

1

slow - rsfast - rf

HMM:•Two un-aligned sequences:

A

A

GC

T ATG

A-C

HMM:

Page 19: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

•Many un-aligned sequences related by a known phylogeny:• Conceptually simple, computationally hard• Dependent on a single alignment/no measure of uncertainty

sequenc

es

k

1

Alignment HMM

acgtttgaaccgag----

Signal HMM

Alignment HMM

sequenc

es

k

1 acgtttgaaccgag----

Statistical Alignment and Footprinting.

sequen

ces

k

1 acgtttgaaccgag----

Solution:

Cartesian Product of HMMs

Page 20: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

SAPF - Statistical Alignment and Phylogenetic Footprinting

Signal HMM

Alignment HMM

12

Target

Sum out

Annotate

Page 21: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

BigFoothttp://www.stats.ox.ac.uk/research/genome/software

• Dynamical programming is too slow for more than 4-6 sequences

• MCMC integration is used instead – works until 10-15 sequences

• For more sequences other methods are needed.

Page 22: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

FSA - Fast Statistical Alignment Pachter, Holmes & Co

http://math.berkeley.edu/~rbradley/papers/

manual.pdf

12

k

1

3

k

2

4

Spanning tree

Additional edges

An edge – a pairwise alignment12

Data – k genomes/sequences:

1,3 2,3 3,4 3,k

12 2,k 1,4 4,k

Iterative addition of homology statements to shrinking alignment:

Add most certain homology statement from pairwise alignment compatible with present multiple alignment

i. Conflicting homology statements cannot be added

ii. Some scoring on multiple sequence homology statements is used.

Page 23: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Rate of Molecular Evolution versus estimated Selective Deceleration

Halpern and Bruno (1998) “Evolutionary Distances for Protein-Coding Sequences” MBE 15.7.910- & Moses et al.(2003) “Position specific variation in the rate fo evolution of transcription binding sites” BMC Evolutionary Biology 3.19-

A C G TA - qA,C qA,G qA,T

C qC,A - qC, G qC,T

G qG,A qG,C - qG,T

T qT,A qT,C qT,G -

Neutral Process

Neutral Equilibrium (A,C,G,T)

A C G TA - q’A,C q’A,G q’A,T

C q’C,A - q’C, G q’C,T G q’G,A q’G,C - q’G,T

T q’T,A q’T,C q’T,G -

Selected Process

Observed Equilibrium (A,C,G,

T)’

How much selection?Selection => deceleration

Page 24: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Signal Factor Prediction

http://jaspar.cgb.ki.se/ http://www.gene-regulation.com/

• Given set of homologous sequences and set of transcription factors (TFs), find signals and which TFs they bind to.

• Use PWM and Bruno-Halpern (BH) method to make TF specific evolutionary models

• Drawback BH only uses rates and equilibrium distribution

• Superior method: Infer TF Specific Position Specific evolutionary model

• Drawback: cannot be done without large scale data on TF-signal binding.

Page 25: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

Knowledge Transfer and Combining Annotations

mouse pig

human

Experimental observations

• Annotation Transfer

• Observed Evolution

Must be solvable by Bayesian Priors Each position pi probability of being j’th position in k’th TFBS If no experiment, low probability for being in TFBS

prio

r

1 experimentally annotated genome (Mouse)

Page 26: Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne  Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy  co. Meyer and Durbin

(Homologous + Non-homologous) detection

Wang and Stormo (2003) “Combining phylogenetic data with co-regulated genes to identify regulatory motifs” Bioinformatics 19.18.2369-80

Zhou and Wong (2007) Coupling Hidden Markov Models for discovery of cis-regulatory signals in multiple species Annals Statistics 1.1.36-65

genepromotor

Unrelated genes - similar expression

Related genes - similar expression

Combine above approaches

Combine “profiles”