Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Upload: melvin-robinson

Post on 01-Jan-2016


TRANSCRIPT

Page 1: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Motif Finding

PSSMs

Expectation Maximization

Gibbs Sampling

Page 2: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Complexity of Transcription

Page 3: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Representing Binding Sites for a TF

A set of sites represented as a consensus: VDRTWRWWSHD (IUPAC degenerate DNA)

A matrix describing a set of sites:

A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3

C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12

G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2

T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4

A single site: AAGTTAATGA

Set of binding sites:

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA

CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA

AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA

AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA


Page 4: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Nucleic acid codes

code description

A Adenine

C Cytosine

G Guanine

T Thymine

U Uracil

R Purine (A or G)

Y Pyrimidine (C, T, or U)

M C or A

K T, U, or G

W T, U, or A

S C or G

B C, T, U, or G (not A)

D A, T, U, or G (not C)

H A, T, U, or C (not G)

V A, C, or G (not T, not U)

N Any base (A, C, G, T, or U)

Page 5: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

From frequencies to log scores

f matrix:

A 5 0 1 0 0

C 0 2 2 4 0

G 0 3 1 0 4

T 0 0 1 1 1

w matrix:

A 1.6 -1.7 -0.2 -1.7 -1.7

C -1.7 0.5 0.5 1.3 -1.7

G -1.7 1.0 -0.2 -1.7 1.3

T -1.7 -1.7 -0.2 -0.2 -0.2

$$w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)}$$

Score of TGCTG = 0.9
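The w matrix on this slide can be reproduced numerically from the f matrix. The sketch below is one way to do it, under assumptions the slide does not state: a pseudocount of sqrt(N)·p(b) per base, a uniform background p(b) = 0.25, normalization by N + sqrt(N), and log base 2. These choices happen to match the printed values to one decimal place; the variable names are illustrative only.

```matlab
% Rows are A, C, G, T; columns are motif positions.
f = [5 0 1 0 0;
     0 2 2 4 0;
     0 3 1 0 4;
     0 0 1 1 1];
N = sum(f(:, 1));            % number of aligned sites (5)
p = 0.25;                    % assumed uniform background frequency
s = sqrt(N) * p;             % assumed pseudocount per base

% Corrected frequency divided by background, on a log2 scale.
w = log2(((f + s) / (N + sqrt(N))) / p);

% Score TGCTG by taking one entry per column (T=4, G=3, C=2).
rows = [4 3 2 4 3];
score = sum(w(sub2ind(size(w), rows, 1:5)));
fprintf('score(TGCTG) = %.1f\n', score);   % prints 0.9
```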

Page 6: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

TFs do not act alone

http://www.bioinformatics.ca/

Page 7: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSMs for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

Page 8: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSMs for Helix-Turn-Helix Motif

Page 9: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Promoter…

Page 10: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Promoter Weight Matrices (PWM)

Page 11: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

E.Coli PWMs

Page 12: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Motif Logo

Motifs can mutate on less important bases.

The five motifs at top right have mutations in positions 3 and 5.

Representations called motif logos illustrate the conserved regions of a motif.

http://weblogo.berkeley.edu

http://fold.stanford.edu/eblocks/acsearch.html

Position: 1234567

TGGGGGA

TGAGAGA

TGGGGGA

TGAGAGA

TGAGGGA

Page 13: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Example: Calmodulin-Binding Motif (calcium-binding proteins)

Page 14: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Sequence Motifs

• Motifs represent a short common sequence

– Regulatory motifs (TF binding sites)

– Functional site in proteins (DNA binding motif)

http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html

Page 15: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Regulatory Motifs

Transcription factors bind to regulatory motifs

Motifs are 6 – 20 nucleotides long

Activators and repressors

Usually located near the target gene, mostly upstream

Page 16: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Challenges

How to recognize a regulatory motif?

Can we identify new occurrences of known motifs in genome sequences?

Can we discover new motifs within upstream sequences of genes?

Page 17: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Motif Representation

Exact motif: CGGATATA

Consensus: represent only deterministic nucleotides.

Example: HAP1 binding sites in 5 sequences.

Consensus motif: CGGNNNTANCGG

N stands for any nucleotide.

Representing only the consensus loses information. How can this be avoided?

CGGATATACCGG

CGGTGATAGCGG

CGGTACTAACGG

CGGCGGTAACGG

CGGCCCTAACGG

------------

CGGNNNTANCGG
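A short sketch of how the consensus above can be derived from the five HAP1 sites: a column keeps its base only when every site agrees, and becomes N otherwise. This all-or-nothing rule is a simplifying assumption; it ignores the other IUPAC degenerate codes.

```matlab
sites = {'CGGATATACCGG', 'CGGTGATAGCGG', 'CGGTACTAACGG', ...
         'CGGCGGTAACGG', 'CGGCCCTAACGG'};
aln = char(sites);                        % 5-by-12 character matrix

consensus = repmat('N', 1, size(aln, 2)); % start with N everywhere
for i = 1:size(aln, 2)
    col = aln(:, i);
    if all(col == col(1))
        consensus(i) = col(1);            % fully conserved column
    end
end
disp(consensus)                           % CGGNNNTANCGG
```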

Page 18: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

1 2 3 4 5

A 10 25 5 70 60

C 30 25 80 10 15

T 50 25 5 10 5

G 10 25 10 10 20

PSPM – Position Specific Probability Matrix

Represents a motif of length k (5)

Count the number of occurrences of each nucleotide at each position

Page 19: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

PSPM – Position Specific Probability Matrix

Defines Pi{A,C,G,T} for i={1,..,k}. Pi (A) – frequency of nucleotide A in position i.

Page 20: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Identification of Known Motifs within Genomic Sequences

Motivation:

Identify new genes controlled by the same TF.

Infer the function of these genes.

Enable better understanding of the regulation mechanism.

Page 21: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

PSPM – Position Specific Probability Matrix

Each k-mer is assigned a probability. Example: P(TCCAG)=0.5*0.25*0.8*0.7*0.2
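The product above can be computed by looking up one matrix entry per position. This is a minimal sketch; the `alphabet` string simply records the row order used in the slide's table (A, C, T, G), and the variable names are illustrative.

```matlab
P = [0.1 0.25 0.05 0.7 0.6;    % A
     0.3 0.25 0.8  0.1 0.15;   % C
     0.5 0.25 0.05 0.1 0.05;   % T
     0.1 0.25 0.1  0.1 0.2];   % G
alphabet = 'ACTG';             % row order of the table above

kmer = 'TCCAG';
[~, rows] = ismember(kmer, alphabet);          % row index of each base
idx = sub2ind(size(P), rows, 1:length(kmer));  % pair row i with column i
prob = prod(P(idx));
fprintf('P(%s) = %.3f\n', kmer, prob);         % prints P(TCCAG) = 0.014
```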

Page 22: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence.

At each position the sub-sequence is scored for a match to the PSPM.

Example:

sequence = ATGCAAGTCT…

Page 23: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

The PSPM is moved along the query sequence.

At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA

0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

Page 24: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

The PSPM is moved along the query sequence.

At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA

0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

Position 2: TGCAA

0.5*0.25*0.8*0.7*0.6 = 0.042

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

Page 25: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix)

Odds score matrix: Oi(n), where n ∈ {A,C,G,T} and i = 1,..,k, defined as Pi(n)/P(n), where P(n) is the background frequency of nucleotide n.

As Oi(n) increases, so do the odds that n at position i is part of a real motif.

Page 26: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25.

Original PSPM (Pi):

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

Odds Matrix (Oi):

1 2 3 4 5

A 0.4 1 0.2 2.8 2.4

Going to log scale we get an additive score, the Log Odds Matrix (log2 Oi):

1 2 3 4 5

A -1.322 0 -2.322 1.485 1.263

Page 27: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

1 2 3 4 5

A -1.32 0 -2.32 1.48 1.26

C 0.26 0 1.68 -1.32 -0.74

T 1 0 -2.32 -1.32 -2.32

G -1.32 0 -1.32 -1.32 -0.32

Calculating using Log Odds Matrix

A log-odds score of 0 implies a random match; a score > 0 implies a real match (?).

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA

-1.32+0-1.32-1.32+1.26 = -2.7; odds = 2^-2.7 = 0.15

Position 2: TGCAA

1+0+1.68+1.48+1.26 = 5.42; odds = 2^5.42 = 42.8

Page 28: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Calculating the probability of a match

ATGCAAG

Position 1 ATGCA = 0.15

Position 2 TGCAA = 42.8

Position 3 GCAAG = 0.18

P(i) = S(i) / ∑ S, where S(i) is the odds score at position i

Example: P(1) = 0.15 / (0.15+42.8+0.18) = 0.003

P(1) = 0.003

P(2) = 0.992

P(3) = 0.004
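The whole calculation on this slide (log-odds score per window, conversion to odds, normalization to a probability) can be sketched as a sliding-window loop. The code assumes the PSPM from the earlier slides with rows in the order A, C, T, G and a uniform background of 0.25. Because it uses unrounded log-odds values, the odds come out slightly different from the slide (about 43.0 rather than 42.8 for position 2), while the normalized probabilities agree to about two decimal places.

```matlab
P = [0.1 0.25 0.05 0.7 0.6;    % A
     0.3 0.25 0.8  0.1 0.15;   % C
     0.5 0.25 0.05 0.1 0.05;   % T
     0.1 0.25 0.1  0.1 0.2];   % G
alphabet = 'ACTG';             % row order of the PSPM table
L = log2(P / 0.25);            % log-odds matrix, uniform background

seq = 'ATGCAAG';
k = size(P, 2);
nWin = length(seq) - k + 1;    % 3 windows of width 5
odds = zeros(1, nWin);
for s = 1:nWin
    [~, rows] = ismember(seq(s:s+k-1), alphabet);
    S = sum(L(sub2ind(size(L), rows, 1:k)));   % summed log-odds score
    odds(s) = 2 ^ S;                           % back to an odds score
end
prob = odds / sum(odds);       % probability of the motif at each start
disp(prob)
```

Position 2 (TGCAA) receives nearly all of the probability mass, as on the slide.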

Page 29: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Building a PSSM

Collect all known sequences that bind a certain TF.

Align all sequences (using multiple sequence alignment).

Compute the frequency of each nucleotide in each position (PSPM).

Incorporate background frequency for each nucleotide (PSSM).

Page 30: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Finding new Motifs

We are given a group of genes, which presumably contain a common regulatory motif.

We know nothing of the TF that binds to the putative motif.

The problem: discover the motif.

Page 31: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Example

Predicting the cAMP Receptor Protein (CRP) binding site motif

Page 32: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA

Extract experimentally defined CRP Binding Sites

Page 33: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

GGATAACAATTTCACA

TGTGAGCGGATAACAA

TGTGAGTTAGCTCACT

TGTGATCTCTGTTACA

CGAGGATGAGAACACA

CTCGGTTTAGTTCACC

TGTGACACAGTGCAAA

CCTGACGGAGTTCACA

AGTGTCTATAATCACG

TGGAATATCCATCACA

TGCAAAGGACGTCACG

GGCGACCTGGGTCATG

TGTGATGTGTATCGAA

TTTGAACCACATCGCA

GGTGAGAGCCATCACA

TGTAAGCTGTGCCACG

TTTATTCCATGTCACG

TGTTATACACATCACT

CGTGCTCCCACTCGCA

TGTGATTCGATTCACA

Create a Multiple Sequence Alignment

Page 34: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

A C G T

1 -0.43 0.1 -0.46 0.55

2 1.37 0.12 -1.59 -11.2

3 1.69 -1.28 -11.2 -1.43

4 -1.28 0.12 -11.2 1.32

5 0.91 -11.2 -0.46 0.47

6 1.53 -1.38 -1.48 -1.43

7 0.9 -0.48 -11.2 0.12

8 -1.37 -1.28 -11.2 1.68

9 -11.2 -11.2 1.73 -0.56

10 -11.2 -0.51 -11.2 1.72

11 -0.48 -11.2 1.72 -11.2

12 1.56 -1.59 -11.2 -0.46

13 -0.51 -0.38 -0.55 0.88

14 -11.2 0.5 0.57 0.13

15 0.17 -0.51 0.12 0.12

16 0.9 -11.2 0.5 -0.48

17 0.17 0.16 0.06 -0.48

18 -0.4 -0.38 0.82 -0.48

19 -1.38 -1.28 -11.2 1.68

20 -1.48 1.7 -11.2 -1.38

21 1.5 -1.38 -1.43 -1.28

Generate a PSSM

Page 35: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Shannon Entropy

Expected variation per column can be calculated

Low entropy means higher conservation

Page 36: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Entropy

The entropy (H) for a column is:

$$H = -\sum_{a \in \text{residues}} f_a \log(p_a)$$

a: a residue

$f_a$: frequency of residue a in the column

$p_a$: probability of residue a in that column
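As a small numerical illustration of the column entropy, the sketch below applies it to the five-column PSPM from the earlier slides. It assumes $f_a = p_a$ (the observed frequency is used as the probability estimate) and log base 2, so the result is in bits. Terms with zero probability are treated as contributing 0.

```matlab
P = [0.1 0.25 0.05 0.7 0.6;    % A
     0.3 0.25 0.8  0.1 0.15;   % C
     0.5 0.25 0.05 0.1 0.05;   % T
     0.1 0.25 0.1  0.1 0.2];   % G

terms = P .* log2(P);
terms(P == 0) = 0;             % convention: 0 * log2(0) = 0
H = -sum(terms, 1);            % one entropy value per column, in bits
disp(H)
```

Column 2, where all four bases are equally likely, gives the maximum of 2 bits; column 3, dominated by C, gives roughly 1.02 bits, reflecting higher conservation.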

Page 37: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Entropy

entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc) should be used

Entropy yields amount of information per column (discussed with sequence logos in a bit)

Page 38: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Log-odds score

Profiles can also indicate log-odds score: Log2(observed:expected)

Result is a bit score

Page 39: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Matlab

multialign

1 Enter an array of sequences.

seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

2 Promote terminations with gaps in the alignment.

multialign(seqs,'terminalGapAdjust',true)

ans =

--CACGTAACATCTC--

ACGACGTAACATCTTCT

-AAACGTAACATCTCGC

Page 40: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Matlab

3 Compare alignment without termination gap adjustment.

multialign(seqs)

ans =

CA--CGTAACATCT--C

ACGACGTAACATCTTCT

AA-ACGTAACATCTCGC

Page 41: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Matlab

>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}

a =

'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'

Page 42: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Char function

>> cseq=char(a)

cseq =

ATATAGGAG

AATTATAGA

TTAGAGAAA

Page 43: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Double function

>> intseq=double(cseq)

intseq =

65 84 65 84 65 71 71 65 71

65 65 84 84 65 84 65 71 65

84 84 65 71 65 71 65 65 65

Page 44: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

double

>> double('A')

ans = 65

>> double('C')

ans = 67

>> double('G')

ans = 71

>> double('T')

ans = 84

Page 45: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Initiate PSPM matrix

>> Pspm=zeros(4,size(intseq,2))

Pspm =

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0

Page 46: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Use a for loop to count each nucleotide at each position

>> for i = 1:size(intseq,2)

Pspm(1,i)=length(find(intseq(:,i)==65)); % A

Pspm(2,i)=length(find(intseq(:,i)==67)); % C

Pspm(3,i)=length(find(intseq(:,i)==71)); % G

Pspm(4,i)=length(find(intseq(:,i)==84)); % T

end

>> Pspm

Pspm =

2 1 2 0 3 0 2 2 2

0 0 0 0 0 0 0 0 0

0 0 0 1 0 2 1 1 1

1 2 1 2 0 1 0 0 0

Page 47: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Add pseudocounts

>> Pspmp=Pspm+1

Pspmp =

3 2 3 1 4 1 3 3 3

1 1 1 1 1 1 1 1 1

1 1 1 2 1 3 2 2 2

2 3 2 3 1 2 1 1 1

Page 48: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Normalize to get frequencies

>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)

Pspmnorm =

Columns 1 through 7

0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286

0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429

0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857

0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429

Columns 8 through 9

0.4286 0.4286

0.1429 0.1429

0.2857 0.2857

0.1429 0.1429

Page 49: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Calculate odds score

>> Pswm=Pspmnorm/0.25

Pswm =

Columns 1 through 7

1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143

0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714

0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429

1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714

Columns 8 through 9

1.7143 1.7143

0.5714 0.5714

1.1429 1.1429

0.5714 0.5714

Page 50: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Log odds ratio

>> logPswm=log2(Pswm)

logPswm =

Columns 1 through 7

0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776

-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074

-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926

0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074

Columns 8 through 9

0.7776 0.7776

-0.8074 -0.8074

0.1926 0.1926

-0.8074 -0.8074

Page 51: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Estimate the probability of the given sequence to belong to the defined PSWM

>> Unknown='TTAAGAAGG'

Unknown =

TTAAGAAGG

>> intunknown=double(Unknown)

intunknown =

84 84 65 65 71 65 65 71 71

Page 52: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Get the row index into the PSWM for each base of the unknown sequence (A=1, C=2, G=3, T=4)

>> intunknown(intunknown==65)=1;

>> intunknown(intunknown==67)=2;

>> intunknown(intunknown==71)=3;

>> intunknown(intunknown==84)=4;

>> intunknown

intunknown =

4 4 1 1 3 1 1 3 3

Page 53: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Calculate the log odds-ratio of the Unknown 'TTAAGAAGG'

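>> % Pair each row index with its own column; logPswm(intunknown) alone would read only column 1.

>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));

>> logunknown=logPswm(idx)

logunknown =

Columns 1 through 7

0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776

Columns 8 through 9

0.1926 0.1926

>> Punknown=sum(logunknown)

Punknown =

0.4887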

Page 54: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Is this a significant score or just random similarity?

>> cseq

cseq =

ATATAGGAG

AATTATAGA

TTAGAGAAA

>> Unknown

Unknown =

TTAAGAAGG

Page 55: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

What would be the maximum score?

>> logPswm

logPswm =

Columns 1 through 7

0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776

-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074

-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926

0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074

Columns 8 through 9

0.7776 0.7776

-0.8074 -0.8074

0.1926 0.1926

-0.8074 -0.8074

>> maxscore=max(logPswm)

maxscore =

Columns 1 through 7

0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776

Columns 8 through 9

0.7776 0.7776

>> totalmaxscore=sum(maxscore)

totalmaxscore =

7.4135

Page 56: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Write a function using the above statements to scan a sequence

Write a function named ‘logodds’ that calculates the log-odds ratio of a given alignment.

Write a function named ‘scanmotif’ that calls the ‘logodds’ to search through a sequence using a sliding window to calculate the logodds of a subsequence and store these scores. The function should allow for selection of a maximum number of locations that are likely to contain the motif based on the scores obtained.
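One possible starting point for this exercise is sketched below as a plain script rather than as the two requested functions, so the steps are easy to follow before wrapping them up. The `target` sequence and the `maxHits` value are hypothetical; the alignment, the +1 pseudocount, and the 0.25 background follow the MATLAB slides above.

```matlab
aln = char({'ATATAGGAG', 'AATTATAGA', 'TTAGAGAAA'});
alphabet = 'ACGT';
k = size(aln, 2);

% Body of a possible 'logodds': counts + 1, normalize, divide by 0.25.
counts = zeros(4, k);
for a = 1:4
    counts(a, :) = sum(aln == alphabet(a), 1);
end
freq = (counts + 1) ./ repmat(sum(counts + 1, 1), 4, 1);
logodds = log2(freq / 0.25);

% Body of a possible 'scanmotif': score every window of width k.
target = 'GGCATTAAGAAGGCCATATAGGAGTT';   % hypothetical sequence to scan
nWin = length(target) - k + 1;
scores = zeros(1, nWin);
for s = 1:nWin
    [~, rows] = ismember(target(s:s+k-1), alphabet);
    scores(s) = sum(logodds(sub2ind(size(logodds), rows, 1:k)));
end

% Keep at most maxHits start positions, best score first.
maxHits = 3;
[~, order] = sort(scores, 'descend');
hits = order(1:min(maxHits, nWin));
disp(hits)
```

In this toy target the best hit starts at position 16, where the window is ATATAGGAG, one of the aligned sites.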

Page 57: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

A Position Specific Scoring Matrix (PSSM) incorporates information theory to indicate the information contained within each column of a multiple alignment.

Information is a logarithmic transformation of the frequency of each residue in the motif.

Page 58: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSMs and Pseudocounts

Problem: PSSMs are only as good as the initial MSA

Some residues may be underrepresented

Other columns may be too conserved

Solution: Introduce pseudocounts to get a better indication

Page 59: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Pseudocounts

New estimated probability:

$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$

$P_{ca}$: probability of residue a in column c

$n_{ca}$: count of a's in column c

$b_{ca}$: pseudocount of a's in column c

$N_c$: total count in column c

$B_c$: total pseudocount in column c
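A worked instance of the pseudocount estimate, using the first column of the three-sequence MATLAB example above (counts A=2, C=0, G=0, T=1) and a pseudocount of 1 per residue. The choice of 1 is the one used on the earlier slide; other pseudocount schemes are possible.

```matlab
n = [2; 0; 0; 1];                 % n_ca: counts of A, C, G, T in column c
b = [1; 1; 1; 1];                 % b_ca: pseudocount for each residue
P = (n + b) / (sum(n) + sum(b));  % (n_ca + b_ca) / (N_c + B_c)
disp(P')                          % 3/7  1/7  1/7  2/7
```

Without the pseudocounts, C and G would have probability 0 and a log-odds score of minus infinity.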

Page 60: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSMs and pseudocounts

Probabilities are converted into a log-odds form (usually log2, so the information can be reported in bits) and placed in the PSSM.

Page 61: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Searching PSSMs

value for the first residue in the sequence occurring in the first column is calculated by searching the PSSM

the value for the residue occurring in each column is calculated

Page 62: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Searching PSSMs

values are added (since they are logarithms) to produce a summed log odds score, S

S can be converted to an odds score using the formula 2^S

odds scores for each position can be summed together and normalized to produce a probability of the motif occurring at each location.

Page 63: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Information in PSSMs

Information theory: amount of information contained within each sequence.

No information: the amount of uncertainty can be measured as log2(20) = 4.32 for amino acids, since there are 20 amino acids.

For nucleic acid sequences, the amount of uncertainty can be measured as log2(4) = 2.

Page 64: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Information in PSSMs

If a column is completely conserved then the uncertainty is 0 – there is only one choice.

If two residues occur with equal probability, there is 1 bit of uncertainty in deciding which residue it is.

Page 65: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Measure of Uncertainty

Measured as the entropy:

$$H_c = -\sum_{a \in \text{residues}} f_{ca} \log(p_{ca})$$

Page 66: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Relative Entropy

Relative entropy takes into account the overall composition of the organism being studied:

$$R_c = \sum_{a \in \text{residues}} f_{ca} \log_2(p_{ca} / b_a)$$

$b_a$ is the background frequency of residue a in the organism
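To show how the background composition changes the result, the sketch below evaluates the relative entropy of one column under two backgrounds. The column frequencies are position 4 of the earlier PSPM; the AT-rich background is a made-up example, and $f_{ca} = p_{ca}$ is assumed.

```matlab
p = [0.7 0.1 0.1 0.1];               % column frequencies for A, C, G, T
bUniform = [0.25 0.25 0.25 0.25];    % uniform background
bATrich  = [0.35 0.15 0.15 0.35];    % hypothetical AT-rich organism

R_uniform = sum(p .* log2(p ./ bUniform));
R_atrich  = sum(p .* log2(p ./ bATrich));
fprintf('uniform: %.3f bits, AT-rich: %.3f bits\n', R_uniform, R_atrich);
```

With a uniform background the relative entropy equals 2 minus the column entropy. An A-dominated column is less surprising in an AT-rich genome, so its relative entropy is lower there.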

Page 67: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

PSSM Uncertainty

Uncertainty for the whole model is summed over all columns:

$$H = \sum_{c \in \text{all columns}} H_c$$

Page 68: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Sequence Logos

Information in PSSMs can be viewed visually

Sequence logos illustrate information in each column of a motif

height of logo is calculated as the amount by which uncertainty has been decreased
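The logo height per column can be sketched as the maximum uncertainty for DNA, log2(4) = 2 bits, minus the column entropy. The example uses the five 7-mers from the Motif Logo slide and assumes no small-sample correction, which logo software typically applies.

```matlab
sites = char({'TGGGGGA', 'TGAGAGA', 'TGGGGGA', 'TGAGAGA', 'TGAGGGA'});
alphabet = 'ACGT';
k = size(sites, 2);

info = zeros(1, k);
for c = 1:k
    p = zeros(1, 4);
    for a = 1:4
        p(a) = mean(sites(:, c) == alphabet(a));   % base frequency
    end
    nz = p(p > 0);                                 % skip zero terms
    H = -sum(nz .* log2(nz));                      % column entropy, bits
    info(c) = log2(4) - H;                         % decrease in uncertainty
end
disp(round(info * 100) / 100)
```

The conserved columns reach the full 2 bits, while positions 3 and 5, where the motifs vary, drop to about 1.03 bits.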

Page 69: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Sequence Logos

Page 70: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Statistical Methods

Commonly used methods for locating motifs:

Expectation-Maximization (EM)

Gibbs Sampling

Page 71: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Expectation-Maximization

Begin with a set of sequences with an unknown signal in common

The signal may be subtle

The approximate length of the signal must be given

Randomly assign locations of this motif in each sequence

Page 72: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Expectation-Maximization

Two steps:

Expectation Step

Maximization Step

Page 73: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Expectation-Maximization

Expectation step

Residue frequencies for each position are calculated

Residues not in a motif are background

Frequencies are used to determine the probability of finding a site at any position in a sequence that fits the motif model

Page 74: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Maximization Step

Determine location for each sequence that maximally aligns to the motif pattern

Once a new motif location is found for each sequence, the motif pattern is revised in the expectation step

E-M continues until solution converges
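The loop described on these slides can be sketched as follows. This is a simplified, hard-assignment version that follows the slide wording (each sequence is moved to its single best-scoring window); it is not how MEME itself is implemented, since that software weights all candidate positions. The toy sequences, the motif width of 6, the +1 pseudocount, the uniform 0.25 background, and the cap of 50 iterations are all assumptions made for illustration.

```matlab
seqs = {'ACGTTGACATGC', 'TTGACAGGCATC', ...
        'GGCATTGACAAC', 'CATGTTGACAGT'};   % hypothetical, share TTGACA
k = 6;                                     % assumed motif width
alphabet = 'ACGT';
n = numel(seqs);

pos = zeros(1, n);
for j = 1:n
    pos(j) = randi(length(seqs{j}) - k + 1);   % random initial locations
end

for iter = 1:50
    % Expectation step: residue frequencies from the current sites.
    counts = ones(4, k);                       % +1 pseudocount per cell
    for j = 1:n
        [~, rows] = ismember(seqs{j}(pos(j):pos(j)+k-1), alphabet);
        idx = sub2ind([4 k], rows, 1:k);
        counts(idx) = counts(idx) + 1;
    end
    logodds = log2((counts / (n + 4)) / 0.25);

    % Maximization step: move each site to its best-scoring window.
    newPos = pos;
    for j = 1:n
        nWin = length(seqs{j}) - k + 1;
        scores = zeros(1, nWin);
        for s = 1:nWin
            [~, rows] = ismember(seqs{j}(s:s+k-1), alphabet);
            scores(s) = sum(logodds(sub2ind([4 k], rows, 1:k)));
        end
        [~, newPos(j)] = max(scores);
    end
    if isequal(newPos, pos)
        break;                                 % locations stopped changing
    end
    pos = newPos;
end
disp(pos)
```

Because the start is random, a run may settle on a local optimum rather than the planted TTGACA sites, which is the usual motivation for trying several starting alignments.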

Page 75: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCTCCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTGTCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAGAAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTCGGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGCAGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGAGCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCACATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCTTCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGCGCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCCCATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGGGATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAGTCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGACCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGCATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGTAGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCAGCACACACACTTATCCAGTGGTAAATACACATCATTCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGATACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGATGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAGCAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAACTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAAGAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCTTGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACTGGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGTCAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTGCCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCAGGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTGCTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

Page 76: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Residue Counts

Given motif alignment, count for each location is calculated:

Page 77: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Residue Frequencies

The counts are then converted to frequencies:

Page 78: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Example Maximization Step

Consider the first sequence:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

There are 41 residues; with a motif width of 6, there are 41-6+1 = 36 sites to consider

Page 79: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

MEME Software

One of three motif models:

OOPS: One expected occurrence per sequence

ZOOPS: Zero or one expected occurrence per sequence

TCM: Any number of occurrences of the motif

Page 80: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Gibbs Sampling

Similar to the E-M algorithm

Combines E-M and simulated annealing

Goal: Find most probable pattern by sampling from motif probabilities to maximize ratio of model:background probabilities

Page 81: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Predictive Update Step

random motif start position chosen for all sequences except one

Initial alignment used to calculate residue frequencies for motif and background

similar to the Expectation Step of EM

Page 82: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Sampling Step

ratio of model:background probabilities normalized and weighted

motif start position chosen based on a random sampling with the given weights

Different than E-M algorithm

Page 83: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Gibbs Sampling

process repeated until residue frequencies in each column do not change

The sampling step is then repeated for a different initial random alignment

Sampling allows escape from local maxima
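The predictive update and sampling steps can be sketched as below. This is a bare-bones illustration, not the Wadsworth Gibbs sampler: it runs a fixed number of sweeps instead of testing for convergence, uses a uniform 0.25 background instead of estimating it from non-motif residues, and omits the shifting routine and multiple restarts. The sequences, motif width, and sweep count are hypothetical.

```matlab
seqs = {'ACGTTGACATGC', 'TTGACAGGCATC', ...
        'GGCATTGACAAC', 'CATGTTGACAGT'};   % hypothetical, share TTGACA
k = 6;                                     % assumed motif width
alphabet = 'ACGT';
n = numel(seqs);

pos = zeros(1, n);
for j = 1:n
    pos(j) = randi(length(seqs{j}) - k + 1);   % random start positions
end

for sweep = 1:200
    for j = 1:n
        % Predictive update: build the model from all sequences except j.
        counts = ones(4, k);                   % pseudocounts
        for m = [1:j-1, j+1:n]
            [~, rows] = ismember(seqs{m}(pos(m):pos(m)+k-1), alphabet);
            idx = sub2ind([4 k], rows, 1:k);
            counts(idx) = counts(idx) + 1;
        end
        logodds = log2((counts / (n - 1 + 4)) / 0.25);

        % Sampling step: weight every window by its model:background ratio.
        nWin = length(seqs{j}) - k + 1;
        w = zeros(1, nWin);
        for s = 1:nWin
            [~, rows] = ismember(seqs{j}(s:s+k-1), alphabet);
            w(s) = 2 ^ sum(logodds(sub2ind([4 k], rows, 1:k)));
        end
        c = cumsum(w / sum(w));
        c(end) = 1;                            % guard against rounding
        pos(j) = find(rand() <= c, 1);         % draw a start position
    end
end
disp(pos)
```

Unlike the maximization step of E-M, the new position is drawn at random in proportion to the weights, so a lower-scoring window is occasionally chosen; this is what lets the sampler leave a local maximum.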

Page 84: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Gibbs Sampling

Dirichlet priors (pseudocounts) are added into the nucleotide counts to improve performance

shifting routine shifts motif a few bases to the left or the right

A range of motif sizes is checked

Page 85: Motif Finding PSSMs Expectation Maximization Gibbs Sampling

Gibbs Sampler Web Interface

http://bayesweb.wadsworth.org/gibbs/gibbs.html