Motif Search

Upload: tamanna-darshan

Post on 02-Jan-2016



TRANSCRIPT

Page 1: Motif Search

Motif Search

Page 2: Motif Search

What are Motifs

• Motif (dictionary) A recurrent thematic element, a common theme

Page 3: Motif Search

Find a common motif in the text

Page 4: Motif Search

Find a short common motif in the text

Page 5: Motif Search

Motifs in biological sequences

A sequence motif is a short common sequence (length 4-20) that is highly represented in the data

Page 6: Motif Search

Challenges in biological sequences

Motifs are usually not exact words

Page 7: Motif Search

How to represent non-exact motifs?

• Consensus string: NTAHAWT

May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc.

• Position Weight Matrix (PWM)

Probability for each base in each position:

      1    2    3    4    5    6
A    0.1  0.7  0.2  0.6  0.5  0.1
T    0.7  0.1  0.5  0.2  0.2  0.8
G    0.1  0.1  0.1  0.1  0.1  0.0
C    0.1  0.1  0.2  0.1  0.1  0.1

Page 8: Motif Search

Motifs in biological sequences

– Regulatory motifs in DNA (transcription factor binding sites)

– Functional sites in proteins (e.g., phosphorylation sites)

What can we learn from these motifs?

Page 9: Motif Search

DNA Regulatory Motifs

• Transcription factors (TFs) are regulatory proteins that bind to regulatory motifs near a gene and act as a switch button (on/off)

– TF binding motifs are usually 6-20 nucleotides long

– They are located near the target gene, mostly upstream of the transcription start site

[Figure: TF1 and TF2 bound to the TF1 motif and the TF2 motif near the transcription start site of Gene X]

Page 10: Motif Search

Can we find TF targets using a bioinformatics approach?

Page 11: Motif Search

p53 is a transcription factor involved in most human cancers

We are interested in identifying the genes regulated by p53

Page 12: Motif Search

Finding TF targets using a bioinformatics approach?

Scenario 1 : Binding motif is known (easier case)

Scenario 2 : Binding motif is unknown (hard case)

Page 13: Motif Search

Scenario 1 : Binding motif is known

• Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence

Page 14: Motif Search

Given a consensus:

For each position l in the input sequence, check whether the substring starting at position l matches the motif.

Example: find the consensus motif NTAHAWT in the promoter of a gene

>promoter of gene
AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA…….
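The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `find_consensus_sites` and the `IUPAC` table are assumptions introduced here, and only the visible (truncated) part of the promoter is searched.

```python
# Degenerate symbols listed on the slide, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "H": "ACT",
    "S": "CG", "R": "AG", "Y": "CT",
}

def find_consensus_sites(sequence, consensus):
    """Return (0-based start, matched substring) for every window matching the consensus."""
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # A window matches if every base is allowed by the symbol at that position.
        if all(base in IUPAC[symbol] for base, symbol in zip(window, consensus)):
            hits.append((start, window))
    return hits

# Visible part of the promoter from the slide (the original sequence is truncated).
promoter = "AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"
print(find_consensus_sites(promoter, "NTAHAWT"))
```

On the visible part of the promoter this reports two matches, GTATATT at offset 5 and ATAAATT at offset 38 (0-based). Only the forward strand is scanned here.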

Page 15: Motif Search

Given a Position Weight Matrix (PWM):

Starting from a set of aligned motifs:

Seq 1 AAAAGCCC
Seq 2 CTAATCCA
Seq 3 CTAATCCC
Seq 4 CTAATCCC
Seq 5 GTAATCCC
Seq 6 CTAATCCC
Seq 7 CTAATCCC
Seq 8 CTAATCCC
Seq 9 TTAATCTG

Page 16: Motif Search

Given a Position Weight Matrix (PWM):

Counts of each base in each column:

      1    2    3    4    5    6    7    8
A     1    1    9    9    0    0    0    1
C     6    0    0    0    0    9    8    7
G     1    0    0    0    1    0    0    1
T     1    8    0    0    8    0    1    0

Probability of each base in each column (W):

      1    2    3    4    5    6    7    8
A    .11  .11   1    1    0    0    0   .11
C    .67   0    0    0    0    1   .89  .78
G    .11   0    0    0   .11   0    0   .11
T    .11  .89   0    0   .89   0   .11   0

$W_{b,k}$ = probability of base $b$ in column $k$

• Given a string s of length l = 8

• $s = s_1 s_2 \ldots s_l$

• $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k,k}$

• Example: Pr(CTAATCCG) = 0.67 × 0.89 × 1 × 1 × 0.89 × 1 × 0.89 × 0.11 ≈ 0.052
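As a rough check of the calculation above, the following sketch turns the count matrix into probabilities and multiplies the per-position probabilities of a candidate site. The names `COUNTS`, `counts_to_pwm` and `site_probability` are illustrative assumptions, not part of the original slides. No pseudocounts are added, so any base never seen in a column gives probability 0.

```python
BASES = "ACGT"

# Count matrix from the slide: one row per base, one column per motif position.
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}

def counts_to_pwm(counts):
    """Divide each count by its column total to get per-column probabilities."""
    width = len(counts["A"])
    totals = [sum(counts[b][k] for b in BASES) for k in range(width)]
    return {b: [counts[b][k] / totals[k] for k in range(width)] for b in BASES}

def site_probability(site, pwm):
    """Pr(s | W): product over positions k of W[s_k][k]."""
    if len(site) != len(pwm["A"]):
        raise ValueError("site length must equal the PWM width")
    prob = 1.0
    for k, base in enumerate(site):
        prob *= pwm[base][k]
    return prob

W = counts_to_pwm(COUNTS)
print(round(site_probability("CTAATCCG", W), 3))
```

With unrounded probabilities the example site CTAATCCG comes out at about 0.052, in line with the product of the rounded values shown on the slide.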

Page 17: Motif Search

Given a Position Weight Matrix (PWM)

• Given sequence S (e.g., 1000 base-pairs long)

• For each substring s of S,

– Compute Pr(s|W)

– If Pr(s|W) > some threshold, call that a binding site

• In DNA sequences we need to search both strands:

AGTTACACCA
TGGTGTAACT (reverse complement)
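A minimal sketch of this scan, covering both strands, might look as follows. The probability matrix is the rounded one from the earlier slide; the function names, the example sequence and the threshold of 0.1 are assumptions chosen for illustration only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Rounded probability matrix W from the slides (rows: bases, columns: positions 1-8).
W = {
    "A": [0.11, 0.11, 1, 1, 0, 0, 0, 0.11],
    "C": [0.67, 0, 0, 0, 0, 1, 0.89, 0.78],
    "G": [0.11, 0, 0, 0, 0.11, 0, 0, 0.11],
    "T": [0.11, 0.89, 0, 0, 0.89, 0, 0.11, 0],
}

def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def site_probability(site, pwm):
    """Pr(s | W) as the product of per-position base probabilities."""
    prob = 1.0
    for k, base in enumerate(site):
        prob *= pwm[base][k]
    return prob

def scan_both_strands(sequence, pwm, threshold):
    """Return (start, strand, site, probability) for every window above the threshold.

    start is a 0-based offset on the forward strand for both strands.
    """
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            prob = site_probability(site, pwm)
            if prob > threshold:
                hits.append((start, strand, site, prob))
    return hits

print(reverse_complement("AGTTACACCA"))  # the slide's example
print(scan_both_strands("AAGGGATTAGAA", W, threshold=0.1))  # hypothetical input
```

In practice the threshold is a tuning choice; scores are also often computed as log-odds against a background model rather than as raw probabilities, which this sketch does not attempt.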

Page 18: Motif Search

Scenario 2 : Binding motif is unknown

“Ab initio motif finding”

Page 19: Motif Search

Ab initio motif finding: Expectation Maximization

• Local search algorithm

– Start from a random PWM

– Move from one PWM to another so as to improve the score which fits the sequence to the motif

– Keep doing this until no more improvement is obtained: convergence to a local optimum

Page 20: Motif Search

Expectation Maximization

• Let W be a PWM. Let S be the input sequence.

• Imagine a process that randomly searches, picks different strings matching W, and threads them together into a new PWM

Page 21: Motif Search

Expectation Maximization

• Find W so as to maximize Pr(S|W)

• The “Expectation-Maximization” (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)

Page 22: Motif Search

Expectation Maximization

1. Start from a random motif (PWM).

2. Scan the sequence for good matches to the current motif.

3. Build a new PWM out of these matches, and make it the new motif.
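The three steps above can be sketched as a simple loop. This is a simplified, hard-assignment variant written for illustration, not the full EM algorithm: it assumes several input sequences with exactly one site each, keeps only the single best window per sequence, and adds a pseudocount of 0.5. All function names, the pseudocount and the example sequences are assumptions.

```python
import random

BASES = "ACGT"

def random_pwm(width, rng):
    """Step 1: a random PWM, one dict of base probabilities per column."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.01 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})
    return pwm

def score(window, pwm):
    """Pr(window | PWM) as a product of column probabilities."""
    prob = 1.0
    for base, column in zip(window, pwm):
        prob *= column[base]
    return prob

def best_window(seq, pwm):
    """Step 2: the best-scoring window of one sequence under the current PWM."""
    width = len(pwm)
    windows = (seq[i:i + width] for i in range(len(seq) - width + 1))
    return max(windows, key=lambda w: score(w, pwm))

def build_pwm(windows, pseudocount=0.5):
    """Step 3: rebuild the PWM from the chosen windows (pseudocounts avoid zeros)."""
    pwm = []
    for k in range(len(windows[0])):
        counts = {b: pseudocount for b in BASES}
        for w in windows:
            counts[w[k]] += 1
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def em_like_motif_search(seqs, width, max_iter=50, seed=0):
    """Repeat steps 2 and 3 until the chosen windows stop changing."""
    rng = random.Random(seed)
    pwm = random_pwm(width, rng)
    previous = None
    for _ in range(max_iter):
        windows = [best_window(s, pwm) for s in seqs]
        if windows == previous:
            break  # no further change: a local optimum
        pwm = build_pwm(windows)
        previous = windows
    return pwm, windows

# Hypothetical toy input with a shared GACGTC element.
pwm, sites = em_like_motif_search(["TTGACGTCAA", "GGGACGTCTT", "ACGACGTCGG"], width=6)
print(sites)
```

Because this is a local search, the result depends on the random start; running it from several seeds and keeping the best-scoring motif is a common remedy.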

Page 23: Motif Search

The final PWM represents the motif that is most enriched in the data

The PWM can also be represented as a sequence logo:

– A letter’s height indicates the information it contains

– The top letter at each position can be read to obtain the consensus sequence (motif)
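To make the logo idea concrete, the sketch below computes the information content of each column in bits and reads off the consensus. It assumes the common DNA convention of 2 + Σ p·log2(p) bits per column, with each letter's height equal to its probability times that value, and it ignores small-sample corrections. The matrix is the rounded one from the earlier slide and the function names are illustrative.

```python
import math

# Rounded PWM from the slides, stored as one dict per column.
PWM_COLUMNS = [
    {"A": 0.11, "C": 0.67, "G": 0.11, "T": 0.11},
    {"A": 0.11, "C": 0.0, "G": 0.0, "T": 0.89},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.11, "T": 0.89},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.89, "G": 0.0, "T": 0.11},
    {"A": 0.11, "C": 0.78, "G": 0.11, "T": 0.0},
]

def column_information(column):
    """Information content of one DNA column in bits: 2 + sum(p * log2 p)."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)

def logo_heights(column):
    """Height of each letter: its probability times the column's information."""
    info = column_information(column)
    return {base: p * info for base, p in column.items()}

def consensus(columns):
    """Read the most probable (top) letter at each position."""
    return "".join(max(col, key=col.get) for col in columns)

print(consensus(PWM_COLUMNS))
for k, col in enumerate(PWM_COLUMNS, start=1):
    print(k, round(column_information(col), 2))
```

A column with a single possible base carries the full 2 bits, while a column where all four bases are equally likely carries 0 bits and would be nearly invisible in the logo.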

Page 24: Motif Search

Are common motifs the right thing to search for ?

Page 25: Motif Search

?

Page 26: Motif Search

Solutions:

– Searching for motifs which are enriched in one set but not in a random set

– Using experimental information to rank the sequences according to their binding affinity, and searching for enriched motifs at the top of the list

Page 27: Motif Search

Searching for enriched motifs in a ranked list

[Figure: sequences 1, 2, 3, 4, … ranked by binding affinity]

Hypergeometric (HG) distribution test

k = number of motifs in the top of the list
m = number of sequences in the top of the list
n = number of total motifs found
N = total number of sequences

The P-value reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared to the rest of the list.
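As a sketch of the test, the hypergeometric tail probability can be computed directly from binomial coefficients. The code assumes each sequence is counted at most once (it either contains the motif or not), so n is the number of motif-containing sequences; the function name `hypergeom_tail` is an illustrative choice.

```python
from math import comb

def hypergeom_tail(N, n, m, k):
    """P(X >= k): chance of seeing at least k motif-containing sequences among
    the top m, when n of the N sequences contain the motif."""
    upper = min(n, m)
    total = comb(N, m)
    favorable = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical numbers: 100 sequences, 20 with the motif, 8 of them in the top 10.
print(hypergeom_tail(N=100, n=20, m=10, k=8))
```

A small value means the motif is concentrated at the top of the ranked list more than expected by chance.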

Page 28: Motif Search

Searching for enriched motifs in ranked list


Choosing the best way to cut the list (minimal HG score)
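One way to sketch the choice of cutoff is to try every possible cut of the ranked list and keep the one with the smallest HG tail probability. The input format (a list of 0/1 flags ordered from highest to lowest affinity) and the function names are assumptions made for this example.

```python
from math import comb

def hypergeom_tail(N, n, m, k):
    """P(X >= k) for k motif-containing sequences among the top m of N (n contain the motif)."""
    total = comb(N, m)
    return sum(comb(n, i) * comb(N - n, m - i) for i in range(k, min(n, m) + 1)) / total

def minimal_hg(occurrences):
    """Return (best cutoff m, HG tail probability at that cutoff).

    occurrences: 0/1 flags, one per sequence, ordered from highest to lowest
    binding affinity; 1 means the sequence contains the motif.
    """
    N = len(occurrences)
    n = sum(occurrences)
    best_m, best_p = 0, 1.0
    k = 0
    for m in range(1, N + 1):
        k += occurrences[m - 1]  # motif-containing sequences among the top m
        p = hypergeom_tail(N, n, m, k)
        if p < best_p:
            best_m, best_p = m, p
    return best_m, best_p

# Hypothetical ranked list: the motif occurs only in the three top sequences.
print(minimal_hg([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
```

Because the minimum is taken over many cutoffs, the resulting score is not itself a corrected P-value and needs an adjustment for the multiple cutoffs tried.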

Page 29: Motif Search

Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity

>affinity = 5.962
ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
>affinity = 5.937
AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
>affinity = 5.763
UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
>affinity = 5.498
GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUACAAAAAUAACA...

http://drimust.technion.ac.il/

Page 30: Motif Search

Protein Motifs

Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:

P[ED]XK[RW][RK]X[ED]

or as a PWM
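A consensus of this kind maps naturally onto a regular expression. The sketch below assumes that X stands for any amino acid and that square brackets list the allowed alternatives; the function names and the test protein are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate the consensus notation to a regex: X becomes '.', [..] stays as is."""
    return re.compile(pattern.replace("X", "."))

MOTIF = pattern_to_regex("P[ED]XK[RW][RK]X[ED]")

def find_pattern(protein):
    """Return (0-based start, matched substring) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]

# Hypothetical protein fragment containing one match.
print(find_pattern("MSPEAKRKGDLL"))
```

Note that `finditer` reports non-overlapping matches only, which is usually sufficient for a pattern of this length.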

Page 31: Motif Search

Protein Domains

• In addition to short protein motifs, proteins are characterized by domains.

• Domains are long motifs (30-100 aa) and are considered the building blocks of proteins (evolutionary modules).

The zinc-finger domain

Page 32: Motif Search

Some domains can be found in many proteins with different functions:

Page 33: Motif Search

….while other domains are only found in proteins with a certain function…..

MBD= Methylated DNA Binding Domain

Page 34: Motif Search

Varieties of protein domains


Extending along the length of a protein

Occupying a subset of a protein sequence

Occurring one or more times

Page 35: Motif Search

Pfam

• Database that contains a large collection of multiple sequence alignments of protein domains

• Based on profile hidden Markov models (HMMs).

An HMM, in comparison to a PWM, is a model which considers dependencies between the different columns in the matrix (different residues) and is thus much more powerful.

http://pfam.sanger.ac.uk/

Page 36: Motif Search

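Profile HMM (Hidden Markov Model) can accurately represent an MSA

Alignment, columns 16-19:

16 17 18 19
D  R  T  R
D  R  T  S
S  -  -  S
S  P  T  R
D  R  T  R
D  P  T  S
D  -  -  S
D  -  -  S
D  -  -  S
D  -  -  R

Match-state emission probabilities:

M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6

[Figure: profile HMM with match (M16–M19), insert (I16–I19) and delete (D16–D19) states; transition labels shown are 100% (four times) and 50% (twice)]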
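The emission probabilities in the figure can be recovered from the alignment by counting residues in each column. The sketch below does only that and reports the fraction of gaps per column; it does not build a full profile HMM with insert states or pseudocounts. The alignment rows are read from the slide and the function name is an illustrative choice.

```python
from collections import Counter

# Alignment columns 16-19 as read from the slide; "-" marks a gap.
ALIGNMENT = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]

def match_emissions(alignment):
    """Per column: residue frequencies among non-gap rows, and the gap fraction."""
    columns = []
    for k in range(len(alignment[0])):
        residues = [row[k] for row in alignment]
        non_gap = [r for r in residues if r != "-"]
        counts = Counter(non_gap)
        emissions = {r: c / len(non_gap) for r, c in counts.items()}
        gap_fraction = 1 - len(non_gap) / len(residues)
        columns.append((emissions, gap_fraction))
    return columns

for number, (emissions, gaps) in zip(range(16, 20), match_emissions(ALIGNMENT)):
    print(number, emissions, gaps)
```

Columns 17 and 18 are gapped in half of the rows, which appears to correspond to the 50% transition labels in the figure (half of the sequences pass through the delete states).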