motif search

Post on 21-Jan-2016


Motif Search

What are Motifs

• Motif (dictionary) A recurrent thematic element, a common theme

Find a common motif in the text

Find a short common motif in the text

Motifs in biological sequences

Sequence motifs are short common sequences (length 4-20) that are highly represented in the data

Motifs in biological sequences

– Regulatory motifs on DNA or RNA

– Functional sites in proteins

What can we learn from these motifs?

Regulatory Motifs on DNA

• Transcription Factors (TFs) are regulatory proteins that bind to regulatory motifs near the gene and act as a switch button (on/off)

– TF binding motifs are usually 6-20 nucleotides long

– They are located near the target gene, mostly upstream of the transcription start site

[Figure: Gene X with its transcription start site; TF1 and TF2 bind the TF1 motif and the TF2 motif]

What can we learn from these motifs?

About half of all cancer patients have a mutation in a gene called p53, which codes for a key transcription factor. The mutations are in the DNA-binding region and allow tumors to survive and continue growing even after chemotherapy severely damages their DNA

[Figure: the p53 transcription factor, its binding sites (motifs) and a target gene]

Why is P53 involved in so many cancer types?

We are interested in identifying the genes regulated by p53

p53 regulates over 100 different genes (it is a hub)

Can we find TF targets using a bioinformatics approach?

Finding TF targets using a bioinformatics approach

Scenario 1 : Binding motif is known (easier case)

Scenario 2 : Binding motif is unknown (hard case)

Scenario 1 : Binding motif is known

• Given a motif find the binding sites in an input sequence

Challenges in biological sequences

Motifs are usually not exact words

How to represent non-exact motifs?

• Consensus string NTAHAWT

May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; H=not G; S = C/G; R = A/G; Y = T/C etc.

• Position Specific Scoring Matrix (PSSM)

Probability for each base in each position:

     1    2    3    4    5    6
A   0.1  0.7  0.2  0.6  0.5  0.1
T   0.7  0.1  0.5  0.2  0.2  0.8
G   0.1  0.1  0.1  0.1  0.1  0.0
C   0.1  0.1  0.2  0.1  0.1  0.1

Given a consensus:

For each position l in the input sequence, check whether the substring starting at position l matches the motif. Example: find the consensus motif NTAHAWT in the promoter of a gene

>promoter of gene
AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA…….
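The consensus scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `IUPAC`, `matches` and `find_consensus` are chosen here, the table holds only the degenerate symbols listed on the slide, and positions are reported 0-based.

```python
# Degenerate symbols from the slide; each maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT",  # any base
    "W": "AT",
    "H": "ACT",   # not G
    "S": "CG",
    "R": "AG",
    "Y": "CT",
}

def matches(consensus, window):
    """True if every base of the window is allowed by the consensus symbol."""
    return len(window) == len(consensus) and all(
        base in IUPAC[symbol] for symbol, base in zip(consensus, window)
    )

def find_consensus(consensus, sequence):
    """Return the 0-based start positions where the consensus matches."""
    m = len(consensus)
    return [
        i for i in range(len(sequence) - m + 1)
        if matches(consensus, sequence[i:i + m])
    ]

# The promoter fragment shown on the slide (without the trailing ellipsis).
promoter = ("AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGG"
            "ACTCAGACCTTAAAA")
hits = find_consensus("NTAHAWT", promoter)
for i in hits:
    print(i, promoter[i:i + 7])
```

Only the given strand is scanned here; searching the reverse complement as well is a separate step.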

Given a PSSM:

Starting from a set of aligned motifs

Seq 1 AAAGCCC
Seq 2 CTATCCA
Seq 3 CTATCCC
Seq 4 CTATCCC
Seq 5 GTATCCC
Seq 6 CTATCCC
Seq 7 CTATCCC
Seq 8 CTATCCC
Seq 9 TTATCTG

Given a PSSM:

Counts of each base in each column:

     1  2  3  4  5  6  7  8
A    1  1  9  9  0  0  0  1
C    6  0  0  0  0  9  8  7
G    1  0  0  0  1  0  0  1
T    1  8  0  0  8  0  1  0

Probability of each base in each column (W):

      1    2    3    4    5    6    7    8
A   .11  .11    1    1    0    0    0  .11
C   .67    0    0    0    0    1  .89  .78
G   .11    0    0    0  .11    0    0  .11
T   .11  .89    0    0  .89    0  .11    0

$W_{b,k}$ = probability of base $b$ in column $k$

• Given a string s of length l = 8

• s = s1 s2 … sl

• $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k,k}$

• Example: Pr(CTAATCCG) = 0.67 x 0.89 x 1 x 1 x 0.89 x 1 x 0.89 x 0.11 ≈ 0.052

Given a PSSM:

• Given sequence S (e.g., 1000 base-pairs long)

• For each substring s of S,

– Compute Pr(s|W)

– If Pr(s|W) > some threshold, call that a binding site

• In DNA sequences we need to search both strands

AGTTACACCA
TGGTGTAACT (reverse complement)

Seq1: AAAACGTGCGTAGCAGTTACACCAACTCTA
      TTTTGCACGCATCGTCAATGTGGTTGAGAT

Seq2: ACTTACTACTGGTGTAACTATATATTTTCG
      TGAATGATGACCACATTGATATATAAAAGC
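The PSSM scan over both strands can be sketched as follows. This is a hedged illustration: the function names and the threshold value are assumptions made here, and `W` is copied from the 8-column probability matrix above. Positions reported for the "-" strand are relative to the reverse-complemented sequence.

```python
# Probability of each base in each column, copied from the matrix W above.
W = {
    "A": [0.11, 0.11, 1.0, 1.0, 0.0, 0.0, 0.0, 0.11],
    "C": [0.67, 0.0, 0.0, 0.0, 0.0, 1.0, 0.89, 0.78],
    "G": [0.11, 0.0, 0.0, 0.0, 0.11, 0.0, 0.0, 0.11],
    "T": [0.11, 0.89, 0.0, 0.0, 0.89, 0.0, 0.11, 0.0],
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def prob(s, W):
    """Pr(s | W): product over columns k of W[s_k][k]."""
    p = 1.0
    for k, base in enumerate(s):
        p *= W[base][k]
    return p

def scan(sequence, W, threshold):
    """Report (strand, start, site, probability) for every window
    whose probability under W exceeds the threshold."""
    width = len(W["A"])
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for i in range(len(seq) - width + 1):
            site = seq[i:i + width]
            p = prob(site, W)
            if p > threshold:
                hits.append((strand, i, site, p))
    return hits

print(prob("CTAATCCG", W))               # the worked example from the slide
print(scan("GGCTAATCCCGG", W, 0.1))      # threshold 0.1 is an arbitrary choice
```

Multiplying raw probabilities follows the slide. In practice the threshold has to be chosen for the data at hand, and long motifs are often scored with sums of logarithms to avoid very small numbers.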

Scenario 2 : Binding motif is unknown

“Ab initio motif finding”

Ab initio motif finding: Expectation Maximization

• Local search algorithm

– Start from a random PWM

– Move from one PWM to another so as to improve the score which fits the sequence to the motif

– Keep doing this until no more improvement is obtained: convergence to a local optimum

Expectation Maximization

• Let W be a PWM. Let S be the input sequence.

• Imagine a process that randomly searches, picks different strings matching W and threads them together to a new PWM

Expectation Maximization

• Find W so as to maximize Pr(S|W)

• The “Expectation-Maximization” (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)

Expectation Maximization

1. Start from a random motif (PWM)

2. Scan sequence for good matches to the current motif.

3. Build a new PWM out of these matches, and make it the new motif
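The three steps above can be sketched as a small local search. This is a simplified, hard-assignment variant written for illustration only: it assumes several input sequences with one site each, keeps only the single best match per sequence, and adds a pseudocount so no probability becomes zero. Full EM motif finders weight all candidate sites instead. All function names and parameters are assumptions made here.

```python
import random

BASES = "ACGT"

def random_pwm(width, rng):
    """Step 1: a random PWM, one probability column per motif position."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})
    return pwm

def site_prob(site, pwm):
    """Probability of a candidate site under the current PWM."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p

def best_site(seq, pwm):
    """Step 2: the best-matching window of one sequence
    (the sequence must be at least as long as the motif)."""
    width = len(pwm)
    starts = range(len(seq) - width + 1)
    best = max(starts, key=lambda i: site_prob(seq[i:i + width], pwm))
    return seq[best:best + width]

def build_pwm(sites, pseudocount=0.5):
    """Step 3: a new PWM from the column counts of the chosen sites."""
    pwm = []
    for k in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[k]] += 1
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def find_motif(sequences, width, max_iter=100, seed=0):
    """Repeat steps 2 and 3 until the chosen sites stop changing."""
    rng = random.Random(seed)
    pwm = random_pwm(width, rng)
    sites = None
    for _ in range(max_iter):
        new_sites = [best_site(s, pwm) for s in sequences]
        if new_sites == sites:  # converged (possibly to a local optimum)
            break
        sites = new_sites
        pwm = build_pwm(sites)
    return pwm, sites

pwm, sites = find_motif(["ACGTTGACATT", "TTGACAGGCA", "CCATTGACAA"], width=6)
print(sites)
```

Because the search only converges to a local optimum, such a procedure is typically restarted from several random PWMs and the best-scoring result is kept.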

The final PSSM represents the motif which is most enriched in the data

The PSSM can also be represented as a sequence logo

- A letter’s height indicates the information it contains

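Presenting a sequence motif as a logo

Aligned sequences:

TTCACG
TACATG
TACAGG
TACAAG

Counts of each base in each position:

    1  2  3  4  5  6
A   0  3  0  4  1  0
G   0  0  0  0  1  4
C   0  0  4  0  1  0
T   4  1  0  0  1  0

PWM (frequency of each base in each position):

    1     2  3  4     5  6
A   0  0.75  0  1  0.25  0
G   0     0  0  0  0.25  1
C   0     0  1  0  0.25  0
T   1  0.25  0  0  0.25  0

PSSM: divide each frequency by the background probability 0.25 to get the score S

Letter height = $\log_2 S$

T at position 1: $\log_2(1 / 0.25) = \log_2 4 = 2$

T at position 5: $\log_2(0.25 / 0.25) = \log_2 1 = 0$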
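The letter-height calculation can be reproduced with a short script. This is a sketch that follows the slide's definition (log2 of the frequency divided by a uniform background of 0.25); the function names are chosen here, and bases that never occur in a column are given a height of minus infinity as a convention of this sketch.

```python
import math

# The four aligned sites from the logo example.
sequences = ["TTCACG", "TACATG", "TACAGG", "TACAAG"]
BASES = "AGCT"
BACKGROUND = 0.25  # uniform background probability for DNA

def frequencies(seqs):
    """Frequency of each base in each position of the alignment."""
    return [
        {b: sum(s[k] == b for s in seqs) / len(seqs) for b in BASES}
        for k in range(len(seqs[0]))
    ]

def letter_heights(freqs, background=BACKGROUND):
    """log2(frequency / background) for each base in each position."""
    return [
        {b: (math.log2(f / background) if f > 0 else float("-inf"))
         for b, f in column.items()}
        for column in freqs
    ]

freqs = frequencies(sequences)
heights = letter_heights(freqs)
print(heights[0]["T"])  # T at position 1
print(heights[4]["T"])  # T at position 5
```

Other logo conventions scale letters by the information content of the whole column, so heights from different tools are not always directly comparable.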

Riddle

• What is the maximum height we can get in a logo that describes a motif obtained from protein sequences?

Are common motifs the right thing to search for?

Solutions:

- Search for motifs which are enriched in one set but not in a random set

- Use experimental information to rank the sequences according to their binding affinity, and search for motifs enriched at the top of the list

ChIP-Seq

Sequencing the regions in the genome to which a protein (e.g. a transcription factor) binds.

[Figure: ChIP-Seq sequences, from best binders to weak binders]

Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity

Ranked sequences list

Candidate k-mers: CTACGC, ACTTGA, ACGTGA, ACGTGC, CTGTGC, CTGTGA, CTGTAC, ATGTGC, ATGTGA, CTATGC

- A word search approach for finding an enriched motif (e.g. CTGTGA) in a ranked list

For a candidate motif (e.g. CTGTGA) in the ranked sequences list:

- The total number of input sequences
- The number of sequences containing the motif
- The number of sequences at the top of the list
- The number of sequences containing the motif among the top sequences

This approach uses the minimum hypergeometric (mHG) statistic to find enriched motifs

The enriched motifs are combined to get a PSSM which represents the binding motif
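A minimal sketch of the minimum hypergeometric score for one candidate motif, assuming the ranked list is given as a 0/1 vector (1 = the sequence contains the motif). The symbols N, B, n and b are names chosen here for the four quantities listed above: total sequences, sequences containing the motif, sequences at the top of the list, and motif-containing sequences among the top ones.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b): chance of seeing at least b motif-containing sequences
    among n drawn at random from N sequences, B of which contain the motif."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)

def mhg(occurrences):
    """Try every cutoff n for 'the top of the list' and keep the smallest tail.

    occurrences: list of 0/1 flags, ordered from best to weakest binder.
    Returns (minimum tail probability, cutoff n that achieves it).
    """
    N = len(occurrences)
    B = sum(occurrences)
    best_score, best_cutoff = 1.0, 0
    b = 0
    for n in range(1, N + 1):
        b += occurrences[n - 1]  # motif-containing sequences in the top n
        score = hypergeom_tail(N, B, n, b)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff

# Hypothetical ranked list: the motif occurs mostly among the best binders.
print(mhg([1, 1, 1, 0, 1, 0, 0, 0, 0, 0]))
```

Because the cutoff is optimised over all n, the minimum found this way is a score rather than a p-value; turning it into a p-value needs a correction for the multiple cutoffs tried.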

Protein Motifs

Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:

P[ED]XK[RW][RK]X[ED]

or as PWM
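A consensus written this way maps almost directly onto a regular expression. The sketch below assumes that X stands for any amino acid and that bracketed groups list the allowed residues; the function name and the test protein are hypothetical.

```python
import re

def find_sites(consensus, protein):
    """Return (0-based start, matched site) for every match of the consensus.

    'X' is translated to '.', which matches any single residue; bracketed
    groups such as [ED] are already valid regex character classes.
    The lookahead lets overlapping matches be reported as well.
    """
    regex = re.compile("(?=(" + consensus.replace("X", ".") + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein)]

# Hypothetical protein fragment containing one site.
print(find_sites("P[ED]XK[RW][RK]X[ED]", "MAPEAKRRGEL"))
```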
