Special Topics in Genomics
Motif Analysis
Sequence motif – a pattern of nucleotide or amino acid sequences
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
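[Figure: a transcription factor (TF) bound to one site in each of the six sequences above; the aligned sites are listed below by position.]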
123456789
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
Transcription Factor Binding Sites (TFBS)
DNA motif:
Protein motif:
Motif representation
Consensus sequence
Example: CACSTG
Sequence Logo
Schneider & Stephens, Nucleic Acids Res. 18:6097-6100 (1990)
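Entropy (Shannon) – a measurement of uncertainty: $H = -\sum_{b} p_b \log_2 p_b$
The amount of uncertainty reduced by observing sequences is the amount of information (or information content) we obtained: $I = H_{\text{before}} - H_{\text{after}}$, which for DNA with a uniform prior over A, C, G, T is $2 - H_{\text{after}}$ bits per position.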
This is the height of each position in the logo plot.
Height of each nucleotide is proportional to its frequency
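The logo heights described above can be sketched in a few lines. This is a minimal illustration, assuming a uniform prior over A, C, G, T (so the maximum is 2 bits per position) and ignoring the small-sample correction used in the original logo paper; the function names are hypothetical and the input is the six aligned sites from the earlier slide.

```python
from math import log2

# The six aligned binding sites from the slide (positions 1-9).
SITES = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]

def column_frequencies(sites):
    """Return one {base: frequency} dict per alignment column."""
    n = len(sites)
    return [{b: col.count(b) / n for b in "ACGT"} for col in zip(*sites)]

def information_content(freqs):
    """Information (bits) at one position: 2 - H, assuming a uniform prior."""
    entropy = -sum(f * log2(f) for f in freqs.values() if f > 0)
    return 2.0 - entropy

def logo_heights(sites):
    """Per position, each letter's height: frequency times information content."""
    heights = []
    for freqs in column_frequencies(sites):
        ic = information_content(freqs)
        heights.append({b: f * ic for b, f in freqs.items()})
    return heights
```

A fully conserved position (such as position 1, always T) reaches the full 2 bits, while a position with mixed bases is shorter, and its letters split that height in proportion to their frequencies.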
Two questions in motif analysis
• Known motif mapping
Finding occurrences of a motif in nucleotide or amino acid sequences
• De novo motif discovery
Finding motifs that are previously unknown
Known motif mapping
• Consensus mapping
STEP 1: provide a motif (e.g. CACSTG = CAC[C,G]TG)
STEP 2: specify number of mismatches allowed (e.g. <=1)
STEP 3: scan the sequence
CGCCGGGACCAGATCAACGCCGAGATCCGGCACATGAAGGAGCT
m=3, no (e.g. CAGATC); m=1, yes (e.g. CACATG, one mismatch at the S position)
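The three steps above can be sketched as a simple sliding-window scan. This is an illustrative sketch, not CisGenome's implementation: the function names are hypothetical, only the forward strand is scanned, and the sequence is assumed to contain only uppercase A, C, G, T.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(window, motif):
    """Number of positions where the base is not allowed by the motif code."""
    return sum(base not in IUPAC[code] for base, code in zip(window, motif))

def scan_consensus(sequence, motif, max_mismatches=1):
    """Return (start, window, mismatches) for every window within the limit."""
    hits = []
    k = len(motif)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        m = count_mismatches(window, motif)
        if m <= max_mismatches:
            hits.append((start, window, m))
    return hits
```

Called on the example sequence with the motif CACSTG and at most one mismatch, the scan reports the window CACATG, whose only mismatch is the A at the S position.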
A useful tool: CisGenome (http://www.biostat.jhsph.edu/~hji/cisgenome)
Known motif mapping
• Motif matrix mapping (CisGenome)
STEP 1: provide a motif and background model
STEP 2: specify a likelihood ratio cutoff (e.g. LR>=500)
STEP 3: scan the sequence
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
LR>500, yes; LR<500, no

Background:
     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif:
     1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00
• Another tool for matrix mapping: MAST (http://meme.sdsc.edu/meme/mast-intro.html)
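A likelihood-ratio scan of this kind might look like the sketch below. It is not how CisGenome or MAST is implemented. It assumes the 4x4 background matrix is a first-order Markov transition matrix (rows are the previous base, columns the next base), that the first base of each window has probability 0.25, and that only the forward strand is scanned. All function names are hypothetical.

```python
# Motif matrix from the slide: one row per base, one column per position 1-9.
MOTIF = {
    "A": [0.00, 0.00, 0.17, 0.00, 0.17, 0.00, 0.00, 0.00, 0.17],
    "C": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.66],
    "G": [0.00, 1.00, 0.83, 1.00, 0.00, 1.00, 1.00, 0.00, 0.17],
    "T": [1.00, 0.00, 0.00, 0.00, 0.83, 0.00, 0.00, 1.00, 0.00],
}

# Background matrix from the slide, read here (an assumption) as first-order
# Markov transition probabilities: BACKGROUND[previous][next].
BACKGROUND = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def motif_probability(window):
    """Probability of the window under the position-specific motif matrix."""
    p = 1.0
    for pos, base in enumerate(window):
        p *= MOTIF[base][pos]
    return p

def background_probability(window):
    """Probability of the window under the assumed Markov background."""
    p = 0.25  # assumed uniform probability for the first base
    for prev, nxt in zip(window, window[1:]):
        p *= BACKGROUND[prev][nxt]
    return p

def scan_matrix(sequence, cutoff=500.0):
    """Return (start, window, LR) for every window with LR >= cutoff."""
    width = len(MOTIF["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        lr = motif_probability(window) / background_probability(window)
        if lr >= cutoff:
            hits.append((start, window, lr))
    return hits
```

Because the motif matrix contains exact zeros, any window with a base the matrix forbids gets LR = 0. In practice a small pseudocount is often added to the matrix so that near matches are penalized rather than excluded outright.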
De novo motif discovery
• Two major classes of methods:
1. Word enumeration
2. Matrix updating
Word enumeration
Example: Sinha & Tompa, Nucleic Acids Res. 30: 5549-5560 (2002)
STEP 1: enumerate possible words;
STEP 2: count word occurrences;
STEP 3: compare observed word count with random expectation.
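A toy version of these three steps is sketched below. It is a deliberate simplification, not the method of Sinha & Tompa: the random expectation here comes from an independent-base model estimated from the input itself, and the z-score uses a binomial approximation that ignores overlap between words. The function names are hypothetical and the input is assumed to contain only uppercase A, C, G, T.

```python
from collections import Counter
from itertools import product
from math import sqrt

def count_words(seqs, k):
    """STEP 2: count every overlapping k-mer across all sequences."""
    return Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))

def word_zscores(seqs, k):
    """STEP 3: z-score of observed vs expected count for each of the 4^k words."""
    counts = count_words(seqs, k)
    letters = Counter("".join(seqs))
    total = sum(letters.values())
    freq = {b: letters[b] / total for b in "ACGT"}
    n_windows = sum(max(len(s) - k + 1, 0) for s in seqs)
    scores = {}
    # STEP 1: enumerate all possible words of length k.
    for word in map("".join, product("ACGT", repeat=k)):
        p = 1.0
        for b in word:
            p *= freq[b]
        expected = n_windows * p
        sd = sqrt(n_windows * p * (1.0 - p))
        scores[word] = (counts[word] - expected) / sd if sd > 0 else 0.0
    return scores
```

Words with large positive scores occur more often than the background model predicts and are candidate motifs. Enumerating all 4^k words is only practical for short k.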
Matrix updating
• CONSENSUS (Stormo & Hartzell, PNAS, 86: 1183-1187, 1990)
STEP 1: use all k-mers in the first sequence as seeds;
STEP 2: find matches (often the best matches) of each seed in the second sequence;
STEP 3: update seed matrices and exclude matrices with low information content;
STEP 4: repeat steps 2 and 3 for all remaining sequences.
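The greedy procedure in these steps can be sketched as follows. This is a simplified illustration rather than the CONSENSUS program itself: it keeps only the single best match per seed in each sequence, scores alignments by uncorrected information content (2 bits maximum per position), and prunes to a hypothetical `keep` parameter. Every sequence is assumed to be at least k bases long.

```python
from math import log2

def information_content(sites):
    """Total information (bits) of an alignment of equal-length sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        entropy = 0.0
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:
                entropy -= f * log2(f)
        total += 2.0 - entropy
    return total

def greedy_consensus(seqs, k, keep=50):
    """Return (best alignment, its information content)."""
    first, rest = seqs[0], seqs[1:]
    # STEP 1: every k-mer in the first sequence seeds one alignment.
    alignments = [[first[i:i + k]] for i in range(len(first) - k + 1)]
    for seq in rest:
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        extended = []
        for sites in alignments:
            # STEP 2: the best match is the window that maximizes information.
            best = max(windows, key=lambda w: information_content(sites + [w]))
            extended.append(sites + [best])
        # STEP 3: keep only the most informative alignments.
        extended.sort(key=information_content, reverse=True)
        alignments = extended[:keep]
    best_alignment = alignments[0]
    return best_alignment, information_content(best_alignment)
```

Because the search is greedy, the result can depend on the order of the input sequences. Keeping more than one candidate per seed, as the slide's "often" suggests is possible, reduces that sensitivity at the cost of more computation.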
Matrix updating
• Mixture model
Parameters: Θ (motif matrix), θ0 (background model), W (motif width)
EM:
Lawrence and Reilly (1990)
Bailey and Elkan (1994), etc.
Gibbs Sampler:
Lawrence et al. (1993)
Liu (1994), Liu et al. (1995), etc.
S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: 000000000000001000000000000000000000000001000000000000000000000000000000
q = [q0, q1]
$f(A, \Theta, W, q \mid S, \theta_0) \propto f(S, A \mid \Theta, \theta_0, W, q)\, \pi(\Theta, W, q)$

Background:
     A    C    G    T
A   .3   .2   .2   .3
C   .2   .3   .3   .2
G   .2   .3   .3   .2
T   .3   .2   .2   .3

Motif:
     1     2     3     4     5     6     7     8     9
A  0.00  0.00  0.17  0.00  0.17  0.00  0.00  0.00  0.17
C  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.66
G  0.00  1.00  0.83  1.00  0.00  1.00  1.00  0.00  0.17
T  1.00  0.00  0.00  0.00  0.83  0.00  0.00  1.00  0.00

(Θ, W, q) ↔ A
Inference by iterative estimation/sampling
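One way to picture the iterative estimation is a small EM loop for a two-component mixture, in the spirit of the EM references above but much simplified. The sketch rests on several assumptions that are not in the slides: the motif width W is fixed to the length of a user-supplied seed, the background is fixed to independent uniform bases, overlapping windows are treated as independent, and the motif matrix is initialized from the seed with 0.7 on the seed base. The function and parameter names are hypothetical.

```python
# The six example sequences from the first slide.
SEQS = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]

def window_prob(word, theta):
    """Probability of a window under the motif matrix theta."""
    p = 1.0
    for j, base in enumerate(word):
        p *= theta[j][base]
    return p

def em_motif(seqs, seed, n_iter=5, q1=0.01, pseudo=0.1):
    """Return (theta, q1) after n_iter EM iterations started from the seed."""
    w = len(seed)
    windows = [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]
    bg = 0.25 ** w  # assumed fixed, uniform background probability per window
    theta = [{b: (0.7 if b == seed[j] else 0.1) for b in "ACGT"}
             for j in range(w)]
    for _ in range(n_iter):
        # E-step: posterior probability that each window is a motif site.
        z = []
        for x in windows:
            m = q1 * window_prob(x, theta)
            z.append(m / (m + (1.0 - q1) * bg))
        # M-step: re-estimate the motif matrix from the weighted windows.
        theta = []
        for j in range(w):
            col = {b: pseudo for b in "ACGT"}
            for x, zi in zip(windows, z):
                col[x[j]] += zi
            total = sum(col.values())
            theta.append({b: col[b] / total for b in "ACGT"})
        # Update the mixing proportion of the motif component.
        q1 = sum(z) / len(z)
    return theta, q1
```

For example, `em_motif(SEQS, "TGGGTGGTC")` refines the seed into a matrix whose columns reflect the aligned sites. A Gibbs sampler follows the same alternation but samples the site indicators A instead of averaging over them.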
Other issues
• Dependencies within motif
• Functions of novel motifs