Motif Finding Workshop Project
Chaim Linhart, January 2008
MF workshop 08 © Ron Shamir
Outline
1. Some background again…
2. The project
1. Background
Slides with Ron Shamir and Adi Akavia
Gene: from DNA to protein
DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein
DNA
• DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
  Forward/sense strand: AACTTGCG
  Reverse-complement/anti-sense strand: TTGAACGC (shown aligned under the forward strand, i.e. written 3’ to 5’; read 5’ to 3’ it is CGCAAGTT)
• Directional: from 5’ to 3’:
  5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
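A two-strand motif search needs the reverse complement of each sequence or motif. The following is a minimal sketch of that operation, not part of the supplied packages; the class and method names (`DnaStrand`, `complement`, `reverseComplement`) are hypothetical, and treating N as its own complement is an assumption.

```java
/** Hypothetical helper, for illustration only. */
final class DnaStrand {
    /** Complement of one base (A-T, C-G). N is assumed to stay N. Case is preserved. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            case 'N': case 'n': return base;
            default:
                throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    /** Reverse complement of seq, written 5' to 3'. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        // Walk the forward strand backwards and complement each base.
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand AACTTGCG, `DnaStrand.reverseComplement` gives CGCAAGTT, which is the slide's TTGAACGC read from its 5’ end.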
Gene structure (eukaryotes)
[Figure: gene structure and the steps of gene expression]
• DNA (coding strand): promoter, transcription start site (TSS)
• Transcription (RNA polymerase) → Pre-mRNA: exon, intron, exon
• Splicing (spliceosome) → Mature mRNA: 5’ UTR, start codon, coding region, stop codon, 3’ UTR
• Translation (ribosome) → Protein
Translation
• Codon - a triplet of bases, codes a specific amino acid (except the stop codons); many-to-1 relation
• Stop codons - signal termination of the protein synthesis process
http://ntri.tamuk.edu/cell/ribosomes.html
Genome sequences
• Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
• Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs), only ~3% are genes
– ~25,000 genes
• Yeast:
– 16 chromosomes
– 20 Mbps
– 6,500 genes
Regulation of Expression
• Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks
• Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, physiological conditions
• Main regulatory mechanism – transcriptional regulation
Transcriptional regulation
• Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS)
• BSs of a particular TF share a common pattern, or motif
• Some TFs operate together – TF modules
[Figure: two TFs bound to two BSs on the DNA (5’ to 3’), with the TSS and the gene]
TFBS motif models
• Consensus (“degenerate”) string:
  [Figure: genes 1-10, with the binding sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT marked in their promoters]
  Consensus: [AC]ACT[CG]T
• Statistical models…
• Motif logo representation
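The consensus-string model can be illustrated by deriving a degenerate string from a set of aligned, equal-length binding sites. This is a minimal sketch, assuming bracket notation for positions where the sites disagree; the class and method names are hypothetical and are not taken from the workshop's supplied packages.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper, for illustration only. */
final class ConsensusString {
    /**
     * Builds a degenerate consensus from equal-length sites: a single letter
     * where all sites agree, otherwise the sorted set of observed letters
     * in brackets, e.g. [AC].
     */
    static String consensus(List<String> sites) {
        if (sites.isEmpty()) {
            throw new IllegalArgumentException("No sites given");
        }
        int length = sites.get(0).length();
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < length; pos++) {
            // TreeSet keeps the observed letters sorted and without duplicates.
            TreeSet<Character> letters = new TreeSet<>();
            for (String site : sites) {
                if (site.length() != length) {
                    throw new IllegalArgumentException("Sites must have equal length");
                }
                letters.add(Character.toUpperCase(site.charAt(pos)));
            }
            if (letters.size() == 1) {
                out.append(letters.first().charValue());
            } else {
                out.append('[');
                for (char c : letters) {
                    out.append(c);
                }
                out.append(']');
            }
        }
        return out.toString();
    }
}
```

Applied to the five sites on the slide (AACTGT, CACTGT, CACTCT, CACTGT, AACTGT), this yields `[AC]ACT[CG]T`.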
Human G2+M cell-cycle genes: The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
TFs: NF-Y , CHR
BSs are short, non-specific, hiding in both strands and at various locations along the promoters
The computational challenge
• Given a set of co-regulated genes (e.g., from gene expression chips)
• Find a motif that is over-represented (occurs unusually often) in their promoters
• This may be the TF binding site motif
• Find TF modules – over-represented motifs that tend to co-occur
The computational challenge (II)
• Motifs can also be found without a given target-set – “genome-wide”
• Find a motif that is localized - occurs more often near the TSS of genes
• Find a motif with a strand bias – occurs more often on the genes’ coding strand
• Find TF modules with biases in their order / orientation / distance
Motif finding algorithms
• >100 motif finding algs
• Main differences between them:
– Type of analysis & input:
  • Target-set vs. genome-wide
  • Single vs. multi-species (conservation)
  • Single motifs vs. modules
– Motif model
– Score for evaluating motif
– Motif search technique:
  • Combinatorial (enumeration) vs. Statistical optimization
Example - Amadeus
Over-represented motifs in the promoters of genes expressed in the G2 and G2/M phases of the human cell cycle:
[Figure: Amadeus output showing the CHR and NF-Y motifs]
General goals
• Develop software from A-Z:
– Design
– Implementation
– (Optimization)
– Execution & analysis of real data
• A taste of bioinformatics
• Have fun
• Get credit…
The computational task
• Given a set of DNA sequences
• Find “interesting” pairs of motifs:
– Order bias
– Other scores…
• Main challenges:
– Performance (time, memory)
– Output redundancy
Input
File with DNA sequences in “fasta” format:
>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANNCGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgctagcggatnggctannnnnatctag
Input (II)
• Ignore the header lines
• Sequence may span multiple lines or one long line
• Sequence contains the characters A,C,G,T,N in upper or lower case
• “N” means unknown or masked base
• Sample input files will be supplied
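The input rules above can be sketched as a small parser. This is only an illustration under stated assumptions: the class and method names are hypothetical, the input is passed as one in-memory string rather than read from a file, and normalizing every base to upper case is a design choice, not a requirement of the project.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class FastaParser {
    /**
     * Returns one upper-case sequence per ">" header. Header text is ignored,
     * and a sequence may span one line or several.
     */
    static List<String> parse(String fastaText) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null; // sequence being assembled, null before the first header
        for (String rawLine : fastaText.split("\r?\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A new header closes the previous sequence.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (current == null) {
                    throw new IllegalArgumentException("Sequence data before first header");
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = Character.toUpperCase(line.charAt(i));
                    if ("ACGTN".indexOf(c) < 0) {
                        throw new IllegalArgumentException("Unexpected character: " + c);
                    }
                    current.append(c);
                }
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

Keeping N in the stored sequence (rather than dropping it) preserves positions, which matters once distances between motifs are measured.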
Input (III)
• Search parameters:
– Length of motifs (between 5-10)
– Min. + Max. distance between the motifs:
  ACGGATTGATNNNTGGATGCCAT distance=9
– Single vs. two strands search
– Min. number of occurrences (hits) of pair (don’t count overlaps, e.g. AAAAAA):
  GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA hit hit hit
– Max. p-value
– Additional parameters…
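One possible reading of the distance and overlap parameters is sketched below. Everything here is an assumption rather than the project's definition: the distance is taken to be the gap between the end of the first motif and the start of the second, each occurrence of the first motif contributes at most one hit, and overlapping occurrences of the first motif are skipped. The class and method names are hypothetical.

```java
/** Hypothetical helper, for illustration only. */
final class PairHits {
    /**
     * Counts ordered hits "first ... second" in one sequence, where the gap
     * between the end of first and the start of second lies in [minGap, maxGap].
     * Assumes 0 <= minGap <= maxGap.
     */
    static int countOrderedHits(String seq, String first, String second,
                                int minGap, int maxGap) {
        int hits = 0;
        int from = 0;
        int i;
        while ((i = seq.indexOf(first, from)) >= 0) {
            int endOfFirst = i + first.length();
            boolean found = false;
            // Try every allowed gap; startsWith returns false past the end of seq.
            for (int gap = minGap; gap <= maxGap && !found; gap++) {
                found = seq.startsWith(second, endOfFirst + gap);
            }
            if (found) {
                hits++;
            }
            // Resume after this occurrence, so overlapping copies are not counted.
            from = endOfFirst;
        }
        return hits;
    }
}
```

If the two motifs in the slide's distance example are taken to be ACGGA and GGATG (the original highlighting is lost, so this is a guess), the gap between them in ACGGATTGATNNNTGGATGCCAT is 9 bases, and the call `PairHits.countOrderedHits(seq, "ACGGA", "GGATG", 9, 9)` counts one hit.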
Output
A. A list of the string pairs with the best order-bias score (smallest p-values):
Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12
B. A non-redundant list of motif pairs (motif = consensus string): logos, # of hits, additional scores
Part A: String pairs with order bias
• nA = # of A→B ; nB = # of B→A
• WLOG, nA > nB
• n = nA + nB
• H0 = random order: nA ~ B(n, 0.5)
• p-value = prob for at least nA occurrences of A→B = tail of B(n, 0.5)
• Normal approximation (central limit thm.)
• Fix for multiple testing: ×2
$$\text{Binomial tail}(n, p, k) = \sum_{j=k}^{n} \binom{n}{j} p^{j} (1-p)^{n-j}$$
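The binomial tail above can be evaluated exactly for moderate n by summing the terms in log space. The sketch below is one way to do that and is not the supplied statistics package; the names `OrderBias`, `binomialTail` and `orderBiasPValue` are hypothetical, and capping the doubled p-value at 1 is an assumption.

```java
/** Hypothetical helper, for illustration only. */
final class OrderBias {
    /** Returns lf with lf[i] = ln(i!) for i = 0..n. */
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    /** Sum over j = k..n of C(n, j) p^j (1-p)^(n-j), for 0 < p < 1. */
    static double binomialTail(int n, double p, int k) {
        if (n < 0 || !(p > 0.0 && p < 1.0)) {
            throw new IllegalArgumentException("Need n >= 0 and 0 < p < 1");
        }
        double[] lf = logFactorials(n);
        double logP = Math.log(p);
        double logQ = Math.log1p(-p);
        double sum = 0.0;
        for (int j = Math.max(k, 0); j <= n; j++) {
            // ln of one term, exponentiated only at the end to avoid overflow.
            double logTerm = lf[n] - lf[j] - lf[n - j] + j * logP + (n - j) * logQ;
            sum += Math.exp(logTerm);
        }
        return Math.min(1.0, sum);
    }

    /** Order-bias p-value: tail of B(n, 0.5) at the larger count, doubled and capped at 1. */
    static double orderBiasPValue(int nAB, int nBA) {
        int n = nAB + nBA;
        int k = Math.max(nAB, nBA);
        return Math.min(1.0, 2.0 * binomialTail(n, 0.5, k));
    }
}
```

For very large n the normal approximation mentioned on the slide is cheaper; the exact sum is mainly useful as a reference when checking an approximation.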
Part B: Non-redundant list of motif pairs
• Collect similar strings to motif with better score (motif = consensus):
String pair      p-value
ACGTT , GGATT    4.3E-15
ACGAT , GGATT    2.4E-11
AGGAT , GGTTT    1.7E-5
AGGTT , GGTTT    5.9E-5
Motif pair: [motif logos not recoverable], p-value 8.1E-31
• Don’t report similar motif pairs:
– Motifs that consist of similar strings
– Motif pairs that are small shifts of one another
– Palindromes
Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A
sB = # of sequences that contain motif B
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = prob for at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
$$\text{HG tail}(N, s_A, s_B, s_{AB}) = \sum_{i=s_{AB}}^{\min(s_A, s_B)} \frac{\binom{s_A}{i}\binom{N-s_A}{s_B-i}}{\binom{N}{s_B}}$$
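The hypergeometric tail for the co-occurrence score can be computed directly from its definition. The sketch below assumes the standard form of the tail (sum over i from sAB to min(sA, sB) of C(sA, i) C(N - sA, sB - i) / C(N, sB)); the class and method names are hypothetical and do not come from the supplied statistics package.

```java
/** Hypothetical helper, for illustration only. */
final class CoOccurrence {
    /** Returns lf with lf[i] = ln(i!) for i = 0..n. */
    static double[] logFactorials(int n) {
        double[] lf = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    /** ln C(n, k), assuming 0 <= k <= n <= lf.length - 1. */
    static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    /** Probability of at least sAB joint occurrences under independence. */
    static double hgTail(int total, int sA, int sB, int sAB) {
        if (sA < 0 || sB < 0 || sA > total || sB > total) {
            throw new IllegalArgumentException("Need 0 <= sA, sB <= total");
        }
        double[] lf = logFactorials(total);
        // Terms outside [max(0, sA + sB - total), min(sA, sB)] are zero.
        int lo = Math.max(Math.max(sAB, 0), sA + sB - total);
        int hi = Math.min(sA, sB);
        double logDenominator = logChoose(lf, total, sB);
        double sum = 0.0;
        for (int i = lo; i <= hi; i++) {
            sum += Math.exp(logChoose(lf, sA, i)
                    + logChoose(lf, total - sA, sB - i)
                    - logDenominator);
        }
        return Math.min(1.0, sum);
    }
}
```

As a small check, with N = 10 sequences, sA = sB = 5 and sAB = 5, the only term is 1 / C(10, 5) = 1/252.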
Part B (cont.): Additional score
Option II: Distance bias
Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H0), or are they highly variable?
Other options??
Implementation
• Java (Eclipse) ; Linux
• GUI: Simple graphical user interface for supplying the input parameters and reporting the results
• Packages for motif logo and statistical scores will be supplied
• Time performance will be measured only for part A
• Reasonable documentation
• Separate packages for data-structures, scores, GUI, I/O, etc.
Design document
• Due in 3 weeks (Feb 24)
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and output of program
• Describe main data structures, algorithms, and scores for parts A+B
• Meet with me before submission