Building a dictionary for genomes
By Harmen J. Bussemaker, Hao Li, and Eric D. Siggia
Tal Frank
Topics that will be discussed
Biological background
Present the biological problem
Show an algorithm that treats this problem: statistical mechanics methods
Try our algorithm on two well-known problems
What we did so far
Human Genome Project (2001)
This article (published 2000): sequence is not everything. Let's do some theory.
Control over gene expression: when, how much
Control element = Regulator = Sequence motif
Genes work together = Co-regulated genes
The goals of this work
Identify the control elements
Where are they located?
Identify co-regulated genes
Multiple control elements
Example: where are the control elements located?
Concepts: directionality, upstream, in the junk
TATACGACGAXXTTCGATTCGA
Example: co-regulated genes
Naïve approach: TATACGACGAXXTTTTTATAAAYYATGGCA
Experimentally: TATACGACGAXXTTTTCGCGAAYYATGGCA
To activate a set of genes: multiple sequences needed
New terminology
DNA = string of letters
Control element = word
Multiple control elements = sentence
Genes and junk = background noise
Example: S: …GAGAGCGCXXTGTGGGYYGCTT…
words = {GA,TG}
sentence = GA.TG
background = genes and junk.
MobyDick algorithm
Decipher a "text" consisting of a long string of letters written in an unknown language.
Find the words in the text
Find the right spacing
Example: D={A,T,AT} S=ATT
P1=A.T.T
P2=AT.T
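To make the two spacings P1 and P2 concrete, here is a minimal Python sketch that lists every way to cut a string into dictionary words. The function name `parses` is made up for this illustration, words are assumed to be non-empty, and the brute-force recursion is only meant for tiny examples like this one.

```python
def parses(s, dictionary):
    """Yield every way to split s into words taken from dictionary."""
    if not s:
        yield []          # the empty string has exactly one (empty) spacing
        return
    for w in dictionary:
        if s.startswith(w):
            # w fits at the front; space the remainder recursively
            for rest in parses(s[len(w):], dictionary):
                yield [w] + rest


if __name__ == "__main__":
    for p in parses("ATT", ["A", "T", "AT"]):
        print(".".join(p))   # prints A.T.T and AT.T
```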
How would you do it?
1. Look for repeated substrings in the string:
Tal went to Weizmann this morning. When he arrived he didn’t go to his office, he went to drink a cup of coffee ….
D (dictionary) = {went, to, he}
2. Space the text. Oops, spacing is not that simple.
e.g. D={A,T,AT} S=ATT
P1=A.T.T (probability p1)
P2=AT.T (probability p2)
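Step 1 can be tried directly on the example sentence once the spaces and punctuation are removed. The sketch below simply counts every substring of a few lengths and keeps the ones that repeat. The function name and the length limits are assumptions chosen for this illustration. Besides went, to and he, it also reports many fragments that are not words, which is one reason the naive approach is not enough.

```python
from collections import Counter


def repeated_substrings(s, min_len=2, max_len=4, min_count=2):
    """Count all substrings of length min_len..max_len; keep the repeated ones."""
    counts = Counter(
        s[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(s) - n + 1)
    )
    return {sub: c for sub, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    text = ("TalwenttoWeizmannthismorningWhenhearrivedhedidntgotohisoffice"
            "hewenttodrinkacupofcoffee")
    repeated = repeated_substrings(text)
    for word in ("went", "to", "he"):
        print(word, repeated.get(word))
    print(len(repeated), "repeated substrings in total")
```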
MobyDick Blueprints
S=TAGATAT
1-letter words: D={T,A,G}, find pw = {pA,pT,..}
2-letter words: D={A,TA,…}, find pw = {pA,pTA,..}
No more optional words: stop!
Find spacing: S=TA.G.A.TA.T
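One simple way to carry out the last box, "find spacing", is to pick the single most probable spacing with a dynamic-programming pass over the string. This is only a sketch under assumptions: the slides do not say that this is how the spacing is chosen, and the dictionary and probabilities below are invented so that the example reproduces TA.G.A.TA.T.

```python
def best_parse(s, p):
    """Return (probability, words) of the most probable spacing of s.

    p maps each dictionary word to its probability p_w.
    """
    # best[i] holds the best spacing found so far for the prefix s[:i]
    best = [(1.0, [])] + [(0.0, None)] * len(s)
    for i in range(1, len(s) + 1):
        for w, pw in p.items():
            j = i - len(w)
            if j >= 0 and best[j][1] is not None and s[j:i] == w:
                cand = best[j][0] * pw
                if cand > best[i][0]:
                    best[i] = (cand, best[j][1] + [w])
    return best[len(s)]


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only for illustration.
    p = {"T": 0.2, "A": 0.2, "G": 0.2, "TA": 0.4}
    prob, words = best_parse("TAGATAT", p)
    print(".".join(words), prob)
```

If the string cannot be spaced with the given dictionary, the function returns `(0.0, None)`.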
Statistical mechanics in order to ...?
1. How does MobyDick decide {pw}?
2. When does MobyDick add a new word?
3. Space (parse) the text.
The likelihood function
$$Z = \sum_k \prod_w p_w^{N_w(k)}$$
k: a possible spacing
$N_w(k)$: number of times the word w appears
Example: D=(T,AT,A) S=TATA
k1=T.A.T.A
k2=T.AT.A
$$Z = p_A^2 p_T^2 + p_A p_T p_{AT}$$
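The sum over spacings can be checked by brute force for S=TATA. The sketch below enumerates every spacing and adds up the products of word probabilities. The numerical values of pT, pA and pAT are assumptions picked for illustration, and the helper names are not from the slides.

```python
from math import prod  # available from Python 3.8


def parses(s, words):
    """Yield every spacing of s into words (brute force, small inputs only)."""
    if not s:
        yield []
        return
    for w in words:
        if s.startswith(w):
            for rest in parses(s[len(w):], words):
                yield [w] + rest


def likelihood(s, p):
    """Z = sum over spacings k of the product of p_w over the words in k."""
    return sum(prod(p[w] for w in k) for k in parses(s, list(p)))


if __name__ == "__main__":
    p = {"T": 0.4, "A": 0.4, "AT": 0.2}   # hypothetical values
    z = likelihood("TATA", p)
    by_hand = p["A"] ** 2 * p["T"] ** 2 + p["A"] * p["T"] * p["AT"]
    print(z, by_hand)   # the two numbers agree up to rounding
```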
Likelihood function - intuition
Z(D,{pw}) - partition function: <E>,<N>,<T>,….
Z(D,{pw}) - the probability to obtain a sequence S.
Example: D=(T,AT,A) {pT,pA,pAT}
Question: what is the probability to obtain S=TATA?
1st possibility: T.A.T.A  pA*pA*pT*pT
2nd possibility: T.AT.A  pT*pAT*pA
$$Z = p_A^2 p_T^2 + p_A p_T p_{AT}$$
Finding {pw}
Given: D, S
Maximize Z({pw},D) with respect to {pw}
This {pw} gives the highest probability to get the given S
Let's find the {pw}!
Definition: $\langle N_w \rangle$ - average number of the word w over the different spacings:
$$\langle N_w \rangle = p_w \frac{\partial}{\partial p_w} \ln Z$$
Can prove: to maximize Z, solve:
$$p_w = \frac{\langle N_w \rangle}{\sum_{w'} \langle N_{w'} \rangle}$$
Solving is done by iteration:
$$p_w \rightarrow \langle N_w \rangle \rightarrow p_w' \rightarrow \langle N_w \rangle' \rightarrow \dots$$
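The iteration can be sketched in a few lines: compute the average word counts from the current probabilities, turn them into new probabilities, and stop when Z no longer increases. This is a minimal illustration under assumptions: the averages are computed by listing all spacings (fine for S=TATA, hopeless for a genome), the starting probabilities are invented, S is assumed to have at least one spacing, and the function names are not from the slides.

```python
from collections import Counter
from math import prod


def parses(s, words):
    """Yield every spacing of s into words (brute force, small inputs only)."""
    if not s:
        yield []
        return
    for w in words:
        if s.startswith(w):
            for rest in parses(s[len(w):], words):
                yield [w] + rest


def expected_counts(s, p):
    """Return Z and the average count <N_w> of each word over all spacings."""
    z = 0.0
    weighted = Counter()
    for k in parses(s, list(p)):
        weight = prod(p[w] for w in k)   # contribution of spacing k to Z
        z += weight
        for w in k:
            weighted[w] += weight
    return z, {w: weighted[w] / z for w in p}


def reestimate(s, p, max_iter=100, tol=1e-12):
    """Iterate p_w = <N_w> / sum_w' <N_w'> until Z stops increasing."""
    z, n = expected_counts(s, p)
    for _ in range(max_iter):
        total = sum(n.values())
        new_p = {w: n[w] / total for w in p}
        new_z, new_n = expected_counts(s, new_p)
        if new_z - z <= tol:             # no real improvement: stop
            break
        p, z, n = new_p, new_z, new_n
    return p, z


if __name__ == "__main__":
    start = {"T": 0.4, "A": 0.4, "AT": 0.2}   # hypothetical starting point
    p, z = reestimate("TATA", start)
    print(p, z)
```

For this tiny string the probability of AT shrinks with every step, because the spacing T.A.T.A alone explains the data better.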
Enough is enough!!!
When is pw good enough?
When the new {pw} don’t give a higher Z.
We say: this method converges!
Other methods don’t converge.
Why find {pw} this way?
Monte Carlo methods don’t converge.
A slow method can be transformed into a fast method.
Order of complexity: O(LDl)
L - the length of the string
D - the size of the dictionary
l - the length of the longest word in D
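A sketch of how the sum over all spacings can be computed without listing them: build Z for ever longer prefixes of the string, trying each dictionary word as the last word of the prefix. The loop structure (positions, words, letter comparison) is what gives a cost of the order L·D·l. The slides only state the complexity, so the recursion below is an assumption about how it could be reached, and the probabilities are invented.

```python
def likelihood_dp(s, p):
    """Compute Z by a forward recursion over prefixes of s.

    z[i] is the summed weight of all spacings of the prefix s[:i].
    """
    z = [0.0] * (len(s) + 1)
    z[0] = 1.0                       # the empty prefix has one empty spacing
    for i in range(1, len(s) + 1):   # L positions
        for w, pw in p.items():      # D dictionary words
            start = i - len(w)
            # comparing w against the text costs at most l letter checks
            if start >= 0 and s.startswith(w, start):
                z[i] += pw * z[start]
    return z[len(s)]


if __name__ == "__main__":
    p = {"T": 0.4, "A": 0.4, "AT": 0.2}   # hypothetical values
    print(likelihood_dp("TATA", p))
```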
Add new words?
Look at dictionary: D={T,A,C,G} S=TATTGA
Compose new word ww’: ww’=TA
Check occurrence: $N[ww'] > \langle N \rangle \, p_w \, p_{w'}$ ?
Yes - add to dictionary: D={T,A,C,G,TA} S=TATTGA
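The occurrence check can be sketched for the simplest case, a dictionary of single letters, where the text has only one spacing. The code counts each adjacent letter pair and compares it with the number expected if letters were independent. Several things here are assumptions: letter probabilities are taken as plain frequencies, the expected count uses the L-1 adjacent positions rather than the number of words (the difference is negligible for long texts), and a plain "greater than" replaces any proper significance test.

```python
from collections import Counter


def overrepresented_pairs(s):
    """Return letter pairs seen more often in s than independence predicts."""
    n = len(s)
    p = {c: k / n for c, k in Counter(s).items()}       # single-letter p_w
    observed = Counter(s[i:i + 2] for i in range(n - 1))
    result = {}
    for pair, count in observed.items():
        expected = (n - 1) * p[pair[0]] * p[pair[1]]
        if count > expected:
            result[pair] = (count, expected)
    return result


if __name__ == "__main__":
    for pair, (count, expected) in overrepresented_pairs("TATTGA").items():
        print(pair, count, round(expected, 3))
```

On a six-letter string the comparison has no statistical weight; it only shows the mechanics of the test.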
A problem and a bad solution
The algorithm finds only words that are composed of words already in the dictionary.
Example: S=AATATAAA
1st step: S=AATATAAA D={A}
2nd step: S=A.AT.AT.AAA; AT is not a composition of words
Solution: look for repeated long strings, taking the specific problem into consideration
Spacing
Define: $\Xi_w$ - number of times the string w occurs in the text S.
Quality factor:
$$Q_w = \frac{\langle N_w \rangle}{\Xi_w}$$
The required condition:
$$Q_w \approx 1$$
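A small sketch of a quality factor for each word, under an assumption about the definition: the average number of times w is used as a word over all spacings, divided by the number of places where the letters of w match the text at all. A value near 1 then means that almost every match of w is really parsed as w. The averages are computed by brute force, the probabilities are invented, and the function names are made up for this illustration.

```python
from collections import Counter
from math import prod


def parses(s, words):
    """Yield every spacing of s into words (brute force, small inputs only)."""
    if not s:
        yield []
        return
    for w in words:
        if s.startswith(w):
            for rest in parses(s[len(w):], words):
                yield [w] + rest


def quality_factors(s, p):
    """Return Q_w = <N_w> / (number of matches of w in s) for each word."""
    z = 0.0
    weighted = Counter()
    for k in parses(s, list(p)):
        weight = prod(p[w] for w in k)
        z += weight
        for w in k:
            weighted[w] += weight
    q = {}
    for w in p:
        # count every position where w matches, overlaps included
        matches = sum(s.startswith(w, i) for i in range(len(s) - len(w) + 1))
        if matches:
            q[w] = (weighted[w] / z) / matches
    return q


if __name__ == "__main__":
    p = {"T": 0.4, "A": 0.4, "AT": 0.2}   # hypothetical values
    print(quality_factors("TATA", p))
```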
Checking the algorithm
Applying it to the English novel Moby Dick
Applying it to control elements in the yeast genome
Not always possible: Voynich manuscript (1450)
Preparing the book Moby Dick
Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular tothought I would sail …
CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavinglittleornomoneyinmypurseandnothingparticulartothoughtIwouldsail..
CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlon EciselyhavinglittlermsdrornomoneyinmypurseandnothingparticuartothoughtIwouldsail …
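The preparation shown above (remove spaces and punctuation, then scatter random letters between the words as background) can be imitated with a short sketch. The noise probability, the maximum noise length and the use of lowercase letters are assumptions made here for illustration; the slides do not give the actual recipe.

```python
import random
import re
import string


def prepare(text, noise_prob=0.3, max_noise=7, seed=0):
    """Return (text without spaces/punctuation, same text with random noise)."""
    rng = random.Random(seed)                 # fixed seed: repeatable output
    words = re.findall(r"[A-Za-z]+", text)    # keep letters only
    pieces = []
    for word in words:
        pieces.append(word)
        if rng.random() < noise_prob:
            k = rng.randint(1, max_noise)
            noise = "".join(rng.choice(string.ascii_lowercase) for _ in range(k))
            pieces.append(noise)              # background between two words
    return "".join(words), "".join(pieces)


if __name__ == "__main__":
    plain, noisy = prepare("Call me Ishmael. Some years ago")
    print(plain)
    print(noisy)
```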
Results - MobyDick
First 10 chapters
D={a,b,c,….}
Text: 4,214 unique words
2,630 occurred only once
Background: increases L by a factor of 3.
2,450 words found, 700 in English, 40 composite words.
Results - yeast
D={T,A,C,G}
Text: 443 experimentally determined sites
Background: genes and junk
500 words found
114 match the experimentally determined sites
Not that good, but it is a beginning!
The end