Gene Recognition
Julien Favre
Agenda
• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details
• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity
• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion
PART 1 : Gene Structure
Gene Definition
What’s a Gene?
DNA
Transcription Process I
Transcription Process II
STEP 1
STEP 2
STEP 3
STEP 1 Transcription
ANIMATION
STEP 2 Processing
Capping and Poly-A
Splicing
STEP 3 Translation
ANIMATION
More details on Genes
[Figure: gene from 5’ to 3’; labels: Coding region, Non Coding Region, TATA box, Start Codon, End Codon, Beginning of the gene, Splice sites]
It differs from gene to gene!
PART 2 : Problem Definition
Situation
• Over 3,500 million nucleotides
• An estimated 35,000 to 50,000 genes
2 Important Questions:
1) Where are the genes?
2) What are the coding parts?
Why?
• Annotate and correct the DNA databases
• Link genes with the known proteins
• Understand gene functions
• Understand gene expression mechanisms
We can read the DNA alphabet, but we don’t know where the meaningful words are or what they mean.
Complexity I
ATTCGATGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGAATGCGTGCGTCGTGTTGCGTCGCGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGA
(the slide is filled with repetitions of this block)
3,500 million bases
Complexity II
[Figure: DNA strand; labels: Acceptor Sites, Donor Sites, Exons]
Number of parses = Fibonacci(n+m+1)
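To get a feel for how quickly the number of parses grows, the formula on this slide can simply be evaluated. The sketch below is an illustration only: it assumes the convention Fibonacci(1) = Fibonacci(2) = 1, takes n and m as the two counts that enter the formula, and uses the made-up helper names `fibonacci` and `number_of_parses`.

```python
def fibonacci(k: int) -> int:
    """k-th Fibonacci number, assuming fibonacci(1) == fibonacci(2) == 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def number_of_parses(n: int, m: int) -> int:
    """Evaluate the slide's formula Fibonacci(n + m + 1)."""
    return fibonacci(n + m + 1)


if __name__ == "__main__":
    # The count grows exponentially with the number of candidate sites.
    for sites in (5, 10, 20, 40):
        print(sites, sites, number_of_parses(sites, sites))
```

Even a few dozen candidate sites already give far too many parses to enumerate one by one, which is why the later slides turn to scoring and dynamic programming.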
PART 3 : Problem Approach
Approaches
3 Types of Approaches:
1. Single Gene Recognition
   – Functional signals detection: splice sites, promoter, Poly-A, …
2. Multiple Genes Recognition
3. Similarities
Single Gene Recognition: Principle
Functional Signals Detection
The main goal is to detect the beginning and the end of the exons or genes.
[Figure: gene from 5’ to 3’; labels: TATA box, Start Codon, End Codon, Splice sites]
Splicing Mechanism
• Consensus over the donor-acceptor site GU-AG (98%)
• Extremely reliable technique to detect exons
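As a rough illustration of the GU-AG rule, the sketch below lists every GT and AG dinucleotide in a DNA string (GU is written GT on the DNA side). The function name `candidate_splice_sites` and the example sequence are made up for this note. Each hit is only a candidate: the dinucleotides also occur by chance, so the positions still have to be scored by the methods on the following slides.

```python
def candidate_splice_sites(dna: str):
    """Return (donor_positions, acceptor_positions) as 0-based indices.

    A donor candidate is any 'GT', an acceptor candidate is any 'AG'.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("ATGGTAAGTCCAGGT")
    print("donor candidates:", donors)
    print("acceptor candidates:", acceptors)
```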
Single Gene Recognition: Methods
1. Combinatorial methods
   – Single block
2. Probabilistic methods
   – Simple
   – Markov based
3. Linear Discriminant methods
Consensus Sequence
Obtained by choosing the most frequent base at each position of the multiple alignment of the subsequences of interest.
Aligned sequences:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Consensus sequence: TATAAT
The same idea with words:
MELON
MANGO
HONEY
SWEET
COOKY
Consensus: MONEY
This leads to a loss of information and can produce many false positive or false negative predictions.
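The column-by-column vote can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `consensus` is made up, and ties are broken alphabetically, which is an assumption (the slide does not say how ties are resolved; position 4 of the DNA example has A and G three times each).

```python
from collections import Counter


def consensus(aligned):
    """Most frequent symbol in each column of equally long aligned strings."""
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Highest count first; ties broken alphabetically (assumption).
        best = min(counts, key=lambda sym: (-counts[sym], sym))
        result.append(best)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]))
    print(consensus(["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]))
```

The word example shows the loss of information: the consensus is not one of the five input words.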
Combinatorial Methods
Consensus sequence (e.g. TATA box)
For a consensus sequence of size L and for a position in the considered sequence, we compute:
1. P(L,k) = P(detect the consensus sequence with k mismatches)
2. $P(T) = \sum_{z=0}^{k} C_{F_l}^{T}\, P^{T}(L,z)\,\bigl(1 - P(L,z)\bigr)^{F_l - T}$, where $F_l$ is the number of possible positions in the considered sequence and T is the number of patterns detected in the given sequence
3. For a given $T_0$, define a threshold value for the detection.
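The detection step itself (finding the consensus with at most k mismatches) can be sketched as a sliding window. The names `count_mismatches` and `find_consensus` and the test sequence are made up for this note; the number of windows scanned, `len(sequence) - L + 1`, plays the role of the number of possible positions in the formula above.

```python
def count_mismatches(window: str, consensus: str) -> int:
    """Number of positions where the window differs from the consensus."""
    return sum(1 for a, b in zip(window, consensus) if a != b)


def find_consensus(sequence: str, consensus: str, k: int):
    """Return (position, mismatches) for every window with at most k mismatches."""
    size = len(consensus)
    hits = []
    for pos in range(len(sequence) - size + 1):
        mismatches = count_mismatches(sequence[pos:pos + size], consensus)
        if mismatches <= k:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence containing one exact and one near match.
    print(find_consensus("GGTATAATCCTATGATAA", "TATAAT", k=1))
```

Raising k finds more true sites but also more chance hits, which is the trade-off the threshold in step 3 is meant to control.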
Probabilistic Methods I
For a given consensus sequence, a weight matrix is computed by measuring the frequency of each base at each position of the site in a training set.
[Figure labels: GU, Acceptor site]
Matrix entries can be considered as probabilities.
Disadvantages:
– assumes independence between adjacent bases
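A weight matrix of this kind can be sketched as one frequency table per position. The training set below reuses the six aligned hexamers from the consensus slide purely as toy data, and the function name `weight_matrix` is made up for this note.

```python
BASES = "ACGT"


def weight_matrix(sites):
    """One dict per position, mapping each base to its frequency in the training set."""
    length = len(sites[0])
    matrix = []
    for i in range(length):
        column = [site[i] for site in sites]
        matrix.append({base: column.count(base) / len(column) for base in BASES})
    return matrix


if __name__ == "__main__":
    training_set = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
    for position, frequencies in enumerate(weight_matrix(training_set), start=1):
        print(position, frequencies)
```

Unlike the consensus string, the matrix keeps the information that position 4 is split evenly between A and G.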
Probabilistic Methods II
• Under the weight matrix model, the probability that a sequence X = (x1, x2, ..., xk) matches a site S is:
$P(X/S) = \prod_{i=1}^{k} p^{i}_{x_i}$
where $p^{i}_{x_i}$ is the weight matrix entry for base $x_i$ at position i.
If we introduce a measure of the form:
$LLR(X) = \log \frac{P(X/S)}{P(X/N)}$
where N stands for non-site sequences, then the more LLR(X) exceeds 0, the more likely the sequence is a functional signal.
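The log-likelihood ratio can be sketched as a sum of per-position log ratios. Several things here are assumptions made for the sake of a runnable example: P(X/N) is modelled as a uniform background of 0.25 per base, a pseudocount of 1 is added so that unseen bases do not give log(0), and the names `weight_matrix` and `llr` as well as the toy training set are made up.

```python
import math

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0):
    """Per-position base frequencies with a pseudocount (assumption) to avoid zeros."""
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix


def llr(sequence, matrix, background=None):
    """log(P(X/S) / P(X/N)) with independent positions in both models."""
    background = background or {b: 0.25 for b in BASES}  # uniform background (assumption)
    return sum(math.log(matrix[i][x] / background[x]) for i, x in enumerate(sequence))


if __name__ == "__main__":
    matrix = weight_matrix(["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"])
    for candidate in ("TATAAT", "GCGCGC"):
        print(candidate, round(llr(candidate, matrix), 3))
```

A candidate resembling the training sites gets a positive score, an unrelated one a negative score.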
Method improvements
• Two-block approach: P(L1,k1,L2,k2) and distance D1
• Multiple-nucleotide probabilities
• Neural network approach
• Reading frame
• Markov models
Markov Models
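The simple probabilistic (weight matrix) methods are 0-order Markov models.
Markov models introduce dependencies between adjacent bases.
The probability of observing a sequence now becomes:
$P(X/S) = p_0 \prod_{i=1}^{k} p^{i-1,i}_{x_i}$
where $p_0$ is the initial probability and $p^{i-1,i}_{x_i}$ is the probability of base $x_i$ at position i given the base at position i−1.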
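A position-specific first-order model can be sketched as follows. This is one possible reading of the formula, under stated assumptions: the initial term is the probability of the first base, the product runs over the remaining positions, a pseudocount of 1 smooths the counts, and the names `train_markov` and `sequence_probability` and the toy training set are made up.

```python
BASES = "ACGT"


def train_markov(sites, pseudocount=1.0):
    """Estimate the initial distribution and one transition table per position."""
    init = {b: pseudocount for b in BASES}
    for site in sites:
        init[site[0]] += 1
    total = sum(init.values())
    init = {b: count / total for b, count in init.items()}

    transitions = []  # transitions[i - 1][prev][cur] applies to position i
    for i in range(1, len(sites[0])):
        counts = {p: {c: pseudocount for c in BASES} for p in BASES}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        transitions.append(
            {p: {c: n / sum(row.values()) for c, n in row.items()}
             for p, row in counts.items()}
        )
    return init, transitions


def sequence_probability(sequence, init, transitions):
    """P(X/S) = p0(x1) times the product of p(x_i | x_{i-1}) over the later positions."""
    probability = init[sequence[0]]
    for i in range(1, len(sequence)):
        probability *= transitions[i - 1][sequence[i - 1]][sequence[i]]
    return probability


if __name__ == "__main__":
    init, transitions = train_markov(
        ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
    )
    for candidate in ("TATAAT", "GCGCGC"):
        print(candidate, sequence_probability(candidate, init, transitions))
```

The price of the extra dependencies is more parameters: 16 transition probabilities per position instead of 4 frequencies, so a larger training set is needed.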
Linear Discriminant methods I
Many functional signals are very short => exploit related characteristics
1. We build a vector of sequence characteristics (x1, …, xp)
2. We define $Z = \sum_{i=1}^{p} a_i x_i$; if Z > c, the sequence corresponds to a site
3. We use a training set to define {a_i} and c
4. The training set of « site sequences » defines a mean vector m1 and the « non-site sequences » a mean vector m2:
$a = s^{-1}(m_1 - m_2)$, $c = a\,(m_1 + m_2)/2$
where s is the pooled covariance matrix of the characteristics.
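The training step can be sketched in plain Python for the special case of two characteristics, where the matrix inverse is a closed formula. Everything concrete below is an assumption for illustration: the two features, the toy values and the helper names (`mean`, `pooled_covariance`, `invert_2x2`, `train`, `z_score`) are all made up; only the formulas for a and c come from the slide.

```python
def mean(rows):
    """Mean vector of a list of 2-component feature vectors."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]


def pooled_covariance(class1, class2):
    """2x2 pooled covariance matrix s of the two training classes."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (class1, class2):
        m = mean(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(class1) + len(class2) - 2
    return [[value / dof for value in row] for row in s]


def invert_2x2(s):
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]


def train(sites, non_sites):
    """Return (a, c) with a = s^-1 (m1 - m2) and c = a (m1 + m2) / 2."""
    m1, m2 = mean(sites), mean(non_sites)
    inv = invert_2x2(pooled_covariance(sites, non_sites))
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    c = sum(a[i] * (m1[i] + m2[i]) / 2 for i in range(2))
    return a, c


def z_score(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


if __name__ == "__main__":
    # Hypothetical (weight matrix score, base composition) pairs.
    sites = [(2.0, 1.0), (3.0, 2.0), (2.5, 1.0), (3.5, 2.5)]
    non_sites = [(0.0, 0.5), (1.0, 0.0), (0.5, 1.0), (1.0, 1.5)]
    a, c = train(sites, non_sites)
    for x in sites + non_sites:
        print(x, "site" if z_score(a, x) > c else "non-site")
```

With more than two characteristics the same formulas apply, but a general linear solver would replace `invert_2x2`.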
Linear Discriminant methods II
1. Choose a set of p characteristics
   – Score of the weight matrix
   – Distance to a predicted site
   – Base composition in distant sequence
   – …
2. Test the characteristics with the Mahalanobis distance:
$D^2 = (m_1 - m_2)\, s^{-1} (m_1 - m_2)$
3. Choose the set of q characteristics that maximizes $D^2$
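For a single characteristic the formula reduces to the squared difference of the two means divided by the pooled variance, which is enough to rank characteristics one by one. The sketch below covers only this one-dimensional case; the names `individual_d2` and `rank_characteristics`, the feature names and the toy values are assumptions made for illustration.

```python
def individual_d2(site_values, non_site_values):
    """Mahalanobis D^2 of one characteristic: (m1 - m2)^2 / pooled variance."""
    m1 = sum(site_values) / len(site_values)
    m2 = sum(non_site_values) / len(non_site_values)
    squared_deviations = (
        sum((v - m1) ** 2 for v in site_values)
        + sum((v - m2) ** 2 for v in non_site_values)
    )
    pooled_variance = squared_deviations / (len(site_values) + len(non_site_values) - 2)
    return (m1 - m2) ** 2 / pooled_variance


def rank_characteristics(sites, non_sites):
    """Sort characteristics by individual D^2, best separator first."""
    scores = {name: individual_d2(sites[name], non_sites[name]) for name in sites}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical measurements of two characteristics.
    sites = {"matrix_score": [5.0, 6.0, 7.0], "distance": [2.0, 3.0, 4.0]}
    non_sites = {"matrix_score": [1.0, 2.0, 3.0], "distance": [1.0, 2.0, 3.0]}
    print(rank_characteristics(sites, non_sites))
```

Choosing the best combined set in step 3 needs the full matrix form, because correlated characteristics add less than their individual values suggest.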
Linear Discriminant methods III: Example
Poly-A site
Chosen characteristics:
1. Score of the weight matrix of Poly-A
2. Score of the weight matrix of the GT element
3. Distance between Poly-A and GT
4. Nucleotide composition of the downstream region (6,100)
5. Nucleotide composition of the upstream region (-100,-1)
[Figure: Poly-A site region; labels: 5’, Last Exon, Stop codon, T-rich, Poly-A site CAATAAA(T/C), GT-Rich, Score of Poly-A, Score of GT, Distance between poly-A and GT]
Linear Discriminant methods IV: Example
Poly-A site: Mahalanobis distance of the five chosen characteristics (numbered as on the previous slide)
Characteristic    1      4      2      5      3
Individual D2     7.61   3.46   0.01   2.27   0.44
Composed D2       7.61   10.78  11.67  12.36  12.68
Multiple Genes approach
2 Approaches:
1. Discriminant Analysis, pattern based
   – FGENES
2. HMM, probabilistic approach
   – FGENESH
Discriminant Analysis
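Goal: detect the first and last exons in a long sequence
1. Find internal exons
2. Find last exons based on 3’ sites
3. Find first exons based on 5’ sites
4. Combine results
[Figure: the three exon types, with A = acceptor site and D = donor site]
Internal exon: Intron | A | Internal exon | D | Intron
Last exon: Intron | A | Last exon | Stop | 3’ site
First exon: 5’ site | ATG | First exon | D | Intron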
HMM method I
We want to use a Markov model to represent and recognize genes.
HMM method II
Real model:
[Figure: HMM state diagram with states N, P, 5’ UTR, Einit, E0, E1, E2, I0, I1, I2, Eterm, Esngl, 3’ UTR and polyA; the gene states appear twice in the diagram]
HMM method III
1. The model must be trained to compute:
   • State transition probabilities
   • Initial distribution
2. For a given sequence, we look for the best path using the Viterbi algorithm
3. We analyze the best path to determine if it could be a gene.
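The best-path search can be sketched on a deliberately tiny model. Everything about this model is hypothetical: two states, "E" (exon-like, GC-rich) and "I" (intron-like, AT-rich), with made-up initial, transition and emission probabilities. It is far smaller than the real model on the previous slide and is only meant to show the Viterbi recursion, computed in log space to avoid underflow.

```python
import math

# Hypothetical toy parameters, not taken from any trained gene model.
STATES = ("I", "E")
INIT = {"I": 0.5, "E": 0.5}
TRANS = {"I": {"I": 0.9, "E": 0.1}, "E": {"I": 0.1, "E": 0.9}}
EMIT = {
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(sequence: str) -> str:
    """Return the most probable state path, one state letter per base."""
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(INIT[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    back = []  # back[t - 1][s] is the best predecessor of state s at position t
    for base in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    dna = "ATATATGCGCGC"
    print(dna)
    print(viterbi(dna))
```

Step 3 then amounts to reading the state labels along the path and checking whether they form a plausible gene structure.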
Similarity methods
2 Goals:
1. Find out the functions of the genes
2. Improve algorithms
2 Main Methods:
1. EST based
2. BLAST against other species
Remarks
• The real challenge is gene recognition in long and complex sequences
• It is very difficult to measure the accuracy of the methods
• The databases contain many errors
Conclusion
• Best results are obtained by combining methods
  – HMM + EST + dynamic programming
• This problem is expected to be solved within a few years
• But huge challenges remain
  – Gene regulation
  – Alternative splicing
  – Gene expression
Questions And Remarks
Thanks for your attention