bioinformatics gene detection and prediction gene predictions in prokaryotes gene predictions in...

Post on 21-Dec-2015


TRANSCRIPT

  • Slide 1
  • Bioinformatics, Lecture 11: Gene detection and prediction. Topics: gene prediction in prokaryotes; gene prediction in eukaryotes; difficulties of gene prediction; statistical measures of prediction accuracy.
  • Slide 2
  • Gene detection and prediction. Detecting genes in genomic DNA has been a challenging problem since the early 1980s. Intensive sequencing of large genomic regions and entire genomes has made gene prediction even more important in the last few years. The ever-growing number of gene prediction programs is the best demonstration of the effort invested. It is also an indication of significant difficulties, caused by the tremendous variety of gene signals. The accuracy of modern gene prediction programs varies broadly, from 0.45 to 0.9, depending on gene type, species, software, etc.
  • Slide 3
  • Gene prediction in prokaryotic genomes. Gene prediction in prokaryotes is generally easier than in eukaryotes, mainly because prokaryotic genes usually lack introns. Another reason is the more conserved structure of the promoter region. Fewer patterns to recognise, the more conserved nature of those patterns, and a single ORF per gene make the objective more achievable. As a result, gene prediction in prokaryotes is more reliable.
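Because an intron-free gene is a single uninterrupted ORF, the simplest form of the search can be sketched in a few lines. The function below is only an illustration, not the method of any particular program: the name `find_orfs`, the `min_codons` threshold, and the choice to scan only the forward strand are assumptions made for brevity.

```python
# Illustrative sketch: scan the three forward reading frames for ATG...stop ORFs.
# find_orfs and min_codons are hypothetical names, not taken from the lecture.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    An ORF is kept only if it has at least min_codons codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TTT GGG TAA in frame 2
    print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))  # [(2, 17, 2)]
```

A real prokaryotic gene finder would also scan the reverse complement and combine ORF length with promoter and codon statistics; this sketch covers only the ORF step.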
  • Slide 4
  • Slide 5
  • HMM of an E. coli gene. This model is based on the simple assumption that codons are independent, which allows it to be implemented as a Markov process. More complex assumptions are possible, and programs that use them exist. Figure legend: round = match state, diagonal = insert state, square = delete state. A model trained on E. coli is capable of finding genes in an unknown E. coli sequence. If a sequence were absolutely accurate, only match states would be needed in the model. The insert and delete states allow an ORF with an extra or missing base to be recognised.
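The codon-independence assumption means that the probability of a candidate ORF is simply the product of its codon probabilities. The sketch below shows only that scoring idea as a log-odds sum; it is not the HMM from the slide (it has no insert or delete states), and the function name, the uniform 1/64 background, the `floor` value, and the toy frequencies are all assumptions chosen for illustration.

```python
import math


def codon_log_odds(orf, codon_freq, background=1.0 / 64, floor=1e-6):
    """Log2-odds score of an in-frame sequence under independent codons.

    codon_freq maps codon -> frequency in known genes (toy values here).
    background is the assumed frequency of any codon in non-coding DNA.
    Codons missing from codon_freq get the small pseudo-frequency floor.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    score = 0.0
    for i in range(0, usable, 3):
        codon = orf[i:i + 3]
        score += math.log2(codon_freq.get(codon, floor) / background)
    return score


if __name__ == "__main__":
    # Hypothetical frequencies, NOT measured E. coli values.
    toy_freq = {"ATG": 1 / 32, "AAA": 1 / 16, "CCC": 1 / 128}
    print(codon_log_odds("ATGAAA", toy_freq))  # 3.0 (codons favoured in genes)
    print(codon_log_odds("ATGCCC", toy_freq))  # 0.0 (no evidence either way)
```

A positive score means the codons are more typical of the training genes than of the background; a trained model would take its frequencies from annotated genes of the target species.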
  • Slide 6
  • Ab initio gene prediction in eukaryotes. Knowledge of the basic structural elements of genes makes it possible to predict genes in long stretches of unknown DNA. There are two major types of information used in ab initio gene prediction programs: (1) prediction of numerous signals (next slides); (2) prediction of exons/ORFs.
  • Slide 7
  • Eukaryotic gene organization and signals. Figure labels: 5' UTR, CDS, 3' UTR; promoter; transcriptional initiation site; translational start site; translational stop site; poly(A) site; transcriptional termination site; gene (5' to 3'); mRNA transcript with cap and AAA...AA tail. This gene contains 6 exons (3 of which can be translated) and numerous signals/motifs, including ORF, splicing signals, promoter sites, as well as sites responsible for initiation and termination of transcription and translation. There are other essential signals not shown in this figure. Nearly every one of these signals may vary significantly between genes and species.
  • Slide 8
  • Ab initio gene prediction: features requiring recognition in eukaryotic genes. Promoter area; transcriptional initiation site; ORFs; splicing signals; translational initiation site (start codon ATG); translational stop site (a stop codon); poly(A) site; transcriptional termination site. Some of these sites are highly conserved, while others vary significantly. None of the above features alone provides sufficient information to predict a gene in a sea of non-coding sequence.
  • Slide 9
  • Difficulties of ORF finding in eukaryotes. Prediction of ORFs alone is not always reliable, because only long ORFs can be statistically discriminated with ease from non-coding stretches that lack stop codons by chance. False-positive predictions for relatively short sequences become a significant issue. I used 2 different programs to predict the number of exons for two genes of known structure (3 and 14 exons respectively). In both cases the predicted numbers of exons were higher than the real ones (10-15 and 18-21 respectively). As the frequencies of the trinucleotides TAA, TAG and TGA, which serve as stop codons, are rather low, there is a high probability that long stretches of non-coding DNA will not contain them. Such a stretch, lacking stop codons, can be falsely identified as an exon. False-negative predictions can also be an issue, particularly when exons are very short (< 15 nucleotides).
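The false-positive problem can be made concrete with a small calculation. As a rough sketch, assume non-coding DNA in which all four bases are equally likely and independent, so each in-frame triplet is a stop codon with probability 3/64; both that assumption and the function names are illustrative, and real genomes deviate from it.

```python
def prob_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random triplets contain no stop.

    p_stop defaults to 3/64, i.e. equal, independent base frequencies.
    """
    return (1 - p_stop) ** n_codons


def expected_codons_before_stop(p_stop=3 / 64):
    """Mean distance between stops in one frame (geometric distribution)."""
    return 1 / p_stop


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, round(prob_no_stop(n), 4))
    print(round(expected_codons_before_stop(), 1))
    # A lower stop-trinucleotide frequency makes long stop-free runs likelier:
    print(round(prob_no_stop(100, p_stop=0.02), 4))
```

Under these assumptions a stop-free run of 20 codons (60 nucleotides) turns up in well over a third of random windows, while a run of 100 codons has a probability below 1%. This is why short candidate exons are hard to separate from chance and long ORFs are not.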
  • Slide 10
  • ORF predictions and codon usage tables. Accuracy of prediction is higher in more sophisticated programs, which assume that codons may not be independent, as codons may carry much more information than just the coding of a single amino acid. Such programs use codon frequencies, which differ between species (next slide). Codon frequencies in a particular species reflect a long evolutionary history and carry a lot of hidden information about the structure of exons and genes, including the periodicity of DNA in exons, intron phase distribution and exon symmetry. The Codon Usage Database can be found at: http://www.kazusa.or.jp/codon/
  • Slide 11
  • Codon usage is species specific. Codon usage frequencies for C. elegans (sum of codon frequencies = 1):

      Codon  Freq.       Codon  Freq.       Codon  Freq.       Codon  Freq.
      AAA    0.038542    CAA    0.027238    GAA    0.040421    TAA    0.001136
      AAC    0.018477    CAC    0.00911     GAC    0.016845    TAC    0.014041
      AAG    0.024862    CAG    0.013719    GAG    0.020172    TAG    0.000416
      AAT    0.030977    CAT    0.01422     GAT    0.035479    TAT    0.018212
      ACA    0.020422    CCA    0.025709    GCA    0.02024     TCA    0.020871
      ACC    0.010144    CCC    0.004416    GCC    0.012092    TCC    0.01049
      ACG    0.00865     CCG    0.009235    GCG    0.007895    TCG    0.011826
      ACT    0.019005    CCT    0.008971    GCT    0.022095    TCT    0.016985
      AGA    0.014801    CGA    0.011919    GGA    0.03102     TGA    0.000826
      AGC    0.008023    CGC    0.005042    GGC    0.006455    TGC    0.009123
      AGG    0.003672    CGG    0.00449     GGG    0.004298    TGG    0.011081
      AGT    0.012206    CGT    0.011227    GGT    0.010906    TGT    0.011749
      ATA    0.009912    CTA    0.008125    GTA    0.010104    TTA    0.010399
      ATC    0.01886     CTC    0.01475     GTC    0.013357    TTC    0.024397
      ATG    0.026516    CTG    0.011957    GTG    0.014256    TTG    0.020223
      ATT    0.033305    CTT    0.021641    GTT    0.024388    TTT    0.025005
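One way to read such a table is to compare codons that encode the same thing. The sketch below takes a few C. elegans values from slide 11 and normalises them within a synonymous group; the helper name `relative_usage` is hypothetical, and the grouping relies on the standard genetic code (AAA and AAG both encode lysine; TAA, TAG and TGA are the stop codons).

```python
def relative_usage(freqs):
    """Normalise codon frequencies within one synonymous group to sum to 1."""
    total = sum(freqs.values())
    return {codon: f / total for codon, f in freqs.items()}


# C. elegans frequencies copied from the slide 11 table.
LYSINE = {"AAA": 0.038542, "AAG": 0.024862}
STOPS = {"TAA": 0.001136, "TAG": 0.000416, "TGA": 0.000826}

if __name__ == "__main__":
    for name, group in (("Lys", LYSINE), ("Stop", STOPS)):
        shares = relative_usage(group)
        print(name, {codon: round(s, 3) for codon, s in shares.items()})
```

With these values AAA accounts for roughly 61% of lysine codons and TAA is the most frequent stop. Preferences of this kind differ between species, which is why prediction programs use species-specific tables.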
  • Slide 12
  • Gene > 45328_AB000381 protein_id:BAA19961.1; Homo sapiens DNA for GPI-anchored molecule-like protein, complete; intron(phase:11,size:609,5302,intr_sum:5911); exon(size:73,108,296,ex_sum:477); {splice:gtag,gtag} ATGCTCCTCTTTGCCTTACTCCTAGCCATGGAGCTCCCATTGGTGGCAGCCAGTGCCACCATGCGCGCTCAGTgtaagtatcattccctctcactgtcctggagaggacgagaattccacctggggtgctgggggtcac tgggatgattggctgcaacgtggagcaagcctccgttagctggggcctgcattgtctgtgtaatcaggggtgggcactagggcagtccaggagtagtcatgagcaaggagagggttaggatgaaggagcagctgaccag ggaccaagggggaaccttgatgtggcccttccccatcagcgccaggcaggaggggctctgtccagggaaacccaggaggatggcggacccctgtgagtatccagtcttccttggcgaggtgagccaggtctgcagagca tagcaatcccgtatgtgaccaccaagtggcgctctctggagcctgcgttggagagcagggaaagctctccttgtgcctggcctccctcccaggagctagcctgggccagactcagactgcatagagagctgagctgtgc aggctaggagaagtccttggaagcagaggggaagggctggccgctgaagaagggtggagtgagctggtaatgggtggaaaaggcgtagtggagcagaagcctgaagcctgctttctcccctctcagGGACTTACAGTTT GAGATGCCATGACTGTGCGGTCATAAATGACTTCAACTGTCCCAACATTAGAGTATGTCCGTATCATATTAGGCGCTGTATGACAATCTCCATTCgtaagtacctcttggtcatttggacacattgtagattagtcccc tacctgggtagtttctggggccagggccagtctgctttcttctctgaacccagctctgtttccccttccctcatgtcctcccatcctgagtgcgtttctgcacgcttgggtctcagcctcatgataggccagcatgcat catcttgtggagccaggtactctgcaaagtagtacagtctgtccacacatctgcagcgtctccagggggtgggagcattgttggaccgcagagcctctactgtccctggttgtgtgtgtagacaccccatgctgcctgg gtgagtcctcactggccattcaatttctggtagtttagagagtgcttccagttcggaaggtcaaagaagtggttgggccatggcactccccccagggaaaaacctaccacacacagttgggagacaagcatgaggtcag ggcacagcctgcacttccaggatgctcttgctttttcctcagagccctctggtgccctctctccccagggccctaacctgcagaagatgtatggccagaggccagtgaccaatggagcaagcagggagggtgcagccag tgtatgccctggcgcacaggtggagcctcgtctggggctctgctcagggcctttcctgtagcccttttgtccccggagtctgtggttgtgcttgcaacttcacatgtcattctgtgttctaagctttgtgcagaaataa tcccctgattaccacctcggtgtggttgggttatgtccccacccagatctcatcttgaattgtaactcccataatccccgtgtgtcatgggagggacctggtgggaggtaattgattcatctgggtggttaccctcatg 
atgtgctcgtgatagcgagtgcccacaagggcgaaacctgccacacacagatgggccggacaagcatgcggtgcagggcacagactgcacttccaggatgctcttgctttttcctcagagcccgctggtgccctctctc cccagggccctaacctgcagaagatctgatggttttctaaggagggttttccccggaacttctcctgcctgccatcattcgaagaatgtgttggttacccctgctgttgtcatcgtaagtttcctgaggcctccccagc cacgcggaactgtgagtcagttaaaccgcttgcctttataaattgcccagtctcgggcagttcttatagcagcgtgagaagagatcatacacacctcttcttcagaccaactgctgatttgggacatgggcaggaggct gagaatctgggctttgtctccacagtggccactgctggggaccttgatggcatggttttgtcatgttgggattactgcctctaaatgaggctgagaactatctttcccataactccctcgtggggtaatagttggcaaa aagaggaacttgagtgagattcgcaaggtatgggtgaagcagagatcactcttctggaaggtcacgatggtttagtgaagtgaaggacactggtggactccaatctgtcctccttcatgtccaccccatcttctccact ttgtgagcctttgactgaaagcaattctaagactccagcaggcatttggccactggcctgcctaggtgatgggttctccagtggcgcaggttaaatgtctctgcaagagactcctctgcacagccttacttgagaggat aacggtgcatggcctttgatattgccctgcagagttgggctgtgccagcctgccccagtgtaagtgacctagctctggactgttgctcctctgatcttcagccttacctgactgacttcctctcttggctcccccatgg tcatggcagctgcaacacttttatttaataacttagacctagactgttttagaggctccattttcctgaatgaaccctgatggatattaggacaaagaaattgggaatgctggaatgctgggacatttttcctttcagg agatttgctgatttctggggctgtgcaggatggtaagcaaaagacctctatggaaagaaacgcaggcagtgtcctgagacaaggggcctgggtatgaggcatataggaactctgcactatctctgcaacttttctgaaa acctaaaactatttttaaaataaaaggttcatacacacacacacacacacacacacacacacacaccataaccccactaccttgtggattcaataacagaattgcaatgtggtttacaatttgaaacccatcagtttaa ttcaccatattaacaaagaaaagggcagagatgattgtagatgctgaaatgttttgataaaatccaacaccctttccaaatctctgagggaaatgtctaccatcaccacttctattcccaatggaactggagttgttag gtagttcagttaggcgagggggagtattaagtctcatggagcggaggaagcaagaatttaacctgcttactattcatnagagacaaccaacttcgaataacaaatatctttaggcaagctcctggagacaagataagta ttccgaaatcagtttaccaaatcagttctccttttgtacatcaagaaacagattggaaatgaaatccgaggcatggtgctctttaaagtatngccaaatcaagtaccacaaatctagctggagttgtgcaagatttcaa cactcaacctgcaagaacactattgaagacttaaaacaccattaaagacttaaacacatggaattgt
