Gene prediction. Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc

Post on 12-Jan-2016

TRANSCRIPT

<ul><li><p>Gene prediction</p></li><li><p>Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg</p></li><li><p>Gene Prediction: Computational Challenge 
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg</p><p>Gene!</p></li>
<li><p>Gene Prediction Analogy</p><p>A newspaper is written in an unknown language. Certain pages contain an encoded message, say 99 letters on page 7, 30 on page 12 and 63 on page 15.</p><p>How do you recognize the message? You could probably distinguish between the ads and the story (ads often contain the $ sign).</p><p>The statistics-based approach to gene prediction tries to make similar distinctions between exons and introns.</p></li>
<li><p>Noting the differing frequencies of symbols (e.g. %, ., -) and numerical symbols, could you distinguish between a story and the stock report in a foreign newspaper? 
Statistical Approach: Metaphor in Unknown Language</p></li>
<li><p>Two Approaches to Gene Prediction</p><p>Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).</p><p>Similarity-based: many human genes are similar to genes in mice, chickens, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.</p></li>
<li><p>If you could compare the day's news in English side by side with the same news in a foreign language, some similarities might become apparent.</p><p>Similarity-Based Approach: Metaphor in Different Languages</p></li>
<li><p>Annotation of Genomic Sequence</p><p>Given the sequence of an organism's genome, we would like to be able to identify:</p><p>Genes</p><p>Exon boundaries &amp; splice sites</p><p>Beginning and end of translation</p><p>Alternative splicings</p><p>Regulatory elements (e.g. promoters)</p><p>(Slide labels: primary goals, secondary goals)</p><p>The only certain way to do this is experimentally, but that is time-consuming and expensive. Computational methods can achieve reasonable accuracy quickly and help direct experimental approaches.</p></li>
<li><p>Prokaryotic Gene Structure</p><p>[Figure: genomic DNA with promoter, CDS, and terminator; transcription produces mRNA]</p><p>Most bacterial promoters contain the Pribnow box, at about -10, which has the consensus sequence 5'-TATAAT-3'.</p><p>The terminator is a signal at the end of the coding sequence that terminates the transcription of RNA.</p><p>The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. 
The amino acids (AAs) are the building blocks of proteins.</p></li>
<li><p>Pieces of a (Eukaryotic) Gene (on the genome)</p><p>Exons (CDS &amp; UTR): ~10<sup>2</sup>-10<sup>3</sup> bp</p><p>Introns: ~10<sup>2</sup>-10<sup>5</sup> bp</p></li>
<li><p>What is Computational Gene Finding?</p><p>Given an uncharacterized DNA sequence, find out:</p><p>Which region codes for a protein?</p><p>Which DNA strand is used to encode the gene?</p><p>Which reading frame is used in that strand?</p><p>Where does the gene start and end?</p><p>Where are the exon-intron boundaries in eukaryotes?</p><p>(Optionally) Where are the regulatory sequences for that gene?</p></li>
<li><p>Prokaryotic vs. Eukaryotic Gene Finding</p><p>Prokaryotes:</p><p>small genomes, 0.5-10 &times; 10<sup>6</sup> bp</p><p>high coding density (&gt;90%)</p><p>no introns</p><p>Gene identification is relatively easy, with a success rate of ~99%.</p><p>Problems:</p><p>overlapping ORFs</p><p>short genes</p><p>finding TSS and promoters</p><p>Eukaryotes:</p><p>large genomes, 10<sup>7</sup>-10<sup>10</sup> bp</p><p>low coding density (</p></li>
<li><p>What is it about genes that we can measure (and model)?</p><p>Most of our knowledge is biased towards protein-coding characteristics:</p><p>ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.</p><p>Codon usage: most frequently measured by CAI (Codon Adaptation Index).</p><p>Other phenomena:</p><p>Nucleotide frequencies and correlations: value and structure</p><p>Functional sites: splice sites, promoters, UTRs, polyadenylation sites</p></li>
<li><p>General Things to Remember about (Protein-coding) Gene Prediction Software</p><p>It is, in general, organism-specific.</p><p>It works best on genes that are reasonably similar to something seen previously.</p><p>It finds protein-coding regions far better than non-coding regions.</p><p>In the absence of external (direct) information, alternative forms will not be identified.</p><p>It is imperfect! 
(It's biology, after all.)</p></li>
<li><p>Gene Finding: Different Approaches</p><p>Similarity-based methods (extrinsic) use similarity to annotated sequences:</p><p>proteins</p><p>cDNAs</p><p>ESTs</p><p>Comparative genomics: aligning genomic sequences from different species</p><p>Ab initio gene finding (intrinsic)</p><p>Integrated approaches</p></li>
<li><p>Similarity-based methods</p><p>Based on sequence conservation due to functional constraints</p><p>Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases</p><p>Will not identify genes that code for proteins not already in databases (can identify ~50% new genes)</p><p>Limits of the regions of similarity are not well defined</p></li>
<li><p>Comparative Genomics</p><p>Based on the assumption that coding sequences are more conserved than non-coding sequences</p><p>Two approaches:</p><p>intra-genomic (gene families)</p><p>inter-genomic (cross-species)</p><p>Alignment of homologous regions</p><p>Difficult to define limits of higher similarity</p><p>Difficult to find the optimal evolutionary distance (patterns of conservation differ between loci)</p></li>
<li><p>Summary for Extrinsic Approaches</p><p>Strengths:</p><p>Rely on accumulated pre-existing biological data, and thus should produce biologically relevant predictions</p><p>Weaknesses:</p><p>Limited to pre-existing biological data</p><p>Errors in databases</p><p>Difficult to find limits of similarity</p></li>
<li><p>Ab initio Gene Finding</p><p>Input: a DNA string over the alphabet {A,C,G,T}</p><p>Output: an annotation of the string showing, for every nucleotide, whether it is coding or non-coding</p><p>[Figure: the gene finder takes AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG and returns the annotated sequence AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG]</p><p>Using only sequence information</p><p>Identifying only coding exons of protein-coding genes (transcription start site, 5' and 3' UTRs are ignored)</p><p>Integrates coding statistics with signal detection</p></li>
<li><p>A eukaryotic gene</p><p>This is the human p53 tumor suppressor gene on chromosome 17.</p><p>Genscan is 
one of the most popular gene prediction algorithms.</p><p>This particular gene lies on the reverse strand.</p></li>
<li><p>Observations</p><p>Given (walk, shop, clean):</p><p>What is the probability of this sequence of observations? (Is he really still at home, or did he skip the country?)</p><p>What was the most likely sequence of rainy/sunny days?</p></li>
<li><p>Signals vs contents</p><p>In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.</p><p>Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites</p><p>Examples of contents: exons, introns, UTRs, promoter regions</p></li>
<li><p>The CpG island problem</p><p>Methylation in human genome</p><p>CG -&gt; TG happens in most places except start regions of genes and within genes</p><p>CpG islands = 100-1,000 bases before a gene starts</p><p>Question: given a long sequence, how would we find the CpG islands in it?</p></li>
<li><p>Promoters</p><p>Promoters are DNA segments upstream of transcripts that initiate transcription.</p><p>The promoter attracts RNA polymerase to the transcription start site.</p><p>[Figure: 5' Promoter 3']</p></li>
<li><p>Splice signals (mice): GT, AG</p></li>
<li><p>Splice site detection</p></li>
<li><p>Real splice sites</p><p>Real splice sites show some conservation at positions beyond the first two.</p><p>We can add additional arrows to model these states.</p><p>weblogo.berkeley.edu</p></li>
<li><p>Ribosomal Binding Site</p></li>
<li><p>Prior knowledge</p><p>The translated region must have a length that is a multiple of 3.</p><p>Some codons are more common than others.</p><p>Exons are usually shorter than introns.</p><p>The translated region begins with a start signal and ends with a stop codon.</p><p>5' splice sites (exon to intron) are usually GT; 3' splice sites (intron to exon) are usually AG.</p><p>The distribution of nucleotides and dinucleotides is usually different in introns and exons.</p></li>
<li><p>Gene Prediction and Motifs</p><p>Upstream regions of genes often contain motifs that can be used for gene 
prediction.</p><p>[Figure: upstream motifs on a position axis (-35, -10, 0, 10): TTCCAA, TATACT (Pribnow box), GGAGG (ribosomal binding site), transcription start site, ATG, STOP]</p></li>
<li><p>Positional dependence</p><p>In this data, every time a G appears in position 1, an A appears in position 3.</p><p>Conversely, an A in position 1 always occurs with a T in position 3.</p><p>ACTG ACTT GCAC ACTT ACTA GCAT ACTA ACTT</p></li>
<li><p>Example of (Positional) Weight Matrix</p><p>Computed by measuring the frequency of every element at every position of the site (weight)</p><p>The score for any putative site is the sum of the matrix values (converted to probabilities) for that sequence (log-likelihood score)</p><p>Disadvantages:</p><p>cut-off value required</p><p>assumes independence between adjacent bases</p><p>TACGAT TATAAT TATAAT GATACT TATGAT TATGTT</p>
<table>
<tr><th></th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th></tr>
<tr><th>A</th><td>0</td><td>6</td><td>0</td><td>3</td><td>4</td><td>0</td></tr>
<tr><th>C</th><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td></tr>
<tr><th>G</th><td>1</td><td>0</td><td>0</td><td>3</td><td>0</td><td>0</td></tr>
<tr><th>T</th><td>5</td><td>0</td><td>5</td><td>0</td><td>1</td><td>6</td></tr>
</table></li>
<li><p>Conditional probability</p><p>What is the probability of observing an A at position 2, given that we observed a C at the previous position?</p><p>Answer: the total number of CAs divided by the total number of Cs in position 1: 3/11 = 27%.</p><p>Probability of observing CA = 3/18 = 17%.</p><p>GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG</p></li>
<li><p>Conditional probability</p><p>What is the probability of observing a G at position 3, given that we observed a C at the previous position?</p><p>Answer: 9/12 = 75%.</p><p>GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG</p></li>
<li><p>Promoter Structure in Prokaryotes (E. coli)</p><p>Transcription starts at offset 0. 
Pribnow Box (-10); Gilbert Box (-30); Ribosomal Binding Site (+10)</p></li>
<li><p>Open Reading Frames (ORFs)</p><p>Detect potential coding regions by looking at ORFs</p><p>A genome of length n comprises (n/3) codons</p><p>Stop codons break the genome into segments between consecutive stop codons</p><p>The subsegments of these that start from the start codon (ATG) are ORFs</p><p>ORFs in different frames may overlap</p><p>[Figure: genomic sequence with an open reading frame from ATG to TGA]</p></li>
<li><p>Long vs. Short ORFs</p><p>Long open reading frames may be genes.</p><p>At random, we should expect one stop codon every (64/3) ~= 21 codons. However, genes are usually much longer than this.</p><p>A basic approach is to scan for ORFs whose length exceeds a certain threshold.</p><p>This is naive because some genes (e.g. some neural and immune system genes) are relatively short.</p></li>
<li><p>Testing ORFs: Codon Usage</p><p>Create a 64-element hash table and count the frequencies of codons in an ORF</p><p>Amino acids typically have more than one codon, but in nature certain codons are used more often</p><p>Uneven use of the codons may characterize a real gene</p><p>This compensates for pitfalls of the ORF length test</p></li>
<li><p>Open Reading Frames in Bacteria</p><p>Without introns, look for a long open reading frame (start codon ATG ... stop codon TAA, TAG, or TGA)</p><p>Short genes are missed (</p></li>
<li><p>Coding Statistics</p><p>Unequal usage of codons in the coding regions is a universal feature of genomes:</p><p>uneven usage of amino acids in existing proteins</p><p>uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs)</p><p>We can use this feature to differentiate between coding and non-coding regions of the genome.</p><p>Coding statistic: a function that, for a given DNA sequence, computes a likelihood that the sequence is coding for a protein</p></li>
<li><p>Coding Statistics</p><p>Many different ones:</p><p>codon usage</p><p>hexamer usage</p><p>GC content</p><p>compositional bias between codon positions</p><p>nucleotide periodicity</p></li>
<li><p>Codon Usage in Human Genome</p></li>
<li><p>Codon Usage in Mouse Genome</p>
<table>
<tr><th>AA</th><th>Codon</th><th>Per 1000 codons</th><th>Fraction</th></tr>
<tr><td>Ser</td><td>TCG</td><td>4.31</td><td>0.05</td></tr>
<tr><td>Ser</td><td>TCA</td><td>11.44</td><td>0.14</td></tr>
<tr><td>Ser</td><td>TCT</td><td>15.70</td><td>0.19</td></tr>
<tr><td>Ser</td><td>TCC</td><td>17.92</td><td>0.22</td></tr>
<tr><td>Ser</td><td>AGT</td><td>12.25</td><td>0.15</td></tr>
<tr><td>Ser</td><td>AGC</td><td>19.54</td><td>0.24</td></tr>
<tr><td>Pro</td><td>CCG</td><td>6.33</td><td>0.11</td></tr>
<tr><td>Pro</td><td>CCA</td><td>17.10</td><td>0.28</td></tr>
<tr><td>Pro</td><td>CCT</td><td>18.31</td><td>0.30</td></tr>
<tr><td>Pro</td><td>CCC</td><td>18.42</td><td>0.31</td></tr>
<tr><td>Leu</td><td>CTG</td><td>39.95</td><td>0.40</td></tr>
<tr><td>Leu</td><td>CTA</td><td>7.89</td><td>0.08</td></tr>
<tr><td>Leu</td><td>CTT</td><td>12.97</td><td>0.13</td></tr>
<tr><td>Leu</td><td>CTC</td><td>20.04</td><td>0.20</td></tr>
<tr><td>Ala</td><td>GCG</td><td>6.72</td><td>0.10</td></tr>
<tr><td>Ala</td><td>GCA</td><td>15.80</td><td>0.23</td></tr>
<tr><td>Ala</td><td>GCT</td><td>20.12</td><td>0.29</td></tr>
<tr><td>Ala</td><td>GCC</td><td>26.51</td><td>0.38</td></tr>
<tr><td>Gln</td><td>CAG</td><td>34.18</td><td>0.75</td></tr>
<tr><td>Gln</td><td>CAA</td><td>...</td><td>...</td></tr>
</table></li></ul>
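<p>The ORF slides (scan between stop codons, start at ATG, apply a length threshold) and the codon-usage slide (a 64-entry table of codon counts) can be sketched in Python. This is a minimal illustration and not the method of any particular gene finder: the names <code>find_orfs</code> and <code>codon_usage</code>, the <code>min_codons</code> parameter, and the toy sequence are assumptions made for this example, and only the forward strand is scanned.</p>

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=0):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    An ORF here runs from the first ATG after a stop codon to the next
    in-frame stop codon. `end` is exclusive and includes the stop codon.
    `min_codons` (a hypothetical threshold parameter) is the minimum
    number of codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def codon_usage(orf_seq):
    """Count codons in an in-frame sequence (at most 64 distinct keys)."""
    orf_seq = orf_seq.upper()
    return Counter(orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3))


if __name__ == "__main__":
    toy = "GGATGAAACCCTAGTT"  # toy sequence, not taken from the slides
    for start, end, frame in find_orfs(toy):
        print(start, end, frame, dict(codon_usage(toy[start:end])))
```

<p>Covering the other strand would mean running the same scan on the reverse complement. As the slides note, a length threshold alone misses short genes, which is why the codon counts are a useful second test.</p>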
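<p>The positional weight matrix and conditional-probability slides can be reproduced from their own example data. The sketch below is illustrative only: the function names are hypothetical, and the pseudocount and the uniform 0.25 background used for the log-odds score are assumptions that the slides do not specify. The flattened conditional-probability string is read as 18 triplets, which matches the slide's 3/18 figure.</p>

```python
import math
from collections import Counter

# Six aligned sites from the weight-matrix slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

# Data from the conditional-probability slides, split into triplets.
DATA = "GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG"
TRIPLETS = [DATA[i:i + 3] for i in range(0, len(DATA), 3)]


def count_matrix(sites):
    """One Counter of base counts per position of the aligned sites."""
    return [Counter(site[pos] for site in sites) for pos in range(len(sites[0]))]


def log_odds_score(candidate, counts, n_sites, pseudocount=1.0, background=0.25):
    """Sum of per-position log2(p / background) values.

    The pseudocount and uniform background are assumptions for this sketch.
    Positions are scored independently, the limitation the slide points out.
    """
    score = 0.0
    for pos, base in enumerate(candidate):
        p = (counts[pos][base] + pseudocount) / (n_sites + 4 * pseudocount)
        score += math.log2(p / background)
    return score


def conditional_prob(seqs, prev_pos, prev_base, base):
    """P(base at prev_pos + 1 | prev_base at prev_pos), 0-based positions."""
    with_prev = [s for s in seqs if s[prev_pos] == prev_base]
    hits = sum(1 for s in with_prev if s[prev_pos + 1] == base)
    return hits / len(with_prev)


if __name__ == "__main__":
    counts = count_matrix(SITES)
    print(log_odds_score("TATAAT", counts, len(SITES)))
    print(conditional_prob(TRIPLETS, 0, "C", "A"))  # the slide's 3/11
    print(conditional_prob(TRIPLETS, 1, "C", "G"))  # the slide's 9/12
```

<p>A first-order dependence such as the one measured by <code>conditional_prob</code> is exactly what a plain weight matrix cannot represent, since it scores each position on its own.</p>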