Transcript
Page 1: Gene finding and gene structure prediction

Gene finding and genestructure prediction

M. Fatih BÜYÜKAKÇALI2008639500

Computational Bioinformatics2012

Page 2: Gene finding and gene structure prediction

Outline

• Introduction to genes and proteins

• Genetic code

• Open reading frames

Page 3: Gene finding and gene structure prediction

Outline

• Ab initio methods• Principles: signal detection and coding

statistics• Methods to integrate signal detection and

coding statistics• Examples of software

Page 4: Gene finding and gene structure prediction

Outline

• Homology methods• Principles• An overview of the homology methods

Page 5: Gene finding and gene structure prediction

Introduction to genes and proteins• Proteins are the main building block for

many tasks in living organisms. They are themselves build up as a chain of amino-acids (AA) (200-300, typically).

• The chain of amino-acids of a protein is produced by translation of an RNA sequence (via the ribosome; while translation takes place, the protein folds progressively to take its three-dimensional structure).

Page 6: Gene finding and gene structure prediction

Introduction to genes and proteins• The RNA sequence needed to

produce a given protein is normally obtained by transcribing a part of the DNA contained in the genome (it is then called mRNA) and the corresponding subsequence of DNA is called a gene coding for that protein.

Page 7: Gene finding and gene structure prediction

Introduction to genes and proteins• A given genome can contain as few as

500 genes or as many as 30,000 genes.

• Central dogma: DNARNAProtein

Page 8: Gene finding and gene structure prediction

Introduction: gene structure

Page 9: Gene finding and gene structure prediction

Genetic code• Correspondence between tri-mers

(codons) of nucleotides and amino-acids

• 20 amino-acids, but 64 codons (see book, or Internet, for explanations)

• Some amino-acids correspond to several codons: A (Alanine) corresponds to GCA, GCG, GCT

Page 10: Gene finding and gene structure prediction

Genetic code• Some codons do not correspond to an

amino-acid: TAA, TAG, TGA (these are stop codons, see below).

• One codon is special: ATG, it is the sole codon corresponding to Methionine, and is also called start codon (see below).

NB. Although it is RNA that is translated into amino-acids, we use the DNA alphabet (T instead of U) to describe the genetic code, because we will directly search DNA sequences for protein coding sequences.

Page 11: Gene finding and gene structure prediction

Open reading framesAn open reading frame, is a sequence of DNA nucleotides that could be translated into a protein. We know that:

• Translation goes from 5’ to 3’ end of a strain (sense, or anti-sense)

• Translation always starts with a methionine codon (ATG)

Page 12: Gene finding and gene structure prediction

Open reading frames

Page 13: Gene finding and gene structure prediction

Open reading frames

• Translation always stops, as soon as a stop codon is found (and the AA-sequence ends with the AA corresponding to the last non-stop codon).

Page 14: Gene finding and gene structure prediction

What is gene finding?

• From a genomic DNA sequence we want to predict the regions that will encode for a protein: the genes.

• Gene finding is about detecting these coding regions and infer the gene structure starting from genomic DNA sequences.

Page 15: Gene finding and gene structure prediction

What is gene finding?

• We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.

• Gene finding is not an easy task!

Page 16: Gene finding and gene structure prediction

What is gene finding?• Gene finding is not an easy task!

• DNA sequence signals have low information content (small alphabet and short sequences);

• It is difficult to discriminate real signals from noise (degenerated and highly unspecific signals);

• Gene structure can be complex (sparse exons, alternative splicing, ...);

• DNA signals may vary in different organisms;• Sequencing errors (frame shifts, ...).

Page 17: Gene finding and gene structure prediction

Gene structure in prokaryotes

• High gene density and simple gene structure.

• Short genes have little information.• Overlapping genes.

Page 18: Gene finding and gene structure prediction

Gene structure in eukaryotes

• Low gene density and complex gene structure.

• Alternative splicing.• Pseudo-genes.

Page 19: Gene finding and gene structure prediction

Gene finding strategies

• Ab initio methods:• Based on statistical signals within the DNA:

• Signals: short DNA motifs (promoters, start/stop codons, splice sites, ...)

• Coding statistics: nucleotide compositional bias in coding and non-coding regions

Page 20: Gene finding and gene structure prediction

Gene finding strategies

• Strengths:• easy to run and fast execution time• only require the DNA sequence as input

Page 21: Gene finding and gene structure prediction

Gene finding strategies

• Weaknesses:• prior knowledge is required (training sets)• high number of mispredicted gene structures

Page 22: Gene finding and gene structure prediction

Gene finding strategies

• Homology methods:• Gene structure is deduced using

homologous sequences (EST, mRNA, protein).

• Very accurate results when using homologous sequences with high similarity.

Page 23: Gene finding and gene structure prediction

Gene finding strategies

• Strengths:• accurate

• Weaknesses:• need of good homologous sequences• execution is slow

Page 24: Gene finding and gene structure prediction

Gene finding:

Ab initio methods

Page 25: Gene finding and gene structure prediction

Ab initio methods: a simple view

Page 26: Gene finding and gene structure prediction

Methods for signal detection

• Detect short DNA motifs (promoters, start/stop codons, splice sites, intron branching point, ...).

Page 27: Gene finding and gene structure prediction

Methods for signal detection• A number of methods are used for

signal detection:• Consensus string: based on most frequently

observed residues at a given position.

• Pattern recognition: flexible consensus strings.

• Weight matrices: based on observed frequencies of residues at a given position. Uses standard alignment algorithms.

Page 28: Gene finding and gene structure prediction

Methods for signal detection• A number of methods are used for

signal detection:• Weight array matrices: weight matrices based on

dinucleotides frequencies. Takes into account the non-independence of adjacent positions in the sites.

• Maximal dependence decomposition (MDD): MDD generates a model which captures significant dependencies between non-adjacent as well adjacent positions, starting from an aligned set of signals.

Page 29: Gene finding and gene structure prediction

Methods for signal detection• Methods for signal detection:

• Hidden Markov Models (HMMs):• HMMs use a probabilistic framework to infer

the probability that a sequence correspond to a real signal.

• Neural Networks (NNs):• NNs are trained with positive and negative

examples. NNs ”discover” the features that distinguish the two sets.

Page 30: Gene finding and gene structure prediction

Methods for signal detection

Page 31: Gene finding and gene structure prediction

Signal detection limitations• Problems with signal detection:

• DNA sequence signals have low information content.

• Signals are highly unspecific and degenerated.• Difficult to distinguish between true and false

positive.

• How to improve signal detection:• Take context into consideration (ex. acceptor site

must be flanked by an intron and an exon).• Combine with coding statistics (compositional

bias).

Page 32: Gene finding and gene structure prediction

Types of coding statistics• Inter-genic regions, introns, and exons

have different nucleotides contents.• This compositional differences can be

used to infer gene structure.• Examples of coding statistics:

• ORF length:• Assuming an uniform random distribution,

stop codons are present every 64/3 codons (≈ 21 codons) in average.

• In coding regions stop codon average decrease

Page 33: Gene finding and gene structure prediction

Types of coding statistics

• This measure is sensitive to frame shift errors.• Can’t detect short coding regions

• Bias in nucleotide content in coding regions:• Generally coding regions are G+C rich.• There are exceptions! For example coding

regions of P. falciparum are A+T rich.

Page 34: Gene finding and gene structure prediction

Integrating signal and compositional information for gene structure prediction

• A number of methods exists for gene structure prediction which integrate different techniques to detect signals (splicing sites, promoters, etc.) and coding statistics.

• All these methods are classifiers based on machine learning theory.

• Training sets are required to train the algorithms.

Page 35: Gene finding and gene structure prediction

Ab initio methods: Generalized HMMs

Page 36: Gene finding and gene structure prediction

Ab initio methods: Generalized HMMs

Page 37: Gene finding and gene structure prediction

The Other Ab initio methods

• GENSCAN• HMMgene• Linear and quadratic discrimination analysis• FGENES• MZEF• Decision trees• Neural network• GRAIL

Page 38: Gene finding and gene structure prediction
Page 39: Gene finding and gene structure prediction
Page 40: Gene finding and gene structure prediction

Gene finding:

Homology methods

Page 41: Gene finding and gene structure prediction

Homology methods: a simple view

Page 42: Gene finding and gene structure prediction

Homology methods: Procrustes

Procrustes: robber who altered his victims to fit his bed by stretching them or cutting off their legs (Classical Mythology)

Page 43: Gene finding and gene structure prediction

Homology methods: Procrustes

Page 44: Gene finding and gene structure prediction

Homology methods: Genewise

• Uses HMMs to compare DNA sequences to protein sequences at the level of its conceptual translation, regardless of sequencing errors and introns.

• Principle:• The exon model used in genewise is a HMM with

3 base states (match, insert, delete) with the addition of more transitions between states to consider frame-shifts.

• Intron states have been added to the base model.• Genewise directly compare HMM-profiles of

proteins or domains to the gene structure HMM model.

Page 45: Gene finding and gene structure prediction

Homology methods: Genewise

• Genewise is a powerful tool, but time consuming.

• Requires strong similarities (>70% identity) to produce good predictions.

• Genewise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.

Page 46: Gene finding and gene structure prediction

Homology methods: Genewise

Page 47: Gene finding and gene structure prediction

Homology methods: sim4

• Align cDNA to genomic sequences.

• sim4 performs standard dynamic programming:• models splice sites• introns are treated as special kind of gaps with

low penalties

• sim4 performs very well, but needs strong similarity between the sequences.

Page 48: Gene finding and gene structure prediction

Homology methods: BLAST

• BLAST can be used to find genomic sequences similar to proteins, ESTs, cDNAs.

• A BLAST hit doesn’t mean necessarily an exon. Some post-processing is required.

• BLAST can indicate the rough position of exons, but nothing about the gene structure.

Page 49: Gene finding and gene structure prediction

Homology methods: BLAST

• However, BLAST is fast! and can reduce the search space for others programs.

Page 50: Gene finding and gene structure prediction

Homology methods: Trimming with BLAST


Top Related