gene finding and gene structure prediction

Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012


Post on 23-Feb-2016

Page 1: Gene finding and gene structure prediction

Gene finding and gene structure prediction

M. Fatih BÜYÜKAKÇALI 2008639500

Computational Bioinformatics 2012

Page 2: Gene finding and gene structure prediction

Outline

• Introduction to genes and proteins

• Genetic code

• Open reading frames

Page 3: Gene finding and gene structure prediction

Outline

• Ab initio methods

  • Principles: signal detection and coding statistics

  • Methods to integrate signal detection and coding statistics

  • Examples of software

Page 4: Gene finding and gene structure prediction

Outline

• Homology methods

  • Principles

  • An overview of the homology methods

Page 5: Gene finding and gene structure prediction

Introduction to genes and proteins

• Proteins are the main building blocks for many tasks in living organisms. They are themselves built up as a chain of amino acids (AA), typically 200-300 long.

• The amino-acid chain of a protein is produced by translation of an RNA sequence by the ribosome. While translation takes place, the protein folds progressively into its three-dimensional structure.

Page 6: Gene finding and gene structure prediction

Introduction to genes and proteins

• The RNA sequence needed to produce a given protein is normally obtained by transcribing a part of the DNA contained in the genome (it is then called mRNA), and the corresponding subsequence of DNA is called a gene coding for that protein.

Page 7: Gene finding and gene structure prediction

Introduction to genes and proteins

• A given genome can contain as few as 500 genes or as many as 30,000 genes.

• Central dogma: DNA → RNA → Protein

Page 8: Gene finding and gene structure prediction

Introduction: gene structure

Page 9: Gene finding and gene structure prediction

Genetic code

• Correspondence between trimers (codons) of nucleotides and amino acids

• 20 amino acids, but 64 codons (see book, or Internet, for explanations)

• Some amino acids correspond to several codons: A (Alanine) corresponds to GCA, GCC, GCG, GCT

Page 10: Gene finding and gene structure prediction

Genetic code

• Some codons do not correspond to an amino acid: TAA, TAG, TGA (these are stop codons, see below).

• One codon is special: ATG is the sole codon corresponding to Methionine, and is also called the start codon (see below).

NB. Although it is RNA that is translated into amino acids, we use the DNA alphabet (T instead of U) to describe the genetic code, because we will directly search DNA sequences for protein-coding sequences.
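
The codon-to-amino-acid correspondence described above can be sketched in Python. This is an illustrative sketch, not part of the original slides: the names `CODON_TABLE` and `translate` are hypothetical, and the table is the standard genetic code written in the DNA alphabet (T instead of U).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid (hypothetical helper name).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # stop codon: translation ends here
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCATGGTAA"))  # MAW
```

The dictionary makes the degeneracy of the code visible: 61 codons map onto 20 amino acids, and the remaining 3 codons are stops.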

Page 11: Gene finding and gene structure prediction

Open reading frames

An open reading frame is a sequence of DNA nucleotides that could be translated into a protein. We know that:

• Translation proceeds from the 5’ end to the 3’ end of a strand (sense or antisense)

• Translation always starts with a methionine codon (ATG)

Page 12: Gene finding and gene structure prediction

Open reading frames

Page 13: Gene finding and gene structure prediction

Open reading frames

• Translation always stops as soon as a stop codon is found (and the AA sequence ends with the AA corresponding to the last non-stop codon).
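
The three rules above (5’ to 3’ reading, ATG start, first in-frame stop codon ends translation) suggest a simple scan over the six reading frames. The following is a minimal sketch under those rules only; the function names and the `min_codons` threshold are assumptions introduced for illustration, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the antisense strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """List ORFs as (strand, frame, start, end, sequence) in all six reading frames.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    min_codons counts codons before the stop (hypothetical filter).
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a new ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF at the stop codon
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

An ATG that is never followed by an in-frame stop is not reported here; whether to keep such open-ended candidates is a design choice that depends on whether the input is a complete sequence.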

Page 14: Gene finding and gene structure prediction

What is gene finding?

• From a genomic DNA sequence we want to predict the regions that encode a protein: the genes.

• Gene finding is about detecting these coding regions and inferring the gene structure, starting from genomic DNA sequences.

Page 15: Gene finding and gene structure prediction

What is gene finding?

• We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.

• Gene finding is not an easy task!

Page 16: Gene finding and gene structure prediction

What is gene finding?

• Gene finding is not an easy task!

  • DNA sequence signals have low information content (small alphabet and short sequences);

  • It is difficult to discriminate real signals from noise (degenerate and highly unspecific signals);

  • Gene structure can be complex (sparse exons, alternative splicing, ...);

  • DNA signals may vary in different organisms;

  • Sequencing errors (frame shifts, ...).

Page 17: Gene finding and gene structure prediction

Gene structure in prokaryotes

• High gene density and simple gene structure.

• Short genes have little information.

• Overlapping genes.

Page 18: Gene finding and gene structure prediction

Gene structure in eukaryotes

• Low gene density and complex gene structure.

• Alternative splicing.

• Pseudo-genes.

Page 19: Gene finding and gene structure prediction

Gene finding strategies

• Ab initio methods:

  • Based on statistical signals within the DNA:

    • Signals: short DNA motifs (promoters, start/stop codons, splice sites, ...)

    • Coding statistics: nucleotide compositional bias in coding and non-coding regions

Page 20: Gene finding and gene structure prediction

Gene finding strategies

• Strengths:

  • easy to run and fast execution time

  • only require the DNA sequence as input

Page 21: Gene finding and gene structure prediction

Gene finding strategies

• Weaknesses:

  • prior knowledge is required (training sets)

  • high number of mispredicted gene structures

Page 22: Gene finding and gene structure prediction

Gene finding strategies

• Homology methods:

  • Gene structure is deduced using homologous sequences (EST, mRNA, protein).

  • Very accurate results when using homologous sequences with high similarity.

Page 23: Gene finding and gene structure prediction

Gene finding strategies

• Strengths:

  • accurate

• Weaknesses:

  • good homologous sequences are needed

  • execution is slow

Page 24: Gene finding and gene structure prediction

Gene finding:

Ab initio methods

Page 25: Gene finding and gene structure prediction

Ab initio methods: a simple view

Page 26: Gene finding and gene structure prediction

Methods for signal detection

• Detect short DNA motifs (promoters, start/stop codons, splice sites, intron branching point, ...).

Page 27: Gene finding and gene structure prediction

Methods for signal detection

• A number of methods are used for signal detection:

  • Consensus string: based on the most frequently observed residue at each position.

  • Pattern recognition: flexible consensus strings.

  • Weight matrices: based on the observed frequencies of residues at each position. Uses standard alignment algorithms.
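
As an illustration of the consensus-string and weight-matrix ideas above, here is a minimal Python sketch. The aligned sites in `SITES` are invented example data, and the function names, the pseudocount, and the uniform background of 0.25 are all assumptions made for this example rather than details from the slides.

```python
import math
from collections import Counter

# Hypothetical aligned example sites (all of the same length).
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]


def consensus(sites):
    """Most frequently observed residue at each position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Log-odds weight matrix: one {base: score} dict per position."""
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({
            b: math.log2((counts[b] + pseudocount) / (n + 4 * pseudocount) / background)
            for b in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds scores; positions are treated as independent."""
    return sum(col[b] for col, b in zip(pwm, window))


def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1), key=lambda i: score(pwm, seq[i:i + width]))


pwm = build_pwm(SITES)
print(consensus(SITES), best_hit(pwm, "GGCGTATAATGCGC"))
```

Unlike the consensus string, the weight matrix still gives partial credit to a site that differs from the consensus at a tolerated position, which is why it copes better with degenerate signals.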

Page 28: Gene finding and gene structure prediction

Methods for signal detection

• A number of methods are used for signal detection:

  • Weight array matrices: weight matrices based on dinucleotide frequencies. They take into account the non-independence of adjacent positions in the sites.

  • Maximal dependence decomposition (MDD): MDD generates a model that captures significant dependencies between non-adjacent as well as adjacent positions, starting from an aligned set of signals.
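
The weight array idea, where each position is conditioned on the preceding one, can be sketched as a position-specific first-order model. This is a simplified sketch on invented data: the function names and the pseudocount are assumptions, and MDD itself is not implemented here.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_wam(sites, pseudocount=1.0):
    """Estimate P(x_0) and position-specific P(x_i | x_{i-1}) from aligned sites."""
    n = len(sites)
    first = Counter(s[0] for s in sites)
    p_first = {b: (first[b] + pseudocount) / (n + 4 * pseudocount) for b in BASES}
    p_cond = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prev_totals = Counter(s[i - 1] for s in sites)
        p_cond.append({
            (a, b): (pairs[(a, b)] + pseudocount) / (prev_totals[a] + 4 * pseudocount)
            for a in BASES for b in BASES
        })
    return p_first, p_cond


def wam_log_prob(wam, seq):
    """Log-probability of seq under the weight array model."""
    p_first, p_cond = wam
    logp = math.log(p_first[seq[0]])
    for i, table in enumerate(p_cond, start=1):
        logp += math.log(table[(seq[i - 1], seq[i])])
    return logp


# Hypothetical sites in which the second base depends on the first.
wam = build_wam(["AG", "AG", "CT", "CT"])
print(wam_log_prob(wam, "AG") > wam_log_prob(wam, "AT"))  # True
```

In this toy data set an ordinary weight matrix would score "AG" and "AT" identically, because each column on its own is evenly split; the weight array separates them because T was never observed after A.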

Page 29: Gene finding and gene structure prediction

Methods for signal detection

• Methods for signal detection:

  • Hidden Markov Models (HMMs):

    • HMMs use a probabilistic framework to infer the probability that a sequence corresponds to a real signal.

  • Neural Networks (NNs):

    • NNs are trained with positive and negative examples. NNs "discover" the features that distinguish the two sets.
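
To make the probabilistic framework of HMMs more concrete, the sketch below decodes the most probable state path of a toy two-state HMM with the Viterbi algorithm. Everything in it is an assumption made for illustration: the states "B" (background, A+T rich) and "G" (G+C rich), all probabilities, and the function name. It is not the model used by any of the tools named in these slides.

```python
import math

# Hypothetical two-state model: B = background (A+T rich), G = G+C rich region.
STATES = "BG"
START = {"B": 0.5, "G": 0.5}
TRANS = {"B": {"B": 0.9, "G": 0.1}, "G": {"B": 0.1, "G": 0.9}}
EMIT = {
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(obs, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Most probable state path for obs, computed in log space."""
    # v[s] = best log-probability of any path that ends in state s
    v = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []
    for symbol in obs[1:]:
        prev, v, pointers = v, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers[s] = best
            v[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: v[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


print(viterbi("AAAAAAGGGGGG"))  # BBBBBBGGGGGG
```

Working in log space avoids numerical underflow on long sequences, which matters as soon as the input is longer than a few hundred symbols.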

Page 30: Gene finding and gene structure prediction

Methods for signal detection

Page 31: Gene finding and gene structure prediction

Signal detection limitations

• Problems with signal detection:

  • DNA sequence signals have low information content.

  • Signals are highly unspecific and degenerate.

  • It is difficult to distinguish between true and false positives.

• How to improve signal detection:

  • Take context into consideration (e.g. an acceptor site must be flanked by an intron and an exon).

  • Combine with coding statistics (compositional bias).

Page 32: Gene finding and gene structure prediction

Types of coding statistics

• Intergenic regions, introns, and exons have different nucleotide content.

• These compositional differences can be used to infer gene structure.

• Examples of coding statistics:

  • ORF length:

    • Assuming a uniform random distribution, a stop codon occurs every 64/3 codons (≈ 21 codons) on average.

    • In coding regions, the average frequency of stop codons decreases.

Page 33: Gene finding and gene structure prediction

Types of coding statistics

    • This measure is sensitive to frame-shift errors.

    • It cannot detect short coding regions.

  • Bias in nucleotide content in coding regions:

    • Generally, coding regions are G+C rich.

    • There are exceptions! For example, coding regions of P. falciparum are A+T rich.
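
The two coding statistics above can be written down in a few lines. This is a hedged sketch: the function names, window size, and step are assumptions for illustration, and the stop-codon figures rely on the same uniform random model as the slide (3 stop codons out of 64, independent codons).

```python
P_STOP = 3 / 64  # 3 of the 64 codons are stops under the uniform random model


def expected_stop_spacing(p_stop=P_STOP):
    """Mean number of codons per stop codon in random sequence (64/3, about 21)."""
    return 1 / p_stop


def prob_open(n_codons, p_stop=P_STOP):
    """Probability that n_codons consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=100, step=50):
    """(start, G+C fraction) for each full window along seq."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


print(round(expected_stop_spacing(), 1))   # 21.3
print(prob_open(100) < 0.01)               # True
print(sliding_gc("GGGGAAAA", window=4, step=4))
```

Under this model a stretch of 100 codons without a stop arises by chance less than 1% of the time, which is why long ORFs are informative, and also why short coding regions cannot be detected this way.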

Page 34: Gene finding and gene structure prediction

Integrating signal and compositional information for gene structure prediction

• A number of methods exist for gene structure prediction which integrate different techniques to detect signals (splice sites, promoters, etc.) and coding statistics.

• All these methods are classifiers based on machine learning theory.

• Training sets are required to train the algorithms.

Page 35: Gene finding and gene structure prediction

Ab initio methods: Generalized HMMs

Page 36: Gene finding and gene structure prediction

Ab initio methods: Generalized HMMs

Page 37: Gene finding and gene structure prediction

Other ab initio methods

• GENSCAN

• HMMgene

• Linear and quadratic discriminant analysis

  • FGENES

  • MZEF

• Decision trees

• Neural network

  • GRAIL

Page 38: Gene finding and gene structure prediction
Page 39: Gene finding and gene structure prediction
Page 40: Gene finding and gene structure prediction

Gene finding:

Homology methods

Page 41: Gene finding and gene structure prediction

Homology methods: a simple view

Page 42: Gene finding and gene structure prediction

Homology methods: Procrustes

Procrustes: robber who altered his victims to fit his bed by stretching them or cutting off their legs (Classical Mythology)

Page 43: Gene finding and gene structure prediction

Homology methods: Procrustes

Page 44: Gene finding and gene structure prediction

Homology methods: Genewise

• Uses HMMs to compare DNA sequences to protein sequences at the level of their conceptual translation, regardless of sequencing errors and introns.

• Principle:

  • The exon model used in Genewise is an HMM with 3 base states (match, insert, delete), with additional transitions between states to account for frame shifts.

  • Intron states have been added to the base model.

  • Genewise directly compares HMM profiles of proteins or domains to the gene structure HMM.

Page 45: Gene finding and gene structure prediction

Homology methods: Genewise

• Genewise is a powerful tool, but time consuming.

• Requires strong similarities (>70% identity) to produce good predictions.

• Genewise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.

Page 46: Gene finding and gene structure prediction

Homology methods: Genewise

Page 47: Gene finding and gene structure prediction

Homology methods: sim4

• Aligns cDNA to genomic sequences.

• sim4 performs standard dynamic programming:

  • models splice sites

  • introns are treated as a special kind of gap with low penalties

• sim4 performs very well, but needs strong similarity between the sequences.

Page 48: Gene finding and gene structure prediction

Homology methods: BLAST

• BLAST can be used to find genomic sequences similar to proteins, ESTs, and cDNAs.

• A BLAST hit does not necessarily correspond to an exon. Some post-processing is required.

• BLAST can indicate the rough position of exons, but says nothing about the gene structure.

Page 49: Gene finding and gene structure prediction

Homology methods: BLAST

• However, BLAST is fast and can reduce the search space for other programs.

Page 50: Gene finding and gene structure prediction

Homology methods: Trimming with BLAST
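
One way to picture how BLAST hits can reduce the search space is to merge nearby hit intervals and pad them with flanking sequence before handing the trimmed region to a slower program. The sketch below is an assumption about how such trimming could be done, not a description of the procedure on the slide: the function name, the `flank` and `max_gap` values, and the 0-based half-open coordinates are all hypothetical, and the hits are invented example data rather than real BLAST output.

```python
def candidate_regions(hits, seq_length, flank=1000, max_gap=5000):
    """Merge (start, end) hit intervals into padded candidate regions.

    Hits closer than max_gap are assumed to belong to the same gene;
    each merged region is extended by flank bases on both sides.
    """
    regions = []
    for start, end in sorted(hits):
        if regions and start - regions[-1][1] <= max_gap:
            # Close to the previous hit: extend the current region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    # Pad each region and clip it to the bounds of the genomic sequence.
    return [(max(0, s - flank), min(seq_length, e + flank)) for s, e in regions]


# Hypothetical hit coordinates on a 60,000 bp genomic sequence.
hits = [(12000, 12150), (10000, 10200), (50000, 50300)]
print(candidate_regions(hits, seq_length=60000))
# [(9000, 13150), (49000, 51300)]
```

Each returned region could then be passed to a more accurate but slower homology method instead of the full genomic sequence.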