Modelling genomes Gil McVean Department of Statistics, Oxford


Post on 17-Jan-2016


Page 1: Modelling genomes

Modelling genomes

Gil McVean, Department of Statistics, Oxford

Page 2: Modelling genomes

Why would we want to model a genome?

• To identify genes
– Protein-coding
– RNA
– Small RNAs

• To identify regulatory elements
– Transcription factor binding sites
– Enhancers

• To classify genome content
– Repeat DNA
– Unique sequence

• To understand the processes that shape genomes
– Mutation
– Recombination
– Duplication
– Rearrangement
– Natural selection

Page 3: Modelling genomes

A rather simple model for a protein-coding gene

STATES: Non-coding DNA → Start codon → Codon → TERM (stop codon) → Non-coding DNA

EMISSIONS:
Non-coding DNA: T: 30%, C: 20%, A: 30%, G: 20%
Start codon: ATG: 100%
Codon: AAA: 1/61, AAC: 1/61, AAG: 1/61, …, TTT: 1/61
TERM: TAA: 30%, TAG: 40%, TGA: 30%
Non-coding DNA: T: 30%, C: 20%, A: 30%, G: 20%
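The emission probabilities of this simple gene model can be turned into a small simulator. This is a minimal sketch, not the author's implementation: the slide gives no transition probabilities, so the flank length and the number of codons are fixed by hand here, and all function names are made up for illustration.

```python
import itertools
import random

# Emission probabilities taken from the slide.
NONCODING_PROBS = {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}
STOP_PROBS = {"TAA": 0.3, "TAG": 0.4, "TGA": 0.3}

# The 61 sense codons: all 64 triplets minus the three stop codons.
SENSE_CODONS = [
    "".join(triplet)
    for triplet in itertools.product("TCAG", repeat=3)
    if "".join(triplet) not in STOP_PROBS
]


def sample_noncoding(length, rng):
    """Emit `length` independent bases from the non-coding state."""
    bases = list(NONCODING_PROBS)
    weights = list(NONCODING_PROBS.values())
    return "".join(rng.choices(bases, weights=weights, k=length))


def sample_gene(n_codons, rng):
    """Emit ATG, then `n_codons` uniform sense codons, then one stop codon."""
    codons = ["ATG"]
    codons += rng.choices(SENSE_CODONS, k=n_codons)
    stops = list(STOP_PROBS)
    codons += rng.choices(stops, weights=list(STOP_PROBS.values()), k=1)
    return "".join(codons)


def sample_region(flank, n_codons, seed=0):
    """Return (left flank, gene, right flank); lengths are assumptions."""
    rng = random.Random(seed)
    left = sample_noncoding(flank, rng)
    gene = sample_gene(n_codons, rng)
    right = sample_noncoding(flank, rng)
    return left, gene, right
```

Calling `sample_region(10, 5)` returns a 10-base flank, a gene of 7 codons (start, five sense codons, stop) and another 10-base flank.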

Page 4: Modelling genomes

A ‘genome’ model is like any other statistical model

• Define model
• Explore properties
• Estimate parameters from data
• Test goodness-of-fit
• Refine model

Page 5: Modelling genomes

Hidden Markov Models in bioinformatics

• The model of a gene just described can be thought of as a hidden Markov model (HMM)

– The underlying states evolve in a Markov fashion, but we observe features (the DNA sequence) emitted by those states

• You will remember that there are lots of nice computational properties of hidden Markov models that we can use for inference

– Finding a most likely sequence of states
– Calculating posterior probabilities of a given state at a given position

• There are also various algorithms we can use to estimate parameters of HMMs (e.g. ML estimation by EM)

• How would you use the model of a gene to find new genes?
– How well do you think it would do?

Page 6: Modelling genomes

Making useful HMMs in bioinformatics

• To be useful, HMMs for genes have to incorporate many features
– Regulatory sequences
– Intron-splicing features
– Correlations and biases in amino acid and base composition

• A REALLY important feature to capture is their evolution
– Important parts of genes and genomes evolve slower due to constraint

Page 7: Modelling genomes

Searching for homology

• If we compare human and chimpanzee sequences they are approximately 98.8% identical at the DNA level. It is ‘easy’ to identify which parts of the genome in humans correspond to which parts in chimps

• If we compare human with, say, mouse, we can see some parts that are similar, and other parts where there is only vague or even no obvious similarity.

• When measuring evolution, we need to identify regions that are homologous

– Homology means similarity by descent

• Traditionally, the problem of identifying homology has been intrinsically linked to the problem of alignment

Page 8: Modelling genomes

Alignment of PfEMP1 proteins from P. falciparum

Page 9: Modelling genomes

The simplest problem: aligning two sequences

• Suppose we have just two protein sequences that we want to align

• In evolution, three types of event can happen
– Mutation to new amino acids
– Insertion of new amino acids
– Deletion of amino acids

• We want to work out which amino acids in the two sequences are homologous – i.e. related to each other through shared ancestry

WAKIS
WEEKS

W-AKIS
WEEK-S

What do the ‘-’s really mean?

Page 10: Modelling genomes

How can we construct an alignment algorithm?

• What we want to do is to look at every possible alignment and choose the one that is ‘best’

• What we have to do is to find an efficient algorithm that can search every possible alignment and that has an objective measure as to what ‘best’ means

• A natural approach is to make a model of alignments, parameterise it and find the alignment that maximises the likelihood

• Although the problem sounds hard we can solve it using a hidden Markov model structure
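To see why looking at every possible alignment one by one is not practical, it helps to count them. The sketch below assumes that an alignment is a sequence of columns, each being an aligned pair, a gap in X or a gap in Y, and that different orderings of adjacent gaps count as different alignments. Under that assumption the counts are the Delannoy numbers; the function name is made up for illustration.

```python
def count_alignments(n, m):
    """Count global alignments of sequences of length n and m.

    table[i][j] is the number of ways to align the first i residues of X
    with the first j residues of Y. The last column is either an aligned
    pair, a residue of X against a gap, or a residue of Y against a gap.
    """
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column aligns X_i with Y_j
                + table[i - 1][j]    # last column puts X_i against a gap
                + table[i][j - 1]    # last column puts Y_j against a gap
            )
    return table[n][m]
```

Two sequences of length 5 already have 1683 alignments under this definition, and the count grows exponentially with length, which is why a dynamic-programming structure is needed.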

Page 11: Modelling genomes

How does it work?

• Suppose residues Xi and Yj are aligned to each other

• Three things could happen next
– The next two residues in each sequence could also align (A)
– A gap could be introduced in sequence X (B)
– A gap could be introduced in sequence Y (C)

• We can parameterise the probabilities of each event

(A)
X: Xi Xi+1
Y: Yj Yj+1

(B)
X: Xi -
Y: Yj Yj+1

(C)
X: Xi Xi+1
Y: Yj -

Page 12: Modelling genomes

The full algorithm

• We need to consider similar transitions for the cases when residue Xi is aligned to a gap after residue Yj, and when Yj is aligned to a gap after Xi

• We need to specify various probabilities
– The probability of inserting a gap
– The probability of extending a gap
– The probability of finishing the alignment
– The probability of observing an aligned pair of residues (20x20)
– The probability of observing a residue aligned to a gap (20)

• Once specified we can use the Viterbi and Forward/Backward algorithms to identify ML alignments, sample from the posterior or calculate posterior probabilities

X: Xi-a … Xi
Y: Yj … -

X: Xi … -
Y: Yj-a … Yj

Page 13: Modelling genomes

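The forward algorithm

[Diagram: states H and D at position i feed into state H at position i+1, which emits X_{i+1}]

$$ f_H(i+1) = \left[ f_H(i)\, q_{HH} + f_D(i)\, q_{DH} \right] e_H(X_{i+1}) $$

Transition probabilities = q_{ij}

Emission probabilities = e_k(X_{i+1})

In alignment the state space is two-dimensional (residue i aligned to residue j)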
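The forward recursion can be sketched for an ordinary one-dimensional HMM with the two states H and D used on this slide. The initial, transition and emission values below are hypothetical toy numbers chosen only for illustration, and the function names are not from the slides. A brute-force sum over all state paths is included as a check on small inputs.

```python
from itertools import product

STATES = ("H", "D")
# Hypothetical toy parameters, for illustration only.
INIT = {"H": 0.5, "D": 0.5}
TRANS = {"H": {"H": 0.9, "D": 0.1}, "D": {"H": 0.2, "D": 0.8}}
EMIT = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "D": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_likelihood(obs):
    """Likelihood of obs, summing over all state paths in O(len * states^2)."""
    # f[k] is the joint probability of the observations so far and of
    # being in state k at the current position.
    f = {k: INIT[k] * EMIT[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        f = {
            k: EMIT[k][x] * sum(f[j] * TRANS[j][k] for j in STATES)
            for k in STATES
        }
    return sum(f.values())


def brute_force_likelihood(obs):
    """Same quantity by explicit enumeration of every state path."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][x]
        total += p
    return total
```

The two functions agree on short sequences, but only the forward recursion remains feasible as the sequence grows, since the number of paths doubles with every position.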

Page 14: Modelling genomes

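The Viterbi algorithm

[Diagram: states H and D at positions i-1, i and i+1, emitting X_{i-1}, X_i and X_{i+1}]

$$ v_k(i+1) = e_k(X_{i+1}) \max_j \left[ v_j(i)\, q_{jk} \right] $$

A traceback matrix is used to keep track of the best partial alignments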
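The Viterbi recursion and its traceback can be sketched for a one-dimensional HMM with states H and D. As before, the parameter values are hypothetical toy numbers and the function name is made up; the recursion is done in log space to avoid underflow, which assumes every probability used is non-zero.

```python
import math

STATES = ("H", "D")
# Hypothetical toy parameters, for illustration only.
INIT = {"H": 0.5, "D": 0.5}
TRANS = {"H": {"H": 0.9, "D": 0.1}, "D": {"H": 0.2, "D": 0.8}}
EMIT = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "D": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(obs):
    """Return (most likely state path, its joint log probability)."""
    v = {k: math.log(INIT[k]) + math.log(EMIT[k][obs[0]]) for k in STATES}
    traceback = []  # one dict of back-pointers per position after the first
    for x in obs[1:]:
        pointers, new_v = {}, {}
        for k in STATES:
            # Best predecessor j of state k, maximising v_j(i) * q_jk.
            best = max(STATES, key=lambda j: v[j] + math.log(TRANS[j][k]))
            pointers[k] = best
            new_v[k] = (
                v[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][x])
            )
        traceback.append(pointers)
        v = new_v
    # Follow the back-pointers from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(traceback):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]
```

In the alignment setting the same idea applies, but the table is indexed by a pair of positions (i, j) rather than a single position.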

Page 15: Modelling genomes

An example

• Suppose the gap opening and extension parameters are 0.2 and 0.5 respectively. There is an 80% chance of observing a match, a (20/19)% chance of observing any given mismatch (i.e. the remaining 20% shared equally among the 19 other amino acids) and a 5% chance of observing each unaligned amino acid. (We can ignore termination for the moment.)

• The BEST alignments are given below, each of which has log likelihood of -16.84, or 31% of the total likelihood (lnlk = -15.67).

• In many real situations, the best alignment represents only a fraction of the total likelihood

W-AKIS
WEEK-S

WA-KIS
WEEK-S
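The log likelihood of a single given alignment can be checked by hand or with a few lines of code. The sketch below rests on one reading of the example's parameters, all of which are assumptions: the start behaves like the match state, a match column continues to another match with probability 1 - 2 × 0.2 = 0.6, a gap closes with probability 1 - 0.5, the two gap states cannot follow each other directly, and termination is ignored. The function names are made up for illustration.

```python
import math

GAP_OPEN, GAP_EXTEND = 0.2, 0.5
P_MATCH, P_MISMATCH, P_UNALIGNED = 0.8, 0.2 / 19, 0.05


def column_state(a, b):
    """Classify an alignment column: match state or one of two gap states."""
    if a == "-":
        return "GX"  # gap in sequence X
    if b == "-":
        return "GY"  # gap in sequence Y
    return "M"


def transition(prev, cur):
    """Transition probability between column states (assumed structure)."""
    if prev == "M":
        return 1 - 2 * GAP_OPEN if cur == "M" else GAP_OPEN
    if cur == "M":
        return 1 - GAP_EXTEND
    if cur == prev:
        return GAP_EXTEND
    return 0.0  # assumption: no direct move between the two gap states


def emission(a, b):
    """Emission probability of one alignment column."""
    if a == "-" or b == "-":
        return P_UNALIGNED
    return P_MATCH if a == b else P_MISMATCH


def alignment_loglik(x_row, y_row):
    """Log likelihood of one gapped alignment given as two equal-length rows."""
    if len(x_row) != len(y_row):
        raise ValueError("alignment rows must have equal length")
    loglik, prev = 0.0, "M"  # assumption: the start behaves like state M
    for a, b in zip(x_row, y_row):
        cur = column_state(a, b)
        p = transition(prev, cur) * emission(a, b)
        if p == 0.0:
            return -math.inf
        loglik += math.log(p)
        prev = cur
    return loglik
```

Under these assumptions both `alignment_loglik("W-AKIS", "WEEK-S")` and `alignment_loglik("WA-KIS", "WEEK-S")` come out at about -16.84, the value quoted above. The agreement does not establish that this is exactly the model used for the slide, and the sketch only scores a supplied alignment; it does not search over alignments.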

Page 16: Modelling genomes

Posterior decoding

• Using the forward-backward algorithm we can calculate the posterior probability that any residue is aligned to any other, or that a given residue is in a gap state

[Figure: two 5 × 5 grids of posterior alignment probabilities, with residues X1 to X5 as columns and Y1 to Y5 as rows; the second grid is conditional on X2 being aligned to Y3]
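A grid of posterior alignment probabilities like the one described here could be computed with a forward-backward pass over a pair HMM. This is a minimal sketch under assumptions: a match state and two gap states, the parameter values of the worked example, the start treated like the match state, no direct moves between the two gap states, and termination ignored. All names are made up for illustration.

```python
def pair_posteriors(x, y, delta=0.2, eps=0.5, p_match=0.8, p_gap=0.05):
    """Posterior probability that x[i] is aligned to y[j], for all i, j.

    Returns (posterior grid, forward likelihood, backward likelihood).
    """
    n, m = len(x), len(y)
    p_mis = (1 - p_match) / 19

    def e(i, j):
        # Emission for aligning residue i of x with residue j of y (1-based).
        return p_match if x[i - 1] == y[j - 1] else p_mis

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward: f*[i][j] covers the first i residues of x and j of y.
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0  # assumption: the start behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = e(i, j) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1])
                )
            if j > 0:  # gap in x: y[j] is emitted unaligned
                fX[i][j] = p_gap * (delta * fM[i][j - 1] + eps * fX[i][j - 1])
            if i > 0:  # gap in y: x[i] is emitted unaligned
                fY[i][j] = p_gap * (delta * fM[i - 1][j] + eps * fY[i - 1][j])
    total_forward = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward: b*[i][j] is the probability of the remaining residues.
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = e(i + 1, j + 1) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            gx = p_gap * bX[i][j + 1] if j < m else 0.0
            gy = p_gap * bY[i + 1][j] if i < n else 0.0
            bM[i][j] = (1 - 2 * delta) * diag + delta * (gx + gy)
            bX[i][j] = (1 - eps) * diag + eps * gx
            bY[i][j] = (1 - eps) * diag + eps * gy
    total_backward = bM[0][0]

    posterior = [
        [fM[i][j] * bM[i][j] / total_forward for j in range(1, m + 1)]
        for i in range(1, n + 1)
    ]
    return posterior, total_forward, total_backward
```

For example, `pair_posteriors("WAKIS", "WEEKS")` returns a 5 × 5 grid in which each row sums to at most 1; the shortfall in a row is the posterior probability that the corresponding residue of X sits opposite a gap.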

Page 17: Modelling genomes

Extending the method

• Originally, alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Gotoh 1982) were not explicitly defined as hidden Markov models

– Finite-state automata (FSA)

• There have been many extensions to the original idea
– Local alignment
– Repeat alignment
– Protein family identification
– Gene finding
– Multiple alignment

• The alignment algorithm is very much a workhorse of bioinformatics, as an alignment is needed for almost all subsequent analyses (e.g. phylogenetic tree reconstruction, population genetic inference)

– However, relying on a single alignment is not always a great idea

Page 18: Modelling genomes

Doing away with alignment

• For most problems, the alignment is not of primary interest

• The natural thing to do is to integrate over alignments (as in the FB algorithm) to estimate parameters of interest

• The key problem is that there is no computationally efficient algorithm for statistical multiple alignment. All widely-used methods use heuristic approaches

Page 19: Modelling genomes

Gene conversion and var gene diversity in P. falciparum

• Multiple alignment methods typically assume the sequences are related to each other through an evolutionary tree

• For the case of multi-gene families, this may not be the case, because gene conversion between copies can lead to mosaic structures

• If we wish to learn about the processes of conversion, a natural approach is to model the mosaicism

– In the case of var genes, the sequences are so diverged that we also need to consider the problem of alignment

Page 20: Modelling genomes

Mosaic alignment

• We could model the (n+1)th sequence as a mosaic of the previous n

• We can calculate the likelihood of observing a given sequence by summing over all possible mosaic structures and their alignment

• We can also identify the most likely mosaic structure and calculate the expected number of recombination events

– Repeating the procedure for all sequences provides a way of assessing the importance of mosaicism within the family

Page 21: Modelling genomes

Extensive mosaicism within the var family