genome evolution. amos tanay 2009 genome evolution lecture 4: species, genomes and trees

Genome Evolution. Amos Tanay 2009

Genome evolution

Lecture 4: Species, Genomes and Trees


What is a species?

• Multiple definitions

• Free flow of genetic information within a population

• Weak (or zero) flow of information across species barriers

[Figure: two strains of one population (Strain 1, Strain 2) compared with two separate species (Species 1, Species 2)]

We modify the Wright-Fisher or Moran model by removing the assumption of random mixing.

Instead, we can assume that subpopulations are more likely to mate among themselves.

Different models are possible; all end up increasing the genetic distance between subpopulations.
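A minimal simulation sketch of this idea, assuming a two-deme Wright-Fisher model with a symmetric migration fraction `m`. The function name, the haploid setup and the parameter values are illustrative assumptions, not part of the lecture. With `m = 0.5` the demes mix randomly; as `m` shrinks, the allele frequencies of the two demes drift apart.

```python
import random

def wright_fisher_two_demes(n, m, generations, p0=0.5, seed=0):
    """Allele frequencies in two demes of n haploid individuals each.

    m is the (assumed) fraction of each deme's parents drawn from the other
    deme: m = 0.5 is random mixing, m = 0 is complete isolation.
    """
    rng = random.Random(seed)
    p = [p0, p0]
    for _ in range(generations):
        # Parent pool of each deme: mostly local, a fraction m from the other deme.
        pool = [(1 - m) * p[0] + m * p[1], (1 - m) * p[1] + m * p[0]]
        # Binomial sampling of n offspring per deme (genetic drift).
        p = [sum(rng.random() < q for _ in range(n)) / n for q in pool]
    return p

if __name__ == "__main__":
    for m in (0.5, 0.05, 0.0):
        p1, p2 = wright_fisher_two_demes(n=200, m=m, generations=100)
        print(f"m={m}: |p1 - p2| = {abs(p1 - p2):.3f}")
```

A single run is noisy; averaging `|p1 - p2|` over many seeds gives a clearer picture of how restricted mating increases the distance between subpopulations.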


Speciation

Allopatric speciation – occurs through geographical separation

Parapatric speciation – occurs without geographical separation but with weak flow of genetic information

Sympatric speciation – occurs while information is flowing

Barriers can be genetic, physical, or behavioral

The emergence of new species is called speciation

It is well accepted that speciation is driven by the formation of reproductive barriers


Allopatric speciation

Charis Butterflies in South America: different species

Åland Islands, Glanville fritillary population: same species

Factors that limit gene flow are quite diverse, and go beyond geography: habitat, sexual preferences, season, pollinator…

Many other factors can form a barrier: physical incompatibility, hybrid sterility (mule), pre- or post-zygotic lethality…

“Finally, then, I suppose that a large number of closely allied or representative species... were originally formed in parts formerly isolated” (Darwin)


Sympatric speciation

Following Darwin, and prior to population genetics and genetics in general, evolutionary biologists considered sympatric speciation to be the leading factor generating new species.

The idea was that species adapt to niches while co-existing in the same habitat

Sympatric speciation is, however, difficult to explain using the standard population genetics of interbreeding populations. Mayr (and Dobzhansky) made strong arguments suggesting that allopatric speciation is the major (or only) driver of biodiversity

Results from the last 20-30 years have however suggested that sympatric speciation may still be important

Studies of cichlid fish species in African lakes showed incredible diversity: 500 endemic species in Lake Victoria, up to 1000 in Lake Malawi

The history of some of these lakes may have included massive dry-out and geographical separation.

In smaller lakes (shown here is Barombi Mbo in Cameroon), dry-out is geographically unlikely, and several species (7) with a probable common ancestor do suggest sympatry


Species trees

Speciation is irreversible! (with some minor exceptions – think parasites)

We end up with a branching process: forming a tree

[Figure: a species tree growing toward the present time. Strains (Strain 1, Strain 2) repeatedly split into new species (Species 1 to Species 4), and one lineage ends in extinction.]


A little more about phylogenetics – next time


Facts on trees

• A tree is a connected graph without cycles

• We will use directed trees: each edge/lineage has a direction (time)

• Directed acyclic graph (DAG): a directed graph without cycles

• Binary tree: each node has one or zero parents (incoming edges) and two or zero children (outgoing edges)

• A binary tree on n extant species has n-1 inner nodes (exercise: prove)

• Each inner node partitions a binary tree into three disconnected parts (up, left, right)

• The root of the tree is the only node without parents

• Topological order: a permutation of the nodes such that each node appears after its parents

• Traversal: BFS/DFS
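The facts above can be checked on a small example. The sketch below assumes a tree stored as a child-to-parent dictionary; the node names and helper functions are hypothetical, chosen only for illustration. It finds the root, produces a topological order by BFS, and counts leaves and inner nodes.

```python
from collections import deque

# Hypothetical example tree: child -> parent (the root has no entry).
parent = {"S1": "H1", "S2": "H1", "H1": "H2", "S3": "H2"}

def children_of(parent):
    """Invert the parent relation into node -> list of children."""
    kids = {}
    for child, par in parent.items():
        kids.setdefault(par, []).append(child)
    return kids

def find_root(parent):
    nodes = set(parent) | set(parent.values())
    roots = [v for v in nodes if v not in parent]
    assert len(roots) == 1, "a tree has exactly one node without a parent"
    return roots[0]

def topological_order(parent):
    """BFS from the root: every node appears after its parent."""
    kids = children_of(parent)
    order, queue = [], deque([find_root(parent)])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(kids.get(node, []))
    return order

def count_leaves_and_inner(parent):
    kids = children_of(parent)
    nodes = set(parent) | set(parent.values())
    inner = [v for v in nodes if v in kids]
    leaves = [v for v in nodes if v not in kids]
    return len(leaves), len(inner)

if __name__ == "__main__":
    print(topological_order(parent))
    print(count_leaves_and_inner(parent))
```

For this three-leaf binary tree the count gives 3 leaves and 2 inner nodes, in line with the n-1 rule.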


Evolutionary inference

We can usually observe only the extant populations

But we want to infer the history of the evolutionary process

-What did the ancestral populations/species look like? (nodes in the tree)

-What was the evolutionary process that brought an ancestral genome into an extant one? (edges in the tree)

So we will develop methods for inference: estimating the values of missing variables based on partial observations


Do we need inference?

Getting direct evidence on the evolutionary history is only partially possible:

The fossil record has probably given us more evolutionary understanding than any other resource (definitely more than genomes)

But it cannot teach us much about evolution at the genome level, and we cannot use it to learn how to read the genome itself

New technologies promise to sequence the genomes of extinct species (mammoth, Neanderthals). But this is inherently limited by material availability


Why do we have a chance with inference?

We are trying to infer the past based on the present. Does this make any sense at all?

The past is correlated with the present

[Figure: A (past) evolves into B (present); a low substitution probability means a high correlation between A and B]

$$\Pr(B \mid A), \qquad \mathrm{COV}(A, B)$$

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$


Maximum parsimony

If we assume that the traits on the tree are changing slowly

Then an ancestral trait is usually the same as the extant ones

For each ancestral node, we have evidence coming in from 3 directions, and almost always two of them should agree

Formally: given a tree T, and observations S_i (from some alphabet) on the extant species:

1) compute the minimal number of changes along the tree,

2) Find the possible values at each ancestral node given an evolutionary scenario involving the minimal number of changes

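[Figure: a three-leaf tree with extant states A, C, A and unknown ancestral nodes (?). Assigning C to the ancestor requires 2 substitutions; assigning A requires 1 substitution.]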


Computing the parsimony score

Maximum Parsimony Algorithm (following Fitch 1971):

Start with D = 0, and up_set[i] a bitvector for each node

Up(i):
    if (extant(i)) { up_set[i] = S_i; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] is empty) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }

Compute the minimal number of changes by calling Up(root)

[Figure: tree with leaves S1, S2, S3 and unknown inner nodes (?), annotated with up_set[4] and up_set[5]]
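The Up pass can be sketched in Python as follows. This is a minimal version for a single character, assuming the tree is given as a dictionary from each inner node to its two children; the node names and the `fitch_score` helper are illustrative assumptions.

```python
def fitch_score(tree, leaf_states):
    """Minimal number of changes for one character on a rooted binary tree.

    tree maps each inner node to its (left, right) children; leaf_states maps
    each extant node to its observed symbol.
    """
    up_set = {}
    changes = 0

    def up(node):
        nonlocal changes
        if node in leaf_states:                 # extant species: singleton set
            up_set[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        up(left)
        up(right)
        common = up_set[left] & up_set[right]
        if common:
            up_set[node] = common
        else:                                   # children disagree: one change
            changes += 1
            up_set[node] = up_set[left] | up_set[right]

    # The root is the only inner node that is nobody's child.
    root = (set(tree) - {c for kids in tree.values() for c in kids}).pop()
    up(root)
    return changes, up_set

# Hypothetical three-leaf tree with observed states A, C, A.
tree = {"H2": ("H1", "S3"), "H1": ("S1", "S2")}
score, up_set = fitch_score(tree, {"S1": "A", "S2": "C", "S3": "A"})
print(score, up_set["H1"], up_set["H2"])
```

Python sets play the role of the bitvectors here; for a four-letter alphabet a 4-bit integer with `&` and `|` would work the same way.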


Parsimony “inference”

Algorithm (following Fitch 1971):

Up(i):
    if (extant(i)) { up_set[i] = S_i; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] is empty) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }

Down(i):
    down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
    if (down_set[i] is empty) {
        down_set[i] = up_set[sib(i)] ∪ down_set[par(i)]
    }
    if (extant(i)) return
    Down(left(i)); Down(right(i))

Algorithm:
    D = 0
    Up(root)
    down_set[root] = empty
    Down(right(root)); Down(left(root))

[Figure: tree with leaves S1, S2, S3 and unknown inner nodes (?), annotated with down_set[4], down_set[5] and up_set[3]]

Set[i] = up_set[i] ∩ down_set[i]
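As a hedged companion, the sketch below does not implement the down_set recursion of the slide. It uses the simpler textbook Fitch traceback instead: after the Up pass, walk from the root down and keep the parent's state wherever the child's set allows it. This returns one most-parsimonious labelling rather than all of them. The tree encoding and function names are assumptions for illustration.

```python
def fitch_reconstruct(tree, leaf_states):
    """One most-parsimonious labelling of all nodes (Fitch-style traceback)."""
    up_set = {}

    def up(node):
        if node in leaf_states:
            up_set[node] = {leaf_states[node]}
            return
        left, right = tree[node]
        up(left)
        up(right)
        common = up_set[left] & up_set[right]
        up_set[node] = common or (up_set[left] | up_set[right])

    state = {}

    def down(node, parent_state):
        # Keep the parent's state when it is allowed here; otherwise a change
        # happens on this branch and any allowed state will do.
        if parent_state in up_set[node]:
            state[node] = parent_state
        else:
            state[node] = min(up_set[node])     # deterministic arbitrary choice
        for child in tree.get(node, ()):
            down(child, state[node])

    root = (set(tree) - {c for kids in tree.values() for c in kids}).pop()
    up(root)
    down(root, None)
    return state

def count_changes(tree, state):
    """Number of branches whose two ends carry different states."""
    return sum(state[p] != state[c] for p, kids in tree.items() for c in kids)

# Hypothetical three-leaf tree with observed states A, C, A.
tree = {"H2": ("H1", "S3"), "H1": ("S1", "S2")}
state = fitch_reconstruct(tree, {"S1": "A", "S2": "C", "S3": "A"})
print(state, count_changes(tree, state))
```

On this example both ancestors are labelled A, and the single change falls on the branch leading to S2.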


Genomic sequencing

In its first 100 years, evolutionary theory was about organismal traits

Starting from the 1960’s, molecular traits became available (mostly looking at proteins)

Since the 1990’s, and to its full extent today, we can cheaply sequence whole genomes

It is expected that within a few years, technology will routinely allow the study of whole genomes in large population samples.

For example: the work of the 3-billion-dollar Human Genome Project can now be done by a single lab within a few weeks for $100,000, and the price is rapidly dropping

The 1000 Genomes Project

Sequencing technology is rapidly evolving:

Illumina GAII (here at WIS): ~40,000,000 reads of ~36 bp each, $5k-10k

Jan 2010: 300 million reads, 2 × 150 bp…


Genome evolution: nucleotides are not simple traits

Point mutation (substitution): A → C

Deletion: AAA → AA

Insertion: AA → AAA

Duplication: GGAACC → GGAAGGAACC

We transform nucleotides to traits using alignment

An alignment specifies which positions in two or more genomes represent the same “trait” – assuming they are the outcome of a single genealogy

As we have seen, this need not be well defined (e.g. duplications), but we will usually have to assume it is.

A basic pairwise alignment optimization problem is solved using dynamic programming

Pairwise alignment: find the alignment minimizing the number (or some linear cost) of mismatches (including deletion/insertion characters)

Affine gap pairwise alignment: find the alignment minimizing the cost of mismatches + the cost of gaps (fixed cost for a new gap, another cost for each gap character)

(see any standard text on computational genomics)
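A minimal sketch of the basic (linear-cost) pairwise problem, assuming unit costs for a mismatch and for each gap character. The function name and default costs are illustrative choices. It returns only the minimal cost, not the alignment itself.

```python
def alignment_cost(v, w, mismatch=1, gap=1):
    """Minimal cost of a global alignment of v and w with linear gap cost."""
    rows, cols = len(v) + 1, len(w) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # v[:i] aligned against gaps only
        cost[i][0] = i * gap
    for j in range(1, cols):            # w[:j] aligned against gaps only
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if v[i - 1] == w[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # match / mismatch
                             cost[i - 1][j] + gap,       # gap in w
                             cost[i][j - 1] + gap)       # gap in v
    return cost[rows - 1][cols - 1]

if __name__ == "__main__":
    print(alignment_cost("AAA", "AA"))
    print(alignment_cost("GGAACC", "GGAAGGAACC"))
```

The table has (len(v)+1) × (len(w)+1) cells, so time and memory are quadratic in the sequence lengths. An affine gap cost would need separate tables for "in a gap" and "not in a gap" states, which this sketch leaves out.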


The alignment dynamic programming graph (for reference)

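[Figure: the alignment dynamic programming grid. Species 1 (A T C T G A T C) runs along one axis, indexed by j = 0..8; Species 2 (T G C A T A C) runs along the other axis, indexed by i = 0..7. Diagonal edges are match/mismatch steps.]

Global Alignment

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

Local Alignment

$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

How can we align the whole query to part of the database?

a.k.a. Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment)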

Initialize 0,0 to
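The two recurrences differ only in the extra 0 option and in where the final score is read. A minimal sketch, assuming a simple scoring function δ with match = +1, mismatch = -1 and gap = -1 (these values and the function name are illustrative, not taken from the lecture):

```python
def align_score(v, w, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score under s[i][j] = max(...) as in the recurrences."""
    def delta(a, b):
        return match if a == b else mismatch

    rows, cols = len(v) + 1, len(w) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are charged on the borders.
        for i in range(1, rows):
            s[i][0] = i * gap
        for j in range(1, cols):
            s[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = max(s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                        s[i - 1][j] + gap,
                        s[i][j - 1] + gap)
            # Local alignment may restart from 0 at any cell.
            s[i][j] = max(0, score) if local else score
            best = max(best, s[i][j])
    # Global: score of the bottom-right corner; local: best cell anywhere.
    return best if local else s[-1][-1]

if __name__ == "__main__":
    print(align_score("ACGT", "AGT"))
    print(align_score("GGGACGTCCC", "TTACGTAA", local=True))
```

Aligning a whole query to part of a database sequence is an intermediate case: gaps before and after the match are free in the database sequence but not in the query.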


Multiple alignment

The problem: given a set of sequences (each from a different species), find their optimal multiple alignment.

Multiple alignment cost: many possible definitions. In most of these the problem is NP-hard.

In fact, we should be looking for the complete evolutionary history of these sequences

Therefore, the optimal alignment should in principle define the genealogy of each nucleotide, such that these histories are reasonable

In practice, multiple alignment algorithms are using heuristics based on these ideas.

Designing and implementing a really principled version of these algorithms is not easy

…ACGAATAGCAGATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGAT…
…ACGTTTAGCAAATGGGCAGATGGCAGTCTAGA-----AGCATGAGACTAGATAGAT…
…ACGAATAGCAAAT------ATGCCAGTCTAGATCGAAAGCATGCCACTAGATAGAT…

1. Pairwise alignment (distances)

2. Build a “guide tree”

3. Align from leaves to root, each time a pair (sequences or profiles)


Genome alignment

Given a set of genomes, each consisting of up to several billion nucleotides, the problem becomes computationally intensive

Heuristics are used to search for pieces of alignment (Blast)

Pieces are then combined into chains of large fragments

Genome alignment can be projected over some reference genome. Complex situations with duplications, large deletions and insertions require complex solutions and are routinely ignored


Models for nucleotide substitutions

How to model the evolution of a nucleotide?

We discussed its potential allele frequency dynamics and fixation probability

The rate of substitution in a neutral locus:

$$K = 2N\mu \cdot \frac{1}{2N} = \mu$$

But mutations can happen at different rates for different nucleotides. The two simplest models describing substitution rates date from a period when sequence data was very scarce (Jukes-Cantor from 1969, Kimura from 1980):

[Figure: substitution diagrams over the nucleotides A, C, G, T for the Jukes-Cantor and Kimura models]
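The two models can be written down as rate matrices. The sketch below assumes the usual parameterization: a single rate `alpha` for every substitution in Jukes-Cantor, and separate rates for transitions (A↔G, C↔T) and transversions in the Kimura two-parameter model. The helper names and the nucleotide order are illustrative assumptions.

```python
NUCS = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate_matrix(rate):
    """Build Q from off-diagonal rates; each diagonal makes its row sum to 0."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCS):
        for j, y in enumerate(NUCS):
            if i != j:
                q[i][j] = rate(x, y)
        q[i][i] = -sum(q[i])        # diagonal is still 0.0 when summed
    return q

def jukes_cantor(alpha):
    """All substitutions happen at the same rate alpha."""
    return rate_matrix(lambda x, y: alpha)

def kimura(alpha, beta):
    """Transitions at rate alpha, transversions at rate beta."""
    return rate_matrix(lambda x, y: alpha if (x, y) in TRANSITIONS else beta)

if __name__ == "__main__":
    for row in kimura(alpha=0.2, beta=0.1):
        print(row)
```

Setting `alpha == beta` in `kimura` recovers the Jukes-Cantor matrix, which is one way to see that Jukes-Cantor is the special case with a single free parameter.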


Rates and transition probabilities

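The process’s rate matrix:

$$Q = \begin{pmatrix}
-\sum_{i \neq 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\
q_{10} & -\sum_{i \neq 1} q_{1i} & q_{12} & \cdots & q_{1n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \neq n} q_{ni}
\end{pmatrix}$$

Transition differential equations (backward form):

$$P_{ij}(t+s) - P_{ij}(t) = \sum_{k} P_{ik}(s) P_{kj}(t) - P_{ij}(t) = \sum_{k \neq i} P_{ik}(s) P_{kj}(t) + [P_{ii}(s) - 1]\, P_{ij}(t)$$

$$s \to 0: \quad P'_{ij}(t) = \sum_{k \neq i} q_{ik} P_{kj}(t) + q_{ii} P_{ij}(t)$$

$$P'(t) = Q P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$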


Matrix exponential

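The differential equation:

$$P'(t) = Q P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$

Series solution:

$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$

$$(\exp(Qt))' = \sum_{i=1}^{\infty} \frac{1}{(i-1)!} Q^i t^{i-1} = Q \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i = Q \exp(Qt)$$

Summing over different path lengths: 1-path, 2-path, 3-path, 4-path, 5-path, …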
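A minimal sketch of the truncated series in plain Python, checked against the known closed form for the Jukes-Cantor model (every off-diagonal rate equal to `alpha`). The function names, the number of terms and the parameter values are assumptions for illustration.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(q, t, terms=30):
    """exp(Qt) approximated by the first `terms` summands of sum (Qt)^i / i!."""
    n = len(q)
    qt = [[x * t for x in row] for row in q]
    term = [[float(i == j) for j in range(n)] for i in range(n)]    # (Qt)^0 / 0!
    total = [row[:] for row in term]
    for i in range(1, terms):
        # (Qt)^i / i! from the previous term: multiply by Qt, divide by i.
        term = [[x / i for x in row] for row in mat_mul(term, qt)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

# Jukes-Cantor rate matrix with off-diagonal rate alpha (illustrative value).
alpha = 0.1
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = expm_series(Q, t=1.0)

if __name__ == "__main__":
    print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha))
```

The two printed numbers should agree closely, since here the entries of Qt are small and the series converges quickly.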


Computing the matrix exponential using spectral decomposition

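Writing the spectral decomposition as $Q = S \Lambda S^{-1}$, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ holding the eigenvalues of Q:

$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i = S \left( \sum_{i=0}^{\infty} \frac{1}{i!} \Lambda^i t^i \right) S^{-1} = S \exp(\Lambda t)\, S^{-1}$$

$$\exp(\Lambda t) = \begin{pmatrix} e^{\lambda_1 t} & 0 & \cdots & 0 \\ 0 & e^{\lambda_2 t} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{\lambda_n t} \end{pmatrix}$$

The eigenvalues determine the convergence properties of the process.

The largest eigenvalue of exp(Qt) must be 1 (it corresponds to the eigenvalue 0 of Q): its associated left eigenvector is the stationary distribution of the process.

The second largest eigenvalue dominates the convergence of the process.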
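As a worked instance, consider the Jukes-Cantor model with every off-diagonal rate equal to α (the symbols J, I and α are notation assumed here for the example). The matrix exponential then has a closed form that exhibits both statements about the eigenvalues:

```latex
\begin{align*}
Q &= \alpha\,(J - 4I), \qquad J = \mathbf{1}\mathbf{1}^{\top} \ \text{(the $4 \times 4$ all-ones matrix)} \\
\lambda_1 &= 0 \ \text{(eigenvector } \mathbf{1}\text{)}, \qquad
\lambda_2 = \lambda_3 = \lambda_4 = -4\alpha \\
\exp(Qt) &= \tfrac{1}{4} J + e^{-4\alpha t}\left(I - \tfrac{1}{4} J\right) \\
P_{ii}(t) &= \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}, \qquad
P_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t} \quad (i \neq j)
\end{align*}
```

The eigenvalue 0 of Q gives the eigenvalue 1 of exp(Qt) and the uniform stationary distribution (1/4 for each nucleotide), while the repeated eigenvalue -4α sets the rate e^{-4αt} at which every entry converges to 1/4.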


Computing the matrix exponential

$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$

Series methods: just take the first k summands, $\sum_{i=0}^{k} \frac{1}{i!} Q^i$

• reasonable when ||Q|| <= 1

• if the terms are converging, you are ok

• can do scaling/squaring: $e^{Q} = \left(e^{Q/m}\right)^{m}$

Eigenvalues/decomposition: $e^{Q} = S e^{B} S^{-1}$

• good when the matrix is symmetric

• problems when having similar eigenvalues

Multiple methods with other types of B (e.g., triangular)
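Scaling and squaring can be sketched in a few lines: divide Qt by m = 2^k so that a short series is accurate, then square the result k times. The function names, the choice k = 10 and the series length are assumptions for illustration; production code would normally rely on a tested numerical library instead.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_scaling_squaring(q, t, squarings=10, terms=8):
    """exp(Qt) = (exp(Qt / m))^m with m = 2**squarings."""
    n = len(q)
    m = 2 ** squarings
    small = [[x * t / m for x in row] for row in q]       # scaled matrix Qt / m
    term = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in term]
    for i in range(1, terms):                             # short Taylor series
        term = [[x / i for x in row] for row in mat_mul(term, small)]
        result = [[result[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    for _ in range(squarings):                            # undo the scaling
        result = mat_mul(result, result)
    return result

# Jukes-Cantor rate matrix with off-diagonal rate alpha (illustrative value).
alpha = 0.1
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = expm_scaling_squaring(Q, t=5.0)

if __name__ == "__main__":
    print(P[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * 5.0))
```

The design trade-off is that each squaring doubles accumulated rounding error, so the number of squarings should be just large enough to make the scaled matrix small.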


The paradigm

Alignment

Ancestral Inference on a phylogenetic tree

Tree

Learning a model

Evolutionary rates

Detecting selection and function

Phylogenetics


The simple tree model

[Figure: tree with root H2; H2 has children H1 and S3, and H1 has children S1 and S2]

Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}

Extant species $S^j_1, \ldots, S^j_n$; ancestral species $H^j_1, \ldots, H^j_{n-1}$

Structure

Tree T: parent relation $\mathrm{pa}\,S_i$, $\mathrm{pa}\,H_i$ ($\mathrm{pa}\,S_1 = H_1$, $\mathrm{pa}\,S_3 = H_2$, the root: $H_2$)

The model is defined using conditional probability distributions and the root “prior” probability distribution

Joint distribution

For multiple loci we can assume independence and use the same parameters (today):

$$\Pr(s, h) = \prod_j \Pr(s^j, h^j)$$

$$\Pr(s^j, h^j) = \Pr(h^j_{root}) \prod_{i \neq root} \Pr(x^j_i \mid \mathrm{pa}\,x^j_i)$$

In the triplet:

$$\Pr(s, h) = \Pr(h_2)\,\Pr(h_1 \mid h_2)\,\Pr(s_3 \mid h_2)\,\Pr(s_2 \mid h_1)\,\Pr(s_1 \mid h_1)$$

The model parameters can be the conditional probability distribution tables (CPDs):

$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$

Or we can have a single rate matrix Q and branch lengths:

$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\, x_i}$$
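The triplet factorization can be evaluated directly. The sketch below assumes that the same CPD table applies on every branch, that its rows and columns follow the order A, C, G, T, and that the root prior is uniform; these are simplifying assumptions for the example, and the function names are hypothetical.

```python
NUCS = "ACGT"

# Conditional probability table Pr(x | y); the A, C, G, T order is assumed.
CPD = [[0.96, 0.01, 0.02, 0.01],
       [0.01, 0.96, 0.01, 0.02],
       [0.02, 0.01, 0.96, 0.01],
       [0.01, 0.02, 0.01, 0.96]]

PRIOR = {x: 0.25 for x in NUCS}     # uniform root prior (an assumption here)

def cond(child, parent):
    """Pr(child | parent); the table is symmetric, so orientation is moot."""
    return CPD[NUCS.index(parent)][NUCS.index(child)]

def joint(s1, s2, s3, h1, h2):
    """Pr(s, h) = Pr(h2) Pr(h1|h2) Pr(s3|h2) Pr(s1|h1) Pr(s2|h1)."""
    return (PRIOR[h2] * cond(h1, h2) * cond(s3, h2)
            * cond(s1, h1) * cond(s2, h1))

if __name__ == "__main__":
    print(joint("A", "C", "A", h1="A", h2="A"))
    print(joint("A", "C", "A", h1="C", h2="C"))
```

Comparing the two printed values shows how the joint probability ranks alternative ancestral assignments for the same observed leaves.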


Ancestral inference

[Diagram: Alignment, Tree, Ancestral Inference on a phylogenetic tree, Learning a model, Evolutionary rates]

We assume the model (structure, parameters) is given, and denote it by θ:

$$\Pr(s, h \mid \theta) = \Pr(h_{root}) \prod_{i \neq root} \Pr(x_i \mid \mathrm{pa}\,x_i)$$

The total probability of the data s:

$$\Pr(s \mid \theta) = \sum_{h} P(h, s \mid \theta)$$

This is also called the likelihood L(θ). Computing Pr(s) is the inference problem

Given the total probability it is easy to compute the posterior of h given the data (Easy!):

$$\Pr(h \mid s, \theta) = \frac{P(h, s \mid \theta)}{\Pr(s \mid \theta)}$$

Posterior of h_i given the data, by marginalization over h_i (Exponential?):

$$\Pr(h_i = x \mid s, \theta) = \sum_{h \mid h_i = x} P(h, s \mid \theta) \,/\, \Pr(s \mid \theta)$$
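For a tiny tree the sums can be computed by brute-force enumeration, which makes the exponential cost concrete: with k ancestral nodes there are 4^k assignments of h. The sketch assumes the triplet tree (root H2 with children H1 and S3; H1 with children S1 and S2), one shared CPD table in A, C, G, T order and a uniform root prior; the function names are illustrative.

```python
from itertools import product

NUCS = "ACGT"
CPD = [[0.96, 0.01, 0.02, 0.01],
       [0.01, 0.96, 0.01, 0.02],
       [0.02, 0.01, 0.96, 0.01],
       [0.01, 0.02, 0.01, 0.96]]

def cond(child, parent):
    return CPD[NUCS.index(parent)][NUCS.index(child)]

def joint(s1, s2, s3, h1, h2):
    # Uniform root prior 0.25 (an assumption of this example).
    return 0.25 * cond(h1, h2) * cond(s3, h2) * cond(s1, h1) * cond(s2, h1)

def likelihood(s1, s2, s3):
    """Pr(s): sum of the joint over all 4^2 ancestral assignments."""
    return sum(joint(s1, s2, s3, h1, h2) for h1, h2 in product(NUCS, repeat=2))

def posterior_h1(s1, s2, s3):
    """Pr(h1 = x | s): marginalize h2, then normalize by Pr(s)."""
    total = likelihood(s1, s2, s3)
    return {x: sum(joint(s1, s2, s3, x, h2) for h2 in NUCS) / total
            for x in NUCS}

if __name__ == "__main__":
    print(likelihood("A", "C", "A"))
    print(posterior_h1("A", "C", "A"))
```

Enumeration is only feasible for a handful of ancestral nodes, which is the motivation for a dynamic programming solution on larger trees.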


Tree models

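[Figure: tree with observed leaves A, C, A and unknown ancestral nodes (?)]

Given partial observations s:

$$\Pr(h_i = x \mid s) = \sum_{h \mid h_i = x} P(h, s) \,/\, \Pr(s)$$

The total probability of the data:

$$\Pr((A, C, A))$$

The posterior of an ancestral state:

$$\Pr(h_1 = A \mid (A, C, A))$$

$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$

Uniform prior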


Dynamic programming to compute the total probability

Algorithm (following Felsenstein 1981):

Up(i):
    if (extant(i)) { for each a: up[i][a] = (a == S_i ? 1 : 0); return }
    Up(r(i)); Up(l(i))
    for each a:
        up[i][a] = Σ_{b,c} Pr(X_l(i)=b | X_i=a) up[l(i)][b] Pr(X_r(i)=c | X_i=a) up[r(i)][c]

Down(i):
    for each a:
        down[i][a] = Σ_{b,c} Pr(X_sib(i)=b | X_par(i)=c) up[sib(i)][b] Pr(X_i=a | X_par(i)=c) down[par(i)][c]
    if (extant(i)) return
    Down(r(i)); Down(l(i))

Algorithm:
    Up(root)
    L = 0
    foreach a {
        L += Pr(root=a) * up[root][a]
        down[root][a] = Pr(root=a)
    }
    LL = log(L)
    Down(r(root)); Down(l(root))

[Figure: tree with leaves S1, S2, S3 and unknown inner nodes (?), annotated with up[4] and up[5]]
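A minimal Python sketch of the Up pass and the resulting log-likelihood, assuming the triplet tree (root H2 with children H1 and S3; H1 with children S1 and S2), one shared CPD table in A, C, G, T order for every branch, and a uniform root prior. The names `TREE`, `up` and `log_likelihood` are illustrative.

```python
import math
from itertools import product

NUCS = "ACGT"
CPD = [[0.96, 0.01, 0.02, 0.01],     # CPD[parent][child], assumed A, C, G, T order
       [0.01, 0.96, 0.01, 0.02],
       [0.02, 0.01, 0.96, 0.01],
       [0.01, 0.02, 0.01, 0.96]]
TREE = {"H2": ("H1", "S3"), "H1": ("S1", "S2")}   # inner node -> (left, right)

def up(node, leaves, table):
    """Fill table[node][a] = Pr(data below node | X_node = a)."""
    if node in leaves:
        table[node] = [1.0 if a == leaves[node] else 0.0 for a in NUCS]
        return
    left, right = TREE[node]
    up(left, leaves, table)
    up(right, leaves, table)
    # The double sum over (b, c) factorizes into a product of two sums.
    table[node] = [
        sum(CPD[a][b] * table[left][b] for b in range(4))
        * sum(CPD[a][c] * table[right][c] for c in range(4))
        for a in range(4)
    ]

def log_likelihood(leaves, root="H2", prior=0.25):
    table = {}
    up(root, leaves, table)
    return math.log(sum(prior * x for x in table[root]))

if __name__ == "__main__":
    print(log_likelihood({"S1": "A", "S2": "C", "S3": "A"}))
    # Sanity check: the likelihoods of all 4^3 leaf patterns sum to 1.
    print(sum(math.exp(log_likelihood({"S1": a, "S2": b, "S3": c}))
              for a, b, c in product(NUCS, repeat=3)))
```

The work per node is a constant (a few sums over four states), so the total cost is linear in the number of nodes rather than exponential in the number of ancestors.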


Computing marginals and posteriors

Algorithm (following Felsenstein 1981): the same Up and Down recursions and the same driver as on the previous slide.

[Figure: tree with leaves S1, S2, S3 and unknown inner nodes (?), annotated with down[4], down[5] and up[3]]

P(h_i = c | s) = up[i][c] * down[i][c] / (Σ_j up[i][j] * down[i][j])
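The Up and Down passes combine into posteriors for every ancestral node. The sketch below makes the same assumptions as before: the triplet tree, one shared CPD table in A, C, G, T order, and a uniform root prior. The function and variable names are hypothetical.

```python
NUCS = "ACGT"
CPD = [[0.96, 0.01, 0.02, 0.01],     # CPD[parent][child], assumed A, C, G, T order
       [0.01, 0.96, 0.01, 0.02],
       [0.02, 0.01, 0.96, 0.01],
       [0.01, 0.02, 0.01, 0.96]]
TREE = {"H2": ("H1", "S3"), "H1": ("S1", "S2")}   # inner node -> (left, right)

def up_down(leaves, root="H2", prior=0.25):
    """Posterior Pr(h_i = a | s) for every inner node i."""
    up, down = {}, {}

    def go_up(node):
        if node in leaves:
            up[node] = [1.0 if a == leaves[node] else 0.0 for a in NUCS]
            return
        left, right = TREE[node]
        go_up(left)
        go_up(right)
        up[node] = [message(left, a) * message(right, a) for a in range(4)]

    def message(node, c):
        # Pr(data below node | parent of node is in state c)
        return sum(CPD[c][b] * up[node][b] for b in range(4))

    def go_down(node, par, sib):
        # down[node][a] = Pr(data outside the subtree of node, X_node = a)
        down[node] = [sum(down[par][c] * CPD[c][a] * message(sib, c)
                          for c in range(4)) for a in range(4)]
        if node in TREE:
            left, right = TREE[node]
            go_down(left, node, right)
            go_down(right, node, left)

    go_up(root)
    down[root] = [prior] * 4
    left, right = TREE[root]
    go_down(left, root, right)
    go_down(right, root, left)

    posterior = {}
    for node in TREE:
        weights = [up[node][a] * down[node][a] for a in range(4)]
        z = sum(weights)            # equals Pr(s) at every node
        posterior[node] = {NUCS[a]: weights[a] / z for a in range(4)}
    return posterior

if __name__ == "__main__":
    for node, dist in up_down({"S1": "A", "S2": "C", "S3": "A"}).items():
        print(node, dist)
```

Because up[i][a] · down[i][a] equals Pr(s, X_i = a), the normalizing sum is the same total probability at every node, which is a convenient consistency check.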


Transition posteriors: not independent!

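[Figure: tree whose observed leaf DATA are A, C, A, C]

$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$

Down: (0.25), (0.25), (0.25), (0.25)

Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)

Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)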
