genome evolution. amos tanay 2009 genome evolution lecture 4: species, genomes and trees
Post on 20-Jan-2016
Genome Evolution. Amos Tanay 2009
Genome evolution
Lecture 4: Species, Genomes and Trees
What is a species?
• Multiple definitions exist
• Free flow of genetic information within a population
• Weak (or zero) flow of information across species barriers
[Figure: two populations, labeled Species 1 and Species 2 (or Strain 1 and Strain 2)]
We modify the Wright-Fisher or Moran model by removing the assumption of random mixing.
Instead, we can assume that members of a subpopulation are more likely to mate among themselves.
Different models are possible; all end up increasing the genetic distance between subpopulations.
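One such model can be sketched as follows. This is a minimal illustration, assuming two equal-size subpopulations (demes) of haploid individuals and a symmetric migration rate m; the function name, the parameters and the migration scheme are choices made for this sketch and are not taken from the lecture.

```python
import random

def structured_wright_fisher(n_per_deme, m, generations, p0=0.5, seed=0):
    """Allele frequencies in two demes under Wright-Fisher sampling with migration.

    Assumption of this sketch: each offspring picks its parent from the other
    deme with probability m, and from its own deme otherwise.
    """
    rng = random.Random(seed)
    p = [p0, p0]                      # allele frequency in deme 0 and deme 1
    history = [tuple(p)]
    for _ in range(generations):
        new_p = []
        for d in (0, 1):
            # expected allele frequency among the parents of deme d
            q = (1 - m) * p[d] + m * p[1 - d]
            # binomial sampling of n_per_deme offspring
            count = sum(1 for _ in range(n_per_deme) if rng.random() < q)
            new_p.append(count / n_per_deme)
        p = new_p
        history.append(tuple(p))
    return history

# Example: weak mixing (m = 0.01) between two demes of 100 individuals
trajectory = structured_wright_fisher(100, 0.01, 200)
```

Comparing the two frequencies in `trajectory` for small and large values of m is one way to see how reduced mixing lets the subpopulations drift apart.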
Speciation
Allopatric speciation – occurs through geographical separation
Parapatric speciation – occurs without geographical separation but with weak flow of genetic information
Sympatric speciation – occurs while information is flowing
Barriers can be genetic, physical, or behavioral
The emergence of new species is called speciation
It is well accepted that speciation is driven by the formation of reproductive barriers
Allopatric speciation
Charis Butterflies in South America: different species
Åland Islands, Glanville fritillary population: same species
Factors that limit gene flow are quite diverse and go beyond geography: habitat, sexual preferences, season, pollinator…
Many other factors can form a barrier: physical incompatibility, hybrid sterility (mule), pre- or post-zygotic lethality…
“Finally, then, I suppose that a large number of closely allied or representative species... were originally formed in parts formerly isolated” (Darwin)
Sympatric speciation
Following Darwin, and prior to population genetics and genetics in general, evolutionary biologists considered sympatric speciation to be the leading factor generating new species.
The idea was that species adapt to niches while co-existing in the same habitat
Sympatric speciation is, however, difficult to explain using the standard population genetics of interbreeding populations. Mayr (and Dobzhansky) made strong arguments suggesting that allopatric speciation is the major (or only) driver of biodiversity
Results from the last 20-30 years have, however, suggested that sympatric speciation may still be important
Studies of cichlid fish species in African lakes showed incredible diversity: 500 endemic species in Lake Victoria, up to 1000 in Lake Malawi
The history of some of these lakes may have included massive dry-out and geographical separation.
In smaller lakes (shown here is Barombi Mbo in Cameroon), dry-out is geographically unlikely, and several species (7) with a probable common ancestor do suggest sympatry
Species trees
Speciation is irreversible! (with some minor exceptions – think parasites)
We end up with a branching process: forming a tree
[Figure: a species tree branching through time up to the present; lineages are labeled Species 1 to Species 4 (with Strain 1 and Strain 2), and one lineage ends in extinction]
A little more about phylogenetics – next time
Facts on trees
• A tree is a connected graph without cycles
• We will use directed trees: each edge/lineage has a direction (time)
• Directed acyclic graph (DAG): a directed graph without cycles
• A binary tree: each node has one or zero parents (incoming edges) and two or zero children (outgoing edges)
• A binary tree on n extant species has n-1 inner nodes (prove)
• Each inner node other than the root partitions a binary tree into three disconnected parts (up, left, right)
• The root of the tree is the only node without parents
• Topological order: a permutation of the nodes such that each node appears after its parents
• BFS/DFS
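The facts above can be checked on a small example. The sketch below stores a rooted binary tree as a dictionary from each inner node to its two children; the representation and the node names are assumptions made for illustration.

```python
# A rooted binary tree: inner node -> (left child, right child).
# Leaves (extant species) do not appear as keys. Node names are made up.
children = {"H2": ("H1", "S3"), "H1": ("S1", "S2")}

def find_root(children):
    """The root is the only node that is nobody's child."""
    all_children = {c for pair in children.values() for c in pair}
    roots = [n for n in children if n not in all_children]
    assert len(roots) == 1
    return roots[0]

def topological_order(children):
    """Depth-first traversal from the root: each node appears after its parent."""
    order, stack = [], [find_root(children)]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, ()))
    return order

order = topological_order(children)
leaves = [n for n in order if n not in children]
inner = [n for n in order if n in children]
# With n leaves there are n-1 inner nodes
assert len(inner) == len(leaves) - 1
```

Replacing the stack with a queue (for example `collections.deque` with `popleft`) would turn the same loop into a breadth-first traversal, which is also a valid topological order.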
Evolutionary inference
We can usually observe only the extant populations
But we want to infer the history of the evolutionary process
-What did the ancestral populations/species look like? (nodes in the tree)
-What was the evolutionary process that brought an ancestral genome into an extant one? (edges in the tree)
So we will develop methods for inference: estimating the values of missing variables based on partial observations
Do we need inference?
Getting direct evidence on the evolutionary history is only partially possible:
The fossil record has probably given us more evolutionary understanding than any other resource (definitely more than genomes)
But it cannot teach us much about evolution at the genome level, and we cannot use it to learn how to read the genome itself
New technologies promise to sequence the genomes of extinct species (mammoth, Neanderthals). But this is inherently limited by material availability
Why do we have a chance with inference?
We are trying to infer the past based on the present. Does this make any sense at all?
The past is correlated with the present
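[Figure: A (past) → B (present)]
Low substitution probability ⇒ high correlation between past and present
The dependence of B on A is described by $\Pr(B \mid A)$ or $\mathrm{COV}(A, B)$, and Bayes' rule inverts it:
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$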
Maximum parsimony
If we assume that the traits on the tree are changing slowly
Then the ancestral trait is usually the same as the extant one
For each ancestral node, we have evidence coming in from three directions; almost always two of them should agree
Formally: given a tree T, and observations Si (from some alphabet) on the extant species:
1) Compute the minimal number of changes along the tree
2) Find the possible values at each ancestral node given an evolutionary scenario involving the minimal number of changes
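[Figure: a three-leaf tree with leaves A, C, A and unknown ancestral nodes (?). Setting the ancestors to C requires 2 substitutions; setting them to A requires 1 substitution]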
Maximum Parsimony Algorithm (following Fitch 1971):
Start with D = 0, and up_set[i] a bit vector (a set of states) for each node
Up(i):
    if (extant) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Compute the minimal number of changes by calling Up(root)
Computing the parsimony score
[Figure: tree with extant leaves S1, S2, S3; the two ancestral nodes (?) carry up_set[4] and up_set[5]]
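The upward pass can be sketched in Python with sets in place of bit vectors. The tree layout used here (node 4 as the parent of S1 and S2, node 5 as the root above node 4 and S3) is an assumption read off the figure labels, and the dictionary representation is chosen for this sketch.

```python
def fitch_up(tree, node, leaf_state, up_set):
    """Post-order pass: fills up_set and returns the number of changes below node.

    tree maps each inner node to its (left, right) children; leaf_state maps
    each extant node to its observed character.
    """
    if node not in tree:                       # extant species
        up_set[node] = {leaf_state[node]}
        return 0
    left, right = tree[node]
    changes = fitch_up(tree, left, leaf_state, up_set)
    changes += fitch_up(tree, right, leaf_state, up_set)
    common = up_set[left] & up_set[right]
    if common:
        up_set[node] = common
    else:                                      # children disagree: count one change
        up_set[node] = up_set[left] | up_set[right]
        changes += 1
    return changes

tree = {5: (4, 3), 4: (1, 2)}                  # node 5 is the root (assumed layout)
leaf_state = {1: "A", 2: "C", 3: "A"}
up_set = {}
score = fitch_up(tree, 5, leaf_state, up_set)  # parsimony score D
```

For the leaves A, C, A this gives a score of one change, with up_set[4] = {A, C} and up_set[5] = {A}.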
Algorithm (following Fitch 1971):
Up(i):
    if (extant) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Down(i):
    down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
    if (down_set[i] = ∅) {
        down_set[i] = up_set[sib(i)] ∪ down_set[par(i)]
    }
    if (extant) return
    Down(left(i)); Down(right(i))
Algorithm:
    D = 0
    Up(root)
    down_set[root] = ∅
    Down(right(root)); Down(left(root))
Parsimony “inference”
[Figure: tree with extant leaves S1, S2, S3; the ancestral nodes (?) carry down_set[4] and down_set[5], and leaf S3 carries up_set[3]]
Set[i] = up_set[i] ∩ down_set[i]
Genomic sequencing
In its first 100 years, evolutionary theory was about organismal traits
Starting from the 1960’s, molecular traits became available (mostly looking at proteins)
Since the 1990’s, and to its full extent today, we can cheaply sequence whole genomes
It is expected that within a few years, technology will routinely allow the study of whole genomes in large population samples.
For example: the 3-billion-dollar Human Genome Project can now be repeated by a single lab within a few weeks for $100,000, and the price is dropping rapidly
The 1000 genomes project
Sequencing technology is rapidly evolving:
Illumina GAII (here at WIS): ~40,000,000 reads of ~36 bp each, $5k-10k
Jan 2010: 300 million reads, 2 × 150 bp…
Genome evolution: nucleotides are not simple traits
Point mutation (substitution): A → C
Deletion: AAA → AA
Insertion: AA → AAA
Duplication: GGAACC → GGAAGGAACC
We transform nucleotides to traits using alignment
An alignment specifies which positions in two or more genomes represent the same “trait” – assuming they are the outcome of a single genealogy
As we have just seen, this need not be well defined (e.g. with duplications), but we will usually have to assume it is.
A basic pairwise alignment optimization problem is solved using dynamic programming
Pairwise alignment: find the alignment minimizing the number (or some linear cost) of mismatches (including deletion/insertion characters)
Affine gap pairwise alignment: find the alignment minimizing the cost of mismatches + the cost of gaps (a fixed cost for opening a new gap, another cost for each gap character)
(see any standard text on computational genomics)
The alignment dynamic programming graph (for reference)
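[Figure: the dynamic programming grid. Species 1 (A T C T G A T C) runs along the j axis, columns 0 to 8; Species 2 (labels T, G, C, A, T, A, C) runs along the i axis, rows 0 to 7. Diagonal edges are matches/mismatches]
Global alignment (a.k.a. Needleman-Wunsch):
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
Local alignment (a.k.a. Smith-Waterman):
$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
How can we align the entire query to part of the database?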
Initialize 0,0 to
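The two recurrences differ only in the extra 0 option and in where the final score is read. The sketch below computes both scores with a single function; the scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, and only the score is returned, not the alignment itself.

```python
def align_score(v, w, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of v and w by dynamic programming.

    local=False follows the global recurrence; local=True adds the 0 option
    and takes the best cell anywhere in the table.
    """
    rows, cols = len(v) + 1, len(w) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:                              # global: leading gaps are paid for
        for i in range(1, rows):
            s[i][0] = i * gap
        for j in range(1, cols):
            s[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            options = [s[i - 1][j - 1] + delta,   # match/mismatch
                       s[i - 1][j] + gap,         # v[i-1] aligned to a gap
                       s[i][j - 1] + gap]         # w[j-1] aligned to a gap
            if local:
                options.append(0)                 # restart the alignment here
            s[i][j] = max(options)
            best = max(best, s[i][j])
    return best if local else s[-1][-1]

global_score = align_score("ACGT", "AGT")                     # one gap, three matches
local_score = align_score("AAACGTAAA", "CGT", local=True)     # CGT found inside v
```

Recovering the alignment itself would require a traceback from the final cell (global) or from the best cell (local), which this sketch omits.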
Multiple alignment
The problem: given a set of sequences (each from a different species), find their optimal multiple alignment.
Multiple alignment cost: many possible definitions. In most of these the problem is NP-hard.
In fact, we should be looking for the complete evolutionary history of these sequences
Therefore, the optimal alignment should in principle define the genealogy of each nucleotide, such that these histories are reasonable
In practice, multiple alignment algorithms use heuristics based on these ideas.
Designing and implementing a really principled version of these algorithms is not easy
…ACGAATAGCAGATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGAT…
…ACGTTTAGCAAATGGGCAGATGGCAGTCTAGA-----AGCATGAGACTAGATAGAT…
…ACGAATAGCAAAT------ATGCCAGTCTAGATCGAAAGCATGCCACTAGATAGAT…
1. Pairwise alignment (distances)
2. Build a “guide tree”
3. Align from leaves to root, each time a pair (sequences or profiles)
Genome alignment
Given a set of genomes, each consisting of several billion nucleotides, the problem becomes computationally intensive
Heuristics are used to search for pieces of alignment (BLAST)
Pieces are then combined into chains of large fragments
Genome alignment can be projected over some reference genome. Complex situations with duplications and large deletions and insertions require complex solutions and are routinely ignored
Models for nucleotide substitutions
How to model the evolution of a nucleotide?
We discussed its potential allele frequency dynamics and fixation probability
The rate of substitution in a neutral locus ($2N\mu$ new mutations per generation, each fixing with probability $1/(2N)$):
$$K = 2N\mu \cdot \frac{1}{2N} = \mu$$
But mutations can happen at different rates for different nucleotides. The two simplest models describing substitution rates date from the late 1960s to 1980, when sequence data was very scarce:
[Figure: substitution graphs over the nucleotides A, C, G, T for the Jukes-Cantor model and the Kimura model]
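The two models can be written as rate matrices. The sketch below assumes the Kimura two-parameter form (one rate for transitions A↔G and C↔T, another for transversions), with Jukes-Cantor as the special case where the two rates are equal; the function name and the numeric rates are illustrative.

```python
NUCS = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate_matrix(alpha, beta):
    """Kimura two-parameter Q: rate alpha for transitions, beta for transversions.

    Each diagonal entry is minus the sum of the other entries in its row,
    so every row sums to zero. alpha == beta gives the Jukes-Cantor model.
    """
    q = []
    for x in NUCS:
        row = [0.0 if x == y else (alpha if (x, y) in TRANSITIONS else beta)
               for y in NUCS]
        row[NUCS.index(x)] = -sum(row)
        q.append(row)
    return q

jukes_cantor = rate_matrix(1.0, 1.0)
kimura = rate_matrix(2.0, 1.0)     # transitions twice as fast as transversions
```

With rows and columns ordered A, C, G, T, the first row of `kimura` is [-4, 1, 2, 1]: the A→G transition gets the higher rate.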
Rates and transition probabilities
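The process's rate matrix (each diagonal entry makes its row sum to zero):
$$Q = \begin{pmatrix}
-\sum_{i \ne 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\
q_{10} & -\sum_{i \ne 1} q_{1i} & q_{12} & \cdots & q_{1n} \\
\vdots & & \ddots & & \vdots \\
q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \ne n} q_{ni}
\end{pmatrix}$$
Transition differential equations (backward form), writing $q_{ii}$ for the diagonal entry of row $i$:
$$P_{ij}(s+t) - P_{ij}(t) = \sum_k P_{ik}(s) P_{kj}(t) - P_{ij}(t) = \sum_{k \ne i} P_{ik}(s) P_{kj}(t) + [P_{ii}(s) - 1]\, P_{ij}(t)$$
Dividing by $s$ and letting $s \to 0$:
$$P'_{ij}(t) = \sum_{k \ne i} q_{ik} P_{kj}(t) + q_{ii} P_{ij}(t)$$
$$P'(t) = Q P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$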
Matrix exponential
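The differential equation:
$$P'(t) = Q P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Series solution:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$
$$(\exp(Qt))' = \sum_{i=1}^{\infty} \frac{1}{(i-1)!} Q^i t^{i-1} = Q \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i = Q \exp(Qt)$$
Summing over different path lengths:
[Figure: the terms of the series shown as 1-path, 2-path, 3-path, 4-path and 5-path contributions]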
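The series can be evaluated directly by truncating it after a fixed number of terms. The sketch below uses plain Python lists; the helper names, the number of terms and the test matrix (all off-diagonal rates equal to 1, for which a closed form is known) are assumptions made for illustration.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(q, t, terms=30):
    """Truncated series: sum over i < terms of (Qt)^i / i!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[float(i == j) for j in range(n)] for i in range(n)]   # (Qt)^0 / 0!
    total = [row[:] for row in term]
    for i in range(1, terms):
        term = mat_mul(term, qt)                     # multiply by Qt ...
        term = [[x / i for x in row] for row in term]  # ... and divide by i
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return total

# Rate matrix with every off-diagonal rate equal to 1 (illustrative values)
Q = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
P = expm_series(Q, 0.1)
# Closed form for this Q: probability of staying is 1/4 + 3/4 exp(-4t)
expected_stay = 0.25 + 0.75 * math.exp(-0.4)
```

Here `P[0][0]` agrees with `expected_stay` to within floating-point error, and each row of `P` sums to 1.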
Computing the matrix exponential using spectral decomposition
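Writing $Q = S \Lambda S^{-1}$, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ holding the eigenvalues of $Q$:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i = S \Big( \sum_{i=0}^{\infty} \frac{1}{i!} \Lambda^i t^i \Big) S^{-1} = S \exp(\Lambda t)\, S^{-1}$$
$$\exp(\Lambda t) = \begin{pmatrix}
e^{\lambda_1 t} & 0 & \cdots & 0 \\
0 & e^{\lambda_2 t} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & e^{\lambda_n t}
\end{pmatrix}$$
The eigenvalues determine the convergence properties of the process.
The largest eigenvalue of $\exp(Qt)$ must be 1 (it corresponds to the eigenvalue 0 of $Q$); its associated (left) eigenvector is the stationary distribution of the process.
The second largest eigenvalue dominates the convergence of the process.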
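For a concrete case, the sketch below assumes the Kimura two-parameter rate matrix (order A, C, G, T; transition rate alpha, transversion rate beta), which is symmetric and has eigenvectors that can be written down by hand. The function name is hypothetical; the decomposition is specific to this model and is not a general eigen-solver.

```python
import math

def kimura_transition_matrix(alpha, beta, t):
    """exp(Qt) for the Kimura model via its eigen-decomposition.

    Because Q is symmetric, exp(Qt) = sum_k exp(lambda_k t) v_k v_k^T / |v_k|^2
    over orthogonal eigenvectors v_k (nucleotide order A, C, G, T).
    """
    eigen = [
        (0.0,                   [1, 1, 1, 1]),     # stationary (uniform) direction
        (-4.0 * beta,           [1, -1, 1, -1]),   # purines vs pyrimidines
        (-2.0 * (alpha + beta), [1, 0, -1, 0]),    # A vs G
        (-2.0 * (alpha + beta), [0, 1, 0, -1]),    # C vs T
    ]
    p = [[0.0] * 4 for _ in range(4)]
    for lam, v in eigen:
        weight = math.exp(lam * t) / sum(x * x for x in v)
        for i in range(4):
            for j in range(4):
                p[i][j] += weight * v[i] * v[j]
    return p

P_short = kimura_transition_matrix(2.0, 1.0, 0.01)   # close to the identity
P_long = kimura_transition_matrix(2.0, 1.0, 50.0)    # close to uniform rows
```

As t grows only the eigenvalue-0 term survives, so every row approaches the uniform stationary distribution; the slowest-decaying remaining term sets how fast that happens.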
Computing the matrix exponential
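$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!} Q^i t^i$$
Series methods: just take the first k summands, $\sum_{i=0}^{k} \frac{1}{i!} Q^i$
• reasonable when ||A|| <= 1 (A being the matrix to exponentiate)
• if the terms are converging, you are OK
• can do scaling/squaring: $e^{Q} = \left(e^{Q/m}\right)^m$
Eigenvalues/decomposition: $e^{Q} = S e^{B} S^{-1}$, where $Q = S B S^{-1}$
• good when the matrix is symmetric
• problems when having similar eigenvalues
Multiple methods with other types of B (e.g., triangular)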
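Scaling and squaring can be sketched as follows, assuming m is a power of two so that the m-th power takes only a few matrix squarings. The function names, the number of squarings and the number of series terms are illustrative choices, not tuned values.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_scaling_squaring(q, squarings=10, terms=8):
    """exp(Q) = (exp(Q / m))^m with m = 2**squarings."""
    n = len(q)
    m = 2 ** squarings
    small = [[x / m for x in row] for row in q]        # Q / m has a small norm
    term = [[float(i == j) for j in range(n)] for i in range(n)]
    total = [row[:] for row in term]
    for i in range(1, terms):                          # short series for exp(Q / m)
        term = [[x / i for x in row] for row in mat_mul(term, small)]
        total = [[total[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    for _ in range(squarings):                         # square back up to exp(Q)
        total = mat_mul(total, total)
    return total

# A matrix whose norm is well above 1: all off-diagonal rates equal to 2
Q = [[-6.0 if i == j else 2.0 for j in range(4)] for i in range(4)]
P = expm_scaling_squaring(Q)
expected_stay = 0.25 + 0.75 * math.exp(-8.0)           # closed form for this Q
```

The short series is only ever applied to Q/m, where it converges quickly; the repeated squaring then recovers exp(Q).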
The paradigm
Alignment
Ancestral Inference on a phylogenetic tree
Tree
Learning a model
Evolutionary rates
Detecting selection and function
Phylogenetics
The simple tree model
[Figure: tree with extant leaves S1, S2, S3 and ancestral nodes H1, H2]
Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}
Extant species $S^j_{1..n}$, ancestral species $H^j_{1..(n-1)}$ (j indexes the locus)
Tree T: parent relation pa S_i, pa H_i
(pa S1 = H1, pa S3 = H2, the root: H2)
Structure
The model is defined using conditional probability distributions and the root “prior” probability distribution
The model parameters can be the conditional probability distribution tables (CPDs):
$$\Pr(x \mid y) = \begin{pmatrix}
0.96 & 0.01 & 0.02 & 0.01 \\
0.01 & 0.96 & 0.01 & 0.02 \\
0.02 & 0.01 & 0.96 & 0.01 \\
0.01 & 0.02 & 0.01 & 0.96
\end{pmatrix}$$
Or we can have a single rate matrix Q and branch lengths:
$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\; x_i}$$
Joint distribution
$$\Pr(h, s) = \Pr(h_{root}) \prod_{i \ne root} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
In the triplet:
$$\Pr(s, h) = \Pr(h_2)\,\Pr(h_1 \mid h_2)\,\Pr(s_3 \mid h_2)\,\Pr(s_2 \mid h_1)\,\Pr(s_1 \mid h_1)$$
For multiple loci we can assume independence and use the same parameters (today):
$$\Pr(s, h) = \prod_j \Pr(s^j, h^j)$$
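The joint distribution of the triplet can be evaluated directly from the conditional table. The sketch below assumes the nucleotide order A, C, G, T for the rows and columns of the table and a uniform root prior; since the table is symmetric, reading it as Pr(x | y) or Pr(y | x) gives the same numbers. The variable and function names are made up for this sketch.

```python
NUCS = "ACGT"
# Conditional table from the slide; rows and columns assumed to be in order A, C, G, T
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
PRIOR = {x: 0.25 for x in NUCS}                      # uniform root prior (assumption)
PARENT = {"S1": "H1", "S2": "H1", "S3": "H2", "H1": "H2"}   # H2 is the root

def cond(x, y):
    """Pr(child = x | parent = y)."""
    return P[NUCS.index(y)][NUCS.index(x)]

def joint(assignment):
    """Pr(s, h) = Pr(h_root) * product over non-root nodes of Pr(x_i | pa x_i)."""
    prob = PRIOR[assignment["H2"]]
    for node, parent in PARENT.items():
        prob *= cond(assignment[node], assignment[parent])
    return prob

# Leaves A, C, A with both ancestors set to A
p = joint({"S1": "A", "S2": "C", "S3": "A", "H1": "A", "H2": "A"})
```

For this assignment the product is 0.25 · 0.96 · 0.96 · 0.96 · 0.01: the prior, the H1 edge, the S1 and S3 edges, and the single substitution on the S2 edge.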
Ancestral inference
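[Figure: the paradigm diagram (Alignment, Tree, Ancestral inference on a phylogenetic tree, Learning a model, Evolutionary rates)]
We assume the model (structure, parameters) is given, and denote it by $\theta$:
$$\Pr(h, s \mid \theta) = \Pr(h_{root}) \prod_{i \ne root} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
The total probability of the data s (a sum over all h: exponential?):
$$\Pr(s \mid \theta) = \sum_h P(h, s \mid \theta)$$
This is also called the likelihood $L(\theta)$. Computing Pr(s) is the inference problem
Given the total probability it is easy to compute the posterior (easy!):
$$\Pr(h \mid s, \theta) = \frac{P(h, s \mid \theta)}{\Pr(s \mid \theta)}$$
Marginalization over $h_i$ gives the posterior of $h_i$ given the data:
$$\Pr(h_i = x \mid s, \theta) = \sum_{h \,:\, h_i = x} P(h, s \mid \theta) \,/\, \Pr(s \mid \theta)$$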
Tree models
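[Figure: a three-leaf tree with observed leaves A, C, A and unknown ancestral nodes (?)]
Given partial observations s:
$$\Pr(h_i = x \mid s) = \sum_{h \,:\, h_i = x} P(h, s) \,/\, \Pr(s)$$
The total probability of the data:
$$\Pr((A, C, A))$$
The posterior of an ancestral state:
$$\Pr(h_1 = A \mid (A, C, A))$$
$$\Pr(x \mid y) = \begin{pmatrix}
0.96 & 0.01 & 0.02 & 0.01 \\
0.01 & 0.96 & 0.01 & 0.02 \\
0.02 & 0.01 & 0.96 & 0.01 \\
0.01 & 0.02 & 0.01 & 0.96
\end{pmatrix}$$
Uniform prior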
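On a tree this small, both quantities can be computed by brute-force summation over the hidden states. The sketch assumes the triplet structure (S1 and S2 below H1; H1 and S3 below the root H2), the nucleotide order A, C, G, T for the table, and a uniform prior; the function names are hypothetical.

```python
from itertools import product

NUCS = "ACGT"
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]

def cond(x, y):
    """Pr(child = x | parent = y), with the table in order A, C, G, T."""
    return P[NUCS.index(y)][NUCS.index(x)]

def joint(s1, s2, s3, h1, h2):
    """Pr(s, h) for the triplet, with a uniform prior on the root h2."""
    return 0.25 * cond(h1, h2) * cond(s3, h2) * cond(s1, h1) * cond(s2, h1)

def total_probability(s):
    """Pr(s): sum the joint over every assignment of the hidden nodes."""
    return sum(joint(*s, h1, h2) for h1, h2 in product(NUCS, repeat=2))

def posterior_h1(x, s):
    """Pr(h1 = x | s): marginalize over h2, then normalize by Pr(s)."""
    return sum(joint(*s, x, h2) for h2 in NUCS) / total_probability(s)

s = ("A", "C", "A")
likelihood = total_probability(s)
post_A = posterior_h1("A", s)
```

The sum has 4^k terms for k hidden nodes, which is why this direct approach stops being practical on larger trees.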
Algorithm (following Felsenstein 1981):
Up(i):
    if (extant) { up[i][a] = (a == Si ? 1 : 0); return }
    Up(r(i)); Up(l(i))
    iterate over a:
        up[i][a] = Σ_{b,c} Pr(X_l(i)=b | X_i=a) up[l(i)][b] Pr(X_r(i)=c | X_i=a) up[r(i)][c]
Down(i):
    iterate over a:
        down[i][a] = Σ_{b,c} Pr(X_sib(i)=b | X_par(i)=c) up[sib(i)][b] Pr(X_i=a | X_par(i)=c) down[par(i)][c]
    if (extant) return
    Down(r(i)); Down(l(i))
Algorithm:
    Up(root)
    L = 0
    foreach a {
        L += Pr(root=a) up[root][a]
        down[root][a] = Pr(root=a)
    }
    LL = log(L)
    Down(r(root)); Down(l(root))
Dynamic programming to compute the total probability (Felsenstein)
[Figure: tree with extant leaves S1, S2, S3; the two ancestral nodes (?) carry up[4] and up[5]]
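The upward pass can be sketched in Python as below. The tree layout (node 4 above S1 and S2, root 5 above node 4 and S3), the nucleotide order A, C, G, T for the table and the uniform prior are assumptions of this sketch. The double sum over b and c factorizes into a product of two single sums, which is what the code computes.

```python
import math

NUCS = "ACGT"
# P[a][b] = Pr(child = b | parent = a), order A, C, G, T (assumed)
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]

def up_pass(tree, node, leaf_state, up):
    """up[node][a] = Pr(data below node | node is in state a)."""
    if node not in tree:                               # extant species
        up[node] = [1.0 if a == leaf_state[node] else 0.0 for a in NUCS]
        return
    left, right = tree[node]
    up_pass(tree, left, leaf_state, up)
    up_pass(tree, right, leaf_state, up)
    up[node] = [
        sum(P[a][b] * up[left][b] for b in range(4))
        * sum(P[a][c] * up[right][c] for c in range(4))
        for a in range(4)
    ]

def log_likelihood(tree, root, leaf_state, prior):
    up = {}
    up_pass(tree, root, leaf_state, up)
    # sum over root states first, then take the log
    return math.log(sum(prior[a] * up[root][a] for a in range(4)))

tree = {5: (4, 3), 4: (1, 2)}
LL = log_likelihood(tree, 5, {1: "A", 2: "C", 3: "A"}, [0.25] * 4)
```

Each node is visited once and does a constant amount of work per state pair, so the cost grows linearly with the number of species rather than exponentially.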
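Computing marginals and posteriors (Felsenstein)
[Figure: tree with extant leaves S1, S2, S3; the ancestral nodes (?) carry down[4] and down[5], and leaf S3 carries up[3]]
P(h_i = c | s) = up[i][c] * down[i][c] / (Σ_j up[i][j] down[i][j])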
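Combining the two passes gives the posterior at every ancestral node. The sketch below repeats the upward pass so that it is self-contained, and again assumes the tree layout (node 4 above S1 and S2, root 5 above node 4 and S3), the order A, C, G, T for the table and a uniform prior; all names are illustrative.

```python
NUCS = "ACGT"
# P[c][a] = Pr(child = a | parent = c), order A, C, G, T (assumed)
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]

def up_pass(tree, node, leaf_state, up):
    if node not in tree:
        up[node] = [1.0 if a == leaf_state[node] else 0.0 for a in NUCS]
        return
    left, right = tree[node]
    up_pass(tree, left, leaf_state, up)
    up_pass(tree, right, leaf_state, up)
    up[node] = [sum(P[a][b] * up[left][b] for b in range(4))
                * sum(P[a][c] * up[right][c] for c in range(4)) for a in range(4)]

def down_pass(tree, node, parent, sibling, up, down):
    """down[node][a] = Pr(data outside the subtree of node, node in state a)."""
    # evidence from the sibling subtree, for each parent state c
    sib = [sum(P[c][b] * up[sibling][b] for b in range(4)) for c in range(4)]
    down[node] = [sum(sib[c] * P[c][a] * down[parent][c] for c in range(4))
                  for a in range(4)]
    if node in tree:                                   # recurse only below inner nodes
        left, right = tree[node]
        down_pass(tree, left, node, right, up, down)
        down_pass(tree, right, node, left, up, down)

def posteriors(tree, root, leaf_state, prior):
    up, down = {}, {root: list(prior)}
    up_pass(tree, root, leaf_state, up)
    left, right = tree[root]
    down_pass(tree, left, root, right, up, down)
    down_pass(tree, right, root, left, up, down)
    post = {}
    for node in tree:                                  # ancestral nodes only
        weights = [up[node][a] * down[node][a] for a in range(4)]
        z = sum(weights)                               # equals Pr(s) at every node
        post[node] = [w / z for w in weights]
    return post

post = posteriors({5: (4, 3), 4: (1, 2)}, 5, {1: "A", 2: "C", 3: "A"}, [0.25] * 4)
```

For the leaves A, C, A the posterior at node 4 puts almost all its mass on A, because the third leaf, seen through the root, breaks the tie between A and C.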
Transition posteriors: not independent!
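[Figure: tree whose leaves carry the DATA A, C, A, C]
$$\Pr(x \mid y) = \begin{pmatrix}
0.96 & 0.01 & 0.02 & 0.01 \\
0.01 & 0.96 & 0.01 & 0.02 \\
0.02 & 0.01 & 0.96 & 0.01 \\
0.01 & 0.02 & 0.01 & 0.96
\end{pmatrix}$$
Down: (0.25), (0.25), (0.25), (0.25)
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)
Up: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)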