
Thesis: Inferring Phylogenetic Networks

Travis Barton

May 28, 2019

1 Abstract

The reconstruction of phylogenetic structures is a classic applied biology pursuit and is the main focus of many researchers in the field of evolutionary biology. While traditionally these structures were assumed to be trees, recent opinion amongst phylogeneticists is that network structures may sometimes be a better fit (network structures meaning either a non-tree graph or a collection of multiple trees). This introduces a new problem, however, namely which structure best describes the evolutionary history of a set of species. The purpose of this thesis is to shed some light on that topic, suggesting our own decision algorithm that classifies a set of aligned DNA sequences from four species into a tree or network structure. We select one of 21 labeled four-leaf networks with a combination of two different machine learning techniques. First, we use k-nearest neighbors to select the topology, and then use support vector machines to select the labeled network in that topological class.

2 Phylogenetic Trees and Networks

Phylogenetics is the study of evolutionary relationships. Often, the goal of many researchers is to create a graphical model called a 'tree' to explain how a species, or a collection of species, has evolved through time. This requires examining an array of aligned DNA sequences and using them as the basis for reconstructing the true paths of evolution for the collection of species (Figure 1). The distribution of the observed site patterns along the aligned sequences can give insight as to when a given pair of species split away from one another. While the researcher cannot see the original path that the species' DNA took, they can infer when certain evolutionary events, such as speciation, may have taken place.

While some events, like speciation, can be easily modeled with a tree, other evolutionary events such as horizontal gene transfer (HGT) and hybridization need more complicated models. HGT is the trading of genetic information between two organisms without engaging in sexual/reproductive activities [1]. This is often the case in prokaryotic cells and is often the cause of antibiotic resistances [2]. Another of these events, hybridization, is where two


separate species sexually reproduce to create a new species. This phenomenon is heavily documented, and while the most common example is the crossbreeding of plants, it takes place in both animals and plants. For example, Carl Hubbs famously recorded hybridization between sunfish species in his paper "Hybridization between Fish Species in Nature" [3].

Figure 1: Famous diagram of the separation of the three classifications of life [4].

Trees have been the traditional form of representation for phylogeneticists and have been used in a number of papers. Woese and Fox famously used them to counter the formerly held belief that eukaryotic cells were the result of merging prokaryotic cells, arguing instead that they only shared a common ancestor with a third kind of life, archaebacteria, in their paper "Phylogenetic structure of the prokaryotic domain: The primary kingdoms" [4].

While there were other, smaller, publications leading up to this one, the paper by Woese and Fox was the first to gain national recognition, even being featured on the front page of the New York Times [4].

However, these structures lack the complexity to model real world events like hybridization and horizontal gene transfer. Now, networks as the main structure are becoming more popular in biology as their more complicated structure allows for greater flexibility [5].

Figure 2: Example of a phylogenetic network [6].

On a high level, the main difference between trees and networks is the presence of reticulation edges (Figure 2). Reticulation edges (denoted by dotted lines) represent alternate evolutionary paths and form cycles in the underlying graph. They come in pairs, and a single site can evolve along one of the reticulation edges in a pair, but not both. The differences and definitions will be discussed more in depth in Section 3.

In this thesis, we are working with collections of aligned DNA sequences from 4 species, and our goal is to determine whether the species evolved according to a tree or a network structure. These sequences can be generated by a simulator or be the result of DNA sequencing of real species. The sequences


are representative of currently living species, and we have no information about their ancestry. We are proposing a new algorithm for classifying collections of sequences into 21 different labeled network and tree topologies.

Figure 3: The difference between trees and a 3-cycle network. The dotted edges are the 'reticulation edges' [6].

The goal of reconstructing the evolutionary history of these DNA sequences can be viewed as a missing data problem, where the living species are known quantities, and their common ancestors are missing values.

Historically, there have been a couple of ways that this has been done for trees, such as quartet puzzling, neighbor joining, and maximum likelihood methods. All of these approaches depend on one of three principles: maximum parsimony, distance measurements, or model creation.

While there is an abundance of tree-based reconstruction techniques, there have been very few for network topologies [6]. In this thesis, we will explain our algorithm for selecting the best fitting network for a given set of aligned DNA sequences. In order to do this, we will first use the aligned DNA sequences to compute the relative frequencies of observing each 4-tuple of DNA bases, then we evaluate a collection of polynomials (called phylogenetic invariants, see Section 3.3) at the relative frequencies [7]. We use the values of these polynomials to classify each set of aligned DNA sequences into their labeled topological category.
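The first step of this pipeline, tallying the relative frequency of each observed 4-tuple of bases, can be sketched in a few lines of Python (the function and variable names here are hypothetical illustrations, not the thesis code, which is written in R):

```python
from collections import Counter

def site_pattern_frequencies(seqs):
    """Relative frequency of each 4-tuple (site pattern) across an
    alignment of four equal-length DNA sequences over {A, C, G, T}."""
    assert len(seqs) == 4 and len({len(s) for s in seqs}) == 1
    n_sites = len(seqs[0])
    counts = Counter(zip(*seqs))  # one 4-tuple per alignment column
    return {pattern: c / n_sites for pattern, c in counts.items()}

# Toy alignment; each column is one observed site pattern.
seqs = ["CCCG", "ACCC", "GCCA", "ATAT"]
freqs = site_pattern_frequencies(seqs)
print(freqs[("C", "C", "C", "T")])  # 0.25 (pattern of the second column)
```

In practice the alignment would be far longer than four sites, and the resulting frequency vector (indexed by all 256 possible patterns) is what gets fed into the invariant evaluation.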

3 Markov Models on Trees and Networks

Before we begin talking about the details of networks, we have to establish some definitions. First, let's cover the terminology found in introductory graph theory textbooks, starting with a graph [5]. A graph G = (V, E) is a collection of points V, called nodes or vertices, connected by a set of lines E, called edges. Without specification, these edges have no preference for direction, so a directed graph is a graph where the edges are pointed in a certain direction from one node to another. Thus, for directed graphs, each edge in E is an ordered pair of vertices such that the pair (i, j) represents an edge from i to j. Every node has a value called its degree, which represents the total number of edges attached to it. Formally, the degree of a vertex of a graph is the number of edges incident to the vertex. For directed graphs, this definition can be broken down further into in-degree and out-degree, which represent the number of edges directed toward the vertex and away from it, respectively.

Recall that the goal of this paper is to create an algorithm for the classification of 4 aligned DNA sequences into certain network structures, but to understand how we classify, we must first discuss the structures themselves. In graph theory, a tree is a graph with


no cycles, and in phylogenetics, we are often working to reconstruct structures called phylogenetic trees. Phylogenetic trees are the most basic structure that we will work with (Figure 1), and they can be thought of as visual reconstructions of the path that a species (or group of species) has taken during the course of its evolution. In a phylogenetic tree, all edges point away from a single special vertex called the root. This root is always unique and has an out-degree of 2. The leaves of phylogenetic trees, nodes with in-degree 1 and out-degree 0, represent the different organisms (or taxa) whose history we are attempting to reconstruct, and the inner nodes represent ancestors of said taxa [8][5]. We can imagine our tree as a directed graph with all edges pointing away from the root towards the leaves. Additionally, phylogeneticists often restrict themselves to the set of binary, or bifurcating, trees. In a binary tree, every vertex other than the root or a leaf has in-degree one and out-degree two. Finally, sometimes the branch lengths of the tree (the lengths of the edges of the graph) can represent time. However, in this project, we are interested in estimating just the graph structure and not the branch lengths, so the lengths of the edges in our trees (and eventually networks) are arbitrary and do not carry this significance.

Tree structures are a subset of more general phylogenetic structures called phylogenetic networks. These structures are defined as rooted acyclic directed graphs, without parallel edges, that follow these principles:

• The root has out-degree two.

• The only vertices with out-degree zero are the leaves, and each leaf has in-degree one.

• All other vertices have in-degree one and out-degree two, or in-degree two and out-degree one [6].

Again, the main difference that separates phylogenetic trees from networks is the presence of reticulation edges and vertices. Reticulation vertices are representations of reticulation events, like hybridization or HGT. A reticulation vertex is a node with in-degree 2 and out-degree 1. Edges that are directed into reticulation vertices are called reticulation edges (Figure 3). Trees cannot have these vertices, but networks may. The main property that makes these reticulation vertices different from normal vertices is that a site may evolve by traversing one reticulation edge, or the other, but not both. This means that, for a given site, its evolutionary history must contain only one reticulation edge. Examine Figure 3: every site will evolve according to T1 or T2, but not both. A single cycle network is a network with a single reticulation vertex. Such networks will always split into two trees, creating a structural hierarchy; in essence, all single cycle networks can be thought of as mixtures of one or two tree topologies whose branch lengths are determined by the original network. Keep this in mind as we begin to talk about classification, as this hierarchy will make separation of classes difficult.


In the statistical setting that we will be working in, the location of the root of a phylogenetic tree is considered unidentifiable, meaning that the true location cannot be recovered even with perfect data. The best that can be done in this setting is the reconstruction of the unrooted tree topology. For phylogenetic networks, the best that can be reconstructed is the semi-directed topology [6], which is the topology obtained by un-rooting a phylogenetic structure and un-directing all non-reticulation edges. In this work, we will be representing our topologies in their semi-directed form.

It is also worth noting that while network structures are more versatile in their representations, they make the work of reconstruction more difficult. When one is considering trees, there are a finite number of n-leaf trees; when working with networks, there are an infinite number of n-leaf networks. We address this problem by only looking at a subset of the possible networks, instead of the space of all networks.

All of the phylogenetic networks we consider in this work are tree-child networks. This means that no child of a reticulation vertex is also a reticulation vertex. We further restrict our attention to a subclass of tree-child networks called single cycle networks, or networks with a single reticulation vertex. This ensures that we are dealing with a finite number of topologies when it comes to classification, making the use of classification algorithms like KNN feasible.

Given a fixed tree or network, the evolution of each site can be modeled as a Markov process on that tree or network. Let's begin with the math involved in fitting a tree model.

3.1 Tree-Based Markov Models

To visualize our problem, let's examine four species with the following aligned snippets of DNA:

Species 1: ... C C C G T G G T G G C C ...

Species 2: ... A C C C T A G G A A C A ...

Species 3: ... G C C A T G G G G C C A ...

Species 4: ... A T A T T T G C A T C C ...

In phylogenetics, our goal is to reconstruct ancestral histories, and the differences in each site of the DNA are the keys to their reconstruction. Let's say, for example, that the lineage of the DNA site highlighted in red evolved following the tree in Figure 4.

We would describe the pathing of this tree as a split happening at the root, where one ancestral species kept the C at the site with some probability $P^{(1)}_{C,C}$, and another mutated that site into a G with probability $P^{(2)}_{C,G}$. We would then read the split for each subsequent pair of species in a similar manner.

In Figure 4, each edge $e$ is labeled with its corresponding transition probability matrix $P^{(e)}$. The $(i,j)$th entry of $P^{(e)}$, denoted $P^{(e)}_{i,j}$, is the probability of transitioning from nucleotide $i$ to nucleotide $j$ along edge $e$. We only have four possible options for sites, so any


Figure 4: Example evolutionary path of a site. $P^{(i)}$ is a 4x4 matrix that models the transition probabilities, or the probability of mutating the parent nucleotide into the child.

given $P^{(e)}$ will take the form:

$$P^{(e)} = \begin{pmatrix} P^{(e)}_{A,A} & P^{(e)}_{A,C} & P^{(e)}_{A,T} & P^{(e)}_{A,G} \\ P^{(e)}_{C,A} & P^{(e)}_{C,C} & P^{(e)}_{C,T} & P^{(e)}_{C,G} \\ P^{(e)}_{T,A} & P^{(e)}_{T,C} & P^{(e)}_{T,T} & P^{(e)}_{T,G} \\ P^{(e)}_{G,A} & P^{(e)}_{G,C} & P^{(e)}_{G,T} & P^{(e)}_{G,G} \end{pmatrix}$$

Since we assume that the mutations are independent events, the probability of observing the states at each node as we did in Figure 4 is:

$$\pi_C \cdot P^{(1)}_{C,C} \cdot P^{(2)}_{C,G} \cdot P^{(3)}_{C,C} \cdot P^{(4)}_{C,G} \cdot P^{(5)}_{G,G} \cdot P^{(6)}_{G,A}$$

where $\pi_C$ is the probability of observing a C at the root.

One thing we assume is that all of the sites in a set of aligned DNA sequences evolve according to the same tree or network. So while this example is looking only at one site, every site is a data point that can help us reconstruct the structure of the same phylogenetic object.

But that is just an example tree. In reality, we will not know what the nucleotides of the internal nodes were. However, that will not stop us from being able to calculate the probability of observing the sites that we have, given a certain tree. To do that, we sum the probabilities over all possible combinations of states of the internal nodes. In the example tree above, it would mean summing the probability that the root started as A, C, G, or T, then transitioned from that root to any A, C, G or T for its children, then that each child transitioned again to the sites we see today. Mathematically, the probability of observing C, G, G, and A at the four distinct leaves would be:


$$\sum_{i\in\{A,C,G,T\}}\ \sum_{j\in\{A,C,G,T\}}\ \sum_{k\in\{A,C,G,T\}} \pi_i \cdot P^{(1)}_{i,j} \cdot P^{(2)}_{i,k} \cdot P^{(3)}_{j,C} \cdot P^{(4)}_{j,G} \cdot P^{(5)}_{k,G} \cdot P^{(6)}_{k,A}$$

This procedure allows us to fully describe a distribution on $\{A,C,G,T\}^4$, which is parameterized by:

• A root distribution vector $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$

• A transition matrix P (e) for every edge.

• A labeled tree.

Given a 4-leaf tree T, the model associated to T is the collection of all feasible distributions generated in this manner.
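The marginalization above can be carried out by brute force, which also serves as a sanity check on the parameterization. The sketch below (Python; the matrices are arbitrary made-up row-stochastic stand-ins for the $P^{(e)}$, and all names are hypothetical) confirms that the 256 pattern probabilities form a distribution:

```python
import itertools
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}
rng = np.random.default_rng(1)

def random_transition_matrix():
    """An arbitrary 4x4 row-stochastic matrix standing in for some P^(e)."""
    M = rng.random((4, 4))
    return M / M.sum(axis=1, keepdims=True)

# One transition matrix per edge, numbered 1..6 as in the text:
# edges 1, 2 are internal; 3, 4 hang below j; 5, 6 hang below k.
P = {e: random_transition_matrix() for e in range(1, 7)}
pi = np.full(4, 0.25)  # root distribution

def pattern_probability(pattern):
    """P(leaf pattern) by summing over root state i and internal states j, k."""
    x1, x2, x3, x4 = (IDX[b] for b in pattern)
    return sum(pi[i] * P[1][i, j] * P[2][i, k]
               * P[3][j, x1] * P[4][j, x2] * P[5][k, x3] * P[6][k, x4]
               for i, j, k in itertools.product(range(4), repeat=3))

# The 256 pattern probabilities form a distribution on {A,C,G,T}^4.
total = sum(pattern_probability(p) for p in itertools.product(BASES, repeat=4))
print(round(total, 6))  # 1.0
```

For four leaves this brute-force triple sum is cheap; for larger trees one would instead use a dynamic-programming traversal rather than enumerating all internal states.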

In general, the models associated to distinct n-leaf trees are distinct (up to the placement of the root). Thus, we can determine the best fitting tree based on which model the empirical distribution on $\{A,C,G,T\}^4$ (i.e. the vector of relative frequencies) 'looks like' it came from. Let's re-do this example on a friendlier space in order to get a better picture. Let's not look at a 4-leaf tree, but instead at a claw tree (Figure 5).

For the 3-leaf claw tree, the probability of observing $X_1$, $X_2$ and $X_3$ at the leaves is equal to the sum over all possible internal node paths. Since we are using a claw tree, there is only one internal node (the root), so our probability of observing $X_1$, $X_2$ and $X_3$ is:

$$\sum_{i\in\{A,C,G,T\}} \pi_i M^1_{i,X_1} M^2_{i,X_2} M^3_{i,X_3} = \pi_A M^1_{A,X_1} M^2_{A,X_2} M^3_{A,X_3} + \pi_C M^1_{C,X_1} M^2_{C,X_2} M^3_{C,X_3} + \dots + \pi_T M^1_{T,X_1} M^2_{T,X_2} M^3_{T,X_3}$$

Figure 5: Example of a claw tree with nucleotide leaves $X_1$, $X_2$, $X_3$, root $Y$, and transition matrices $M^1$, $M^2$, $M^3$.


Different papers use different transition matrices based on different biological schools of thought. Generally, biologists agree that DNA probably follows one of a number of complex models, but for the sake of simplicity and example, we will be using the simple Jukes-Cantor model for our transition matrices.

3.1.1 The Jukes-Cantor Model

The Jukes-Cantor model is the simplest mutation model possible, as it assumes that all mutations along a given edge occur at the same rate and are independent across all sites. Thus, if we focus on one site, the probability of mutation along an edge $e$ is constant, and that mutation has an equal probability of resulting in any one of the other three nucleotide options [5]. If we assign the probability of mutation as $\sigma$, and if $P^{(e)}_{i,j}$ is the probability of becoming nucleotide $j$ starting with nucleotide $i$, then the Jukes-Cantor model can be summarized with the transition matrix:

$$P^{(e)} = \begin{pmatrix} 1-\sigma & \sigma/3 & \sigma/3 & \sigma/3 \\ \sigma/3 & 1-\sigma & \sigma/3 & \sigma/3 \\ \sigma/3 & \sigma/3 & 1-\sigma & \sigma/3 \\ \sigma/3 & \sigma/3 & \sigma/3 & 1-\sigma \end{pmatrix}$$

The probability that $n$ specified sites out of $m$ mutate to particular bases along an edge is $(\sigma/3)^n(1-\sigma)^{m-n}$. While it is true that other models of mutation can be more representative of the actual biological processes (the Kimura model, for example), the Jukes-Cantor model has the benefit of being highly interpretable and symmetric [9]. Because of its symmetry, it does not matter if we treat the trees as rooted or unrooted [6]. This is important, as the tree and network structures that we are trying to classify our data into are all unrooted. The values of $\sigma$ are dependent on the length of the edge $e$. In this project, the package we use to simulate trees [10] automatically calculates these values based on the branch lengths of the tree we feed it.

3.2 Network-Based Markov Models

Network models are similar to tree models, only more complex. Recall that single cycle networks, networks with a single reticulation vertex, can be regarded as tree structures with one additional edge. Due to this combinatorial structure, single cycle networks can be thought of as mixtures of two tree models (one for each possible collection of evolutionary paths a single site can take), with some mixing parameter determining how much each component tree contributes to the network. Thus the math behind calculating the probability of observing a given n-tuple in a network is very similar to that of the tree. Given a single cycle network, and calling the two reticulation edges $e_{r_1}$ and $e_{r_2}$, each site can only follow either $e_{r_1}$ or $e_{r_2}$ down the network. Thus, if we assume that each site chooses


any one given edge with probability $\gamma$, then we can view the network distribution as a mixture of the two component tree distributions, where $\gamma$ is called the mixing parameter.

Figure 6: The different colors represent groups that share composite trees. The lines represent a nesting. Note that 3-cycles share a common topology amongst their composite trees and they only differ in branch lengths.

Let N be a single cycle network on n leaves with component trees $T_1$ and $T_2$. Let $\theta$ be a collection of transition probability matrices, one for each edge of N, and let $\theta_1$, $\theta_2$ be the collections of transition probability matrices of $\theta$ that correspond to the edges in $T_1$ and $T_2$. If we let $P_{T_1,\theta_1}$ and $P_{T_2,\theta_2}$ be the distributions on $\{A,C,G,T\}^n$ arising from the corresponding Markov processes, then the distribution arising from N and the matrices $\theta$ is:

$$P_{N,\theta} = \gamma P_{T_1,\theta_1} + (1-\gamma) P_{T_2,\theta_2}.$$

If we fix N and allow the transition matrices $\theta$ to vary, then we obtain the model associated to N. Note that each tree model is a submodel of a network model if the tree is one of the network's component trees. Again, we see the nested classification problem that will make our task much more difficult. If we were to allow $\gamma$ to approach 1 or 0, then the model would approach one of its component tree models. A more in-depth look into the details and the derivation of network models can be found in [6]. The hierarchy of nestings between all possible topologies is visualized in Figure 6.
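The mixture formula is just a convex combination of the two component tree distributions. A toy sketch (made-up numbers standing in for $P_{T_1,\theta_1}$ and $P_{T_2,\theta_2}$; all names hypothetical) illustrates both the mixture and the collapse to a tree model at the boundary value of $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for P_{T1,theta1} and P_{T2,theta2}: two arbitrary
# distributions on the 4^4 = 256 possible site patterns.
p_t1 = rng.random(256); p_t1 /= p_t1.sum()
p_t2 = rng.random(256); p_t2 /= p_t2.sum()

gamma = 0.7  # mixing parameter: weight of the first component tree
p_net = gamma * p_t1 + (1 - gamma) * p_t2

print(p_net.sum())  # sums to 1: the mixture is still a distribution
# At gamma = 1 the network model reduces exactly to the T1 tree model:
print(np.allclose(1.0 * p_t1 + 0.0 * p_t2, p_t1))  # True
```

This collapse at the boundary is precisely the nesting that makes the classification problem hard: near $\gamma = 0$ or $\gamma = 1$, network data is nearly indistinguishable from tree data.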

3.3 Phylogenetic Invariants

One method for selecting the best tree is based on phylogenetic invariants. Given a tree and a DNA substitution model (e.g. the Jukes-Cantor model), a phylogenetic invariant is a polynomial that evaluates close to zero if the data came from the specified tree-based Markov model. The method of using phylogenetic invariants for tree selection was first suggested independently by Lake [11] and by Cavender and Felsenstein [12] in the late 1980s.

Returning to probabilities, and assuming the Jukes-Cantor model, we will notice something about the pattern probabilities. Let's use the claw tree and determine the probability of observing (A,A,A) and (C,C,C) under the Jukes-Cantor model:

$$P(A,A,A) = \sum_{i\in\{A,C,G,T\}} \pi_i M^1_{i,A} M^2_{i,A} M^3_{i,A}$$

$$P(C,C,C) = \sum_{i\in\{A,C,G,T\}} \pi_i M^1_{i,C} M^2_{i,C} M^3_{i,C}$$


Since we are using the Jukes-Cantor model, the transition matrices have a particular form and we only have to consider whether or not our nucleotides are changing, turning our probabilities into:

$$P(A,A,A) = \pi_A(1-\sigma_1)(1-\sigma_2)(1-\sigma_3) + \sum_{i\in\{C,T,G\}} \pi_i\,\frac{\sigma_1}{3}\cdot\frac{\sigma_2}{3}\cdot\frac{\sigma_3}{3}$$

$$P(C,C,C) = \pi_C(1-\sigma_1)(1-\sigma_2)(1-\sigma_3) + \sum_{i\in\{A,T,G\}} \pi_i\,\frac{\sigma_1}{3}\cdot\frac{\sigma_2}{3}\cdot\frac{\sigma_3}{3}$$

Furthermore, if we make the assumption that all nucleotide bases are equally likely in the root distribution ($\pi_i = \pi$ for all $i$), then the probabilities become equal:

$$P(A,A,A) = P(C,C,C) = \pi(1-\sigma_1)(1-\sigma_2)(1-\sigma_3) + 3\pi\,\frac{\sigma_1\sigma_2\sigma_3}{27}$$

$$P(A,A,A) - P(C,C,C) = 0$$

This extends to the 4-leaf trees as well, giving us linear polynomial equations such as $P(A,A,A,A) - P(C,C,C,C) = 0$ or $P(A,A,A,C) - P(G,G,G,C) = 0$. These linear equations form 15 groups of equal probabilities called equivalence classes, and they appear due to the symmetric nature of the Jukes-Cantor model. Because these relationships are true across all trees, they are used to reduce the dimension of the coordinates. There are also non-linear relationships that are true only for some trees and not others, and these are used for classification. We can group these non-linear relationships and use them to form a set of polynomials that will evaluate to 0 if a quartet came from a certain tree or network structure. Due to the work of [6], we know that we can find unique collections of phylogenetic invariants for each of the 4-leaf semi-directed networks pictured in Figure 7.
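The vanishing of the linear invariant $P(A,A,A) - P(C,C,C)$ on the claw tree can also be verified numerically. The sketch below (hypothetical helper names; Jukes-Cantor matrices with a uniform root distribution) evaluates both probabilities and their difference:

```python
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def jc_matrix(sigma):
    # Jukes-Cantor: off-diagonal entries sigma/3, diagonal 1 - sigma
    P = np.full((4, 4), sigma / 3)
    np.fill_diagonal(P, 1 - sigma)
    return P

def claw_probability(x1, x2, x3, M1, M2, M3, pi):
    """P(X1, X2, X3) on the 3-leaf claw tree: sum over the root state."""
    return sum(pi[i] * M1[i, IDX[x1]] * M2[i, IDX[x2]] * M3[i, IDX[x3]]
               for i in range(4))

pi = np.full(4, 0.25)  # uniform root distribution
M1, M2, M3 = jc_matrix(0.1), jc_matrix(0.25), jc_matrix(0.4)

p_aaa = claw_probability("A", "A", "A", M1, M2, M3, pi)
p_ccc = claw_probability("C", "C", "C", M1, M2, M3, pi)
print(abs(p_aaa - p_ccc))  # ~0: the linear invariant vanishes
```

The edge-specific mutation probabilities (0.1, 0.25, 0.4) are arbitrary; by the symmetry of the Jukes-Cantor model, the difference vanishes for any choice.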

Figure 7: All possible 4-leaf single cycle phylogenetic networks [6].

These collections of polynomials can be computed using a computer algebra system such as Macaulay2 [13]. For this project, rather than re-derive them, we will be using an existing collection computed for [6]. The full list of these polynomials can be found on our Github: https://github.com/Travis-Barton/Phylogenetic-Research.

Remark. In phylogenetics, there is a standard transform, called the Fourier-Hadamard transform, that is performed on the probability coordinates (P(A,A,A,A), P(A,A,A,C), etc.) that makes the model parameterization, and thus the computation of phylogenetic invariants, easier. This transform is a linear change of coordinates from probability space into Fourier coordinates, also referred to as q coordinates in the literature. We will use this standard transformation


and will refer to a transformed vector of relative frequencies as a q-vector. The description of this transform is beyond the scope of this thesis; for further reading on the Fourier-Hadamard transform and group-based models, please see Seth Sullivant's book Algebraic Statistics.

4 Supervised Learning

Now that we have discussed the various structures and data that underlie the background of this project, we can begin to discuss the classification methods. Since we are incorporating training data into our model, we can use a class of machine learning called supervised learning. Supervised learning, in statistics, is the creation of a model using training data where the correct output is known. This model can be used to predict the output of new data where the correct output is unknown [14]. As one obtains more/better data, one should expect the model to improve, hence the 'learning' aspect. The models made from supervised learning can be used for any number of tasks, including regression, classification, variable reduction, and more [15][16]. For this paper, we will be creating a classification model, so we will only focus on that side of supervised learning.

4.1 Classification

Classically, classification is the creation of a classification rule; data can be placed into respective categories based on the application of this rule [14]. This rule can be model-based, as in neural networks and logistic regression, or geometrically based, as in support vector machines, k-nearest neighbors, and random forests. Each method has its own advantages and disadvantages, which depend on the type/availability of data. We will be focusing on the two that are used for our problem, support vector machines (SVM) and k-nearest neighbors (KNN).

The problem we face is this: given a number of aligned DNA sequences of unknown structure, and a number of training points with known topology, what phylogenetic structure and topology does each quartet fall into?

While the original sequences obtained are categorical, we transform the data into truncated continuous sequences of length 15 and 406; the sequences of length 15 correspond to the transformed relative frequencies of each 4-tuple, while the sequences of length 406 correspond to the evaluation of the 406 phylogenetic invariants on the transformed relative frequencies. One of the properties of these units is their geometric significance, meaning that their locations in their respective N-dimensional spaces are significant in terms of their inherent properties.

4.2 K-Nearest Neighbors


Figure 8: Using KNN, and setting k = 3, the pink point would be classified as blue, as there are 2 blues and 1 green among its 3 closest neighbors.

K-nearest neighbors (KNN) is a classification technique that makes classification decisions based on the classes of the data around the point in question (Figure 8). This method is non-parametric, and thus builds no model whatsoever. The only requirements of this technique are the presence of training data to compare test points against, a meaningful distance metric, and a number of points (k) to examine (note that k ≤ N, where N is the total number of training observations [17]). Distance can be defined in many ways, such as Euclidean distance, taxicab/Manhattan distance, cosine similarity, and more, but for this project, we will be using Euclidean distance.

The full algorithm is described in 4.2.1, but in short, the distances between the test point and all training points are computed, and then the labels of the nearest k training points are used to determine the label of the test point via a majority vote. Ties are broken randomly.

KNN has a number of advantages, the first being its interpretability. It is easy to understand that a certain point was classified as a certain class because its neighbors were also that class. It is by far the most intuitive/simple machine learning technique. Also, since it is not model based, it only requires the presence of training data. As such, there is no need to adjust pre-made models to fit new data, and there is only one parameter to tune (k) [14].

Unfortunately, some of these advantages can also be disadvantages. Because there is no model, every new run of the algorithm requires a nearest neighbor search. With a large training set or a large testing set, this can quickly become infeasible or tedious. Some methods can greatly shorten this time: kd-tree and cover-tree searches structure the data in such a way that only a select few distances need to be measured rather than all of them, though any nearest neighbor search remains a relatively slow process [18][19]. To reduce our algorithm's complexity, we use a kd-tree search implemented by the FNN (Fast Nearest Neighbor) package in R [20].
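To make the kd-tree idea concrete, here is a minimal Python sketch of a build-and-query routine; it is not the FNN implementation used in the thesis, and the names and structure are illustrative only. The saving comes in the search step: the far subtree is visited only when the splitting plane is closer than the current best distance, so most distances are never computed.

```python
def build_kdtree(points, depth=0):
    """Minimal k-d tree: recursively split on alternating coordinates,
    storing the median point at each node."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Branch-and-bound nearest-neighbor search: descend toward the
    target, then revisit the far side only if the splitting plane is
    closer than the best squared distance found so far."""
    if node is None:
        return best
    point = node["point"]
    dist2 = sum((a - b) ** 2 for a, b in zip(point, target))
    if best is None or dist2 < best[0]:
        best = (dist2, point)
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], target, depth + 1, best)
    if diff ** 2 < best[0]:  # hypersphere crosses the splitting plane
        best = nearest(node[far], target, depth + 1, best)
    return best
```

Querying `nearest(build_kdtree(training_points), test_point)` returns the squared distance and coordinates of the closest training point.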

4.2.1 KNN Algorithm

The formal algorithm for KNN is as follows:


Algorithm 1: General Structure of KNN Algorithm

1 Input: Test data, training data;
2 Output: Classified test points (Classes);
3 Dist = distance matrix of test data to training data;
4 Classes = empty list of predicted classes;
5 for i = 1 to n do
6     neighbors = labels of the k training points nearest to test point i (the k smallest entries of row Dist[i, ]);
7     Classes = append(Classes, vote(neighbors));
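The steps above can be sketched in plain Python (the thesis's actual implementation is in R with the FNN package; this brute-force version is for illustration):

```python
import math
import random
from collections import Counter

def knn_classify(test_points, train_points, train_labels, k=3, rng=None):
    """Algorithm 1 in plain Python: Euclidean distances to every
    training point, the labels of the k nearest, majority vote,
    ties broken randomly."""
    rng = rng or random.Random(0)
    predictions = []
    for x in test_points:
        dists = sorted(
            (math.dist(x, t), lab)
            for t, lab in zip(train_points, train_labels)
        )
        votes = Counter(lab for _, lab in dists[:k])
        top = max(votes.values())
        winners = [lab for lab, c in votes.items() if c == top]
        predictions.append(rng.choice(winners))  # random tie-break
    return predictions
```

For example, with training points `[(0, 0), (0, 1), (5, 5), (5, 6)]` labeled `["blue", "blue", "green", "green"]` and k = 3, the test point `(0, 0.5)` is classified as `"blue"`.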

4.3 Support Vector Machines

Figure 9: Using SVM, we are able to determine the optimal linear decision boundary to separate our two classes. Points above the line are classified as blue, while points below are classified as red.

SVM is a classification technique whose goal is to find the optimal separating boundary (the one with the widest margin) between groups in training data. It can also be used for regression, but since we are using it for a classification problem in this paper, we will focus only on classification.

As an example, let us imagine we have two classes; we will describe how to expand this idea to multiclass classification later on. Given a data set x and the classifications c of each observation in x, simple SVM compares all hyperplanes w · x + b = 0 that follow the separation rule w · x + b ≥ 1 for all points in c1 and w · x + b ≤ −1 for all points in c2 (Figure 9). Because of these constraints, our margin (the space between the classes) will be exactly 2/||w||. More space means the classes will be easier to separate, so our goal will be to minimize ||w|| [17].

We can find this hyperplane via Lagrange multipliers. The goal of SVM can be summarized as the following [17, 21]:

min over w, b of ||w||²/2, subject to yi(w · xi + b) ≥ 1.

In the above, yi is +1 for class 1 and −1 for class 2 for any given xi, and w is a vector orthogonal to our hyperplane.

Thus our Lagrangian problem becomes:

L = ||w||²/2 − ∑i αi [yi(w · xi + b) − 1]


∂L/∂w = w − ∑i αi yi xi = 0  →  w = ∑i αi yi xi

Because of this, we know that w is a linear combination of the samples.

∂L/∂b = −∑i αi yi = 0  →  ∑i αi yi = 0

Plugging these values back into our Lagrangian equation we have:

L = (1/2)(∑i αi yi xi) · (∑j αj yj xj) − (∑i αi yi xi) · (∑j αj yj xj) − b ∑i αi yi + ∑i αi

which can be simplified to:

L = ∑i αi − (1/2) ∑i ∑j αi αj yi yj (xi · xj)

This is very important, as it tells us that our optimization depends only on dot products of the data. What is more, our constraints describe a convex region, and our problem is a quadratic optimization problem; thus, there must be a unique global solution. This global solution will appear where:

max over α1, ..., αn of ∑i αi − (1/2) ∑i ∑j αi αj yi yj (xi · xj)  [21]

This solution has limits: it is only applicable to binary classification problems where there is a linear separation boundary. To allow for outliers, a technique called 'soft margin SVM' is used, in which we allow minor violations of the constraints. Our new constraints become yi(w · xi + b) ≥ 1 − εi, where εi ≥ 0. Points where εi = 0 lie on the ideal side of our margin, while points where εi > 1 are on the far side of the hyperplane. Points where εi ∈ (0, 1) are still on the right side of the hyperplane, but inside the margin. Much of the math is the same (or very similar), so further reading on soft margin SVM can be found in [17].
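Written out, the standard soft-margin problem (the form behind the 'soft margin value' used later in Section 5; the cost parameter C is not named explicitly in the text above, so it is included here for reference) is:

```latex
\min_{w,\,b,\,\varepsilon} \;\; \frac{\lVert w \rVert^2}{2} + C \sum_i \varepsilon_i
\qquad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \varepsilon_i,
\quad \varepsilon_i \ge 0.
```

Larger C penalizes margin violations more heavily, recovering hard-margin SVM in the limit.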

The problem of converting binary SVM to multiclass SVM can be solved with two methods: one-vs-one or one-vs-all. One-vs-one performs binary SVM on each pair of classes and makes classification decisions based on a voting scheme. One-vs-all isolates one class at a time and performs SVM to separate that class from all other points in the data; it then also uses a voting scheme to decide final classifications [17]. The package that we use for SVM (e1071) implements a one-vs-one approach [22].
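The one-vs-one voting scheme can be sketched as follows; the toy pairwise decision rules below are purely illustrative stand-ins for fitted binary SVMs:

```python
from collections import Counter

def one_vs_one_predict(pairwise_rules, x):
    """One-vs-one voting: each fitted binary classifier (one per pair
    of classes) casts a vote, and the most-voted class wins."""
    votes = Counter(rule(x) for rule in pairwise_rules.values())
    return votes.most_common(1)[0][0]

# Toy pairwise decision rules on a 1-D input (hypothetical, standing
# in for binary SVM decision functions).
rules = {
    ("A", "B"): lambda x: "A" if x < 2 else "B",
    ("A", "C"): lambda x: "A" if x < 4 else "C",
    ("B", "C"): lambda x: "B" if x < 4 else "C",
}
```

For the input 3, the pairwise votes are B (A-vs-B), A (A-vs-C), and B (B-vs-C), so the final label is "B".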


5 The Decision Algorithm

The code required to simulate, train, and implement our model, including examples, can be found on the GitHub repository https://github.com/Travis-Barton/Phylogenetic-Research. The specifics of the algorithm are listed in Section 5.2, but at a high level, our algorithm first uses KNN to determine whether each test point is a tree, a 3-cycle, or a 4-cycle. This is done using the q-vectors of each data point, that is, the transformed vectors of relative frequencies. Our default k is set to 3 due to empirical results, and we use a simple majority-rule voting scheme (ties are broken randomly) with no kernel. Once a structure has been established, we transform every test point into its invariants score vector. Recall that the invariants score vector is the list of values obtained by evaluating every phylogenetic invariant for all twenty-one structures on the data point. We then feed the invariants score vector into the appropriate pre-trained SVM model (one for each topology, i.e. tree, 3-cycle, and 4-cycle). All three SVM models are trained using our pre-loaded data (see 5.3), a soft margin value of 1, and a linear kernel. The result from the SVM model is our final classification, which decides not only whether a given point is a tree, 3-cycle, or 4-cycle, but also which specific labeling of that topology it is. There are 3 different labeled tree topologies, 6 different labeled 3-cycle topologies, and 12 different labeled 4-cycle topologies; all in all, that makes 21 possible classifications. All 21 of these structures can be seen in Figure 7.

5.1 Data Structure

Once simulated and transformed, our sets of aligned DNA sequences will take the form of q-vectors. These q-vectors (see Section 2.3.2) are length-15 vectors, but are reduced to length 14 when used, because after the Fourier transform the first coordinate is constant and thus useless for classification. The creation of the test data in our simulations requires the following parameters:

Parameter        Description                                          Values in Standard Training Set
Branch Lengths   The range of possible lengths that the edges         BL ∈ (.1, .4)
                 connecting our nodes can take.
γ                The mixing parameter. It determines how much of      γ ∈ [.4, .6]
                 each composite tree makes up a network.
Sites            The length of each nucleotide sequence.              Sites = 1,000
Size             The number of aligned sequences to sample from       Size = 1,000
                 each labeled topology.


There are 5 edges in a four-leaf tree, and thus 5 branch length parameters are chosen in the tree case. For 3-cycle and 4-cycle networks, there are 8 edges in each network, and 8 parameters are chosen. These branch lengths are used to calculate the α and β parameters of the transition probability matrix along each edge.
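As an illustration of how a branch length maps to transition probabilities, the sketch below uses the Jukes-Cantor model as a stand-in (the thesis's exact substitution model is described earlier and may differ); α is the diagonal entry (no change) and β is each off-diagonal entry (a specific change):

```python
import math

def jc_branch_probabilities(t):
    """Jukes-Cantor transition matrix entries for a branch of length t
    (expected substitutions per site). Returns (alpha, beta), where
    alpha + 3*beta = 1. Hypothetical stand-in for the model used in
    the thesis."""
    beta = (1.0 - math.exp(-4.0 * t / 3.0)) / 4.0
    alpha = 1.0 - 3.0 * beta
    return alpha, beta
```

A branch of length 0 gives (α, β) = (1, 0): no substitutions at all; as t grows, β approaches 1/4 and the bases decorrelate.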

5.2 Algorithm Structure

The formal structure of our algorithm is as follows:

Algorithm 2: Classification Algorithm for Phyogenetic Structures

1 Input: Test point, training data, Tree SVM model, 3-cycle SVM model, 4-cycle SVM model;
2 Output: Classified sequences;
3 Determine whether the structure of the test point is tree, 3-cycle, or 4-cycle with KNN;
4 Convert the test point into its invariants score*;
5 if structure = tree then
6     Determine topology with Tree SVM model;
7 else if structure = 3-cycle then
8     Determine topology with 3-Cycle SVM model;
9 else
10     Determine topology with 4-Cycle SVM model;

* Once a quartet has been translated into P-coordinates, we can plug it into our phylogenetic invariants, and the output becomes our 'invariants score'. This invariants score is what we use in our decision algorithm to determine which specific topology each data point falls into, once it has been determined to be a tree, 3-cycle, or 4-cycle network.
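The hierarchical dispatch of Algorithm 2 can be sketched as a few lines of Python; every callable here is a hypothetical stand-in for a fitted model, not the thesis's R implementation:

```python
def classify_quartet(q_vector, knn_structure, to_invariants, svm_models):
    """Hierarchical decision: KNN picks the coarse structure, then
    that structure's pre-trained SVM picks the labeled topology.
    knn_structure, to_invariants, and the entries of svm_models are
    placeholders for the fitted components."""
    structure = knn_structure(q_vector)   # 'tree', '3-cycle' or '4-cycle'
    scores = to_invariants(q_vector)      # invariants score vector
    return svm_models[structure](scores)  # final labeled topology

# Stub models, purely for illustration of the control flow.
stub_models = {
    "tree": lambda s: "tree 1",
    "3-cycle": lambda s: "3-cycle 2",
    "4-cycle": lambda s: "4-cycle 7",
}
```

Note the structure decision is final: a point routed to the wrong branch cannot be rescued downstream, which matches the hierarchical behavior discussed later.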

5.3 Training Data

Before we run our data through the algorithm, we first use the above parameters to simulate a set of robust training data that will hold for most of the sequence types we might see. To do this, we used the R package Phyclust and our own code to generate training data from trees, 3-cycles, and 4-cycles, in such a way that our algorithm still works with extreme branch lengths, heavily biased gammas, and low site counts [10]. See the results section for details on how we determined this range. Since there was no prior method for simulating networks, we created our own: by setting a range for the branch lengths and using Phyclust to simulate sequences from the network's component trees, we were able to generate sequences from both 3-cycle and 4-cycle networks [10]. Again, the code for this is on our GitHub.
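The mixing idea can be sketched as follows; this is a hypothetical per-site mixture for illustration (the thesis's actual code drives Phyclust in R, and the two callables here stand in for per-site draws from the component trees):

```python
import random

def simulate_network_sequence(draw_site_tree1, draw_site_tree2, gamma,
                              n_sites, rng=None):
    """Sketch of network simulation as a tree mixture: each site comes
    from component tree 1 with probability gamma, and from component
    tree 2 otherwise. draw_site_tree1/draw_site_tree2 are stand-ins
    for single-site simulators."""
    rng = rng or random.Random()
    return "".join(
        draw_site_tree1() if rng.random() < gamma else draw_site_tree2()
        for _ in range(n_sites)
    )
```

With γ = 1 the output is identical to simulating from component tree 1 alone, which is why heavily biased mixing parameters make networks resemble their dominant component tree.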

Our method of establishing a default training set with a broad enough range of parameters that both extreme values and mundane values are identifiable, and hard-coding it into our package, has the large benefit of being user-friendly, but it also comes at a cost. Intentionally including outliers and extreme scenarios hurts our overall accuracy (as there are more training points in the boundary areas where there should be hard margins); it will, however, aid our accuracy in the extreme cases. This trade-off of baseline accuracy for robustness is something we feel is worth the cost.

6 Simulations

To determine what the baseline training data should be, we compared results under a wide array of conditions until we found an optimal parameter set. All results that follow were run 50 times, with their average accuracies and variances recorded.
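The reporting scheme is simple to state in code; `run_once` below is a stand-in for one full train-and-test simulation returning an accuracy:

```python
import statistics

def repeated_accuracy(run_once, n_runs=50):
    """Run one train/test simulation n_runs times and summarize by
    mean accuracy and sample variance, matching how the results in
    this section are reported."""
    accuracies = [run_once() for _ in range(n_runs)]
    return statistics.mean(accuracies), statistics.variance(accuracies)
```

A deterministic simulation returning 0.84 every run would report a mean of 0.84 and a variance of 0.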

6.1 Optimal Conditions

Without dispute, our model runs best when the training data matches the testing data. This is not surprising in the least and is a common idea in statistics and machine learning. When we consider all parameter spaces, however, our algorithm with matching test data performs best under the following parameters:

BL     (.1, .3)
Gamma  .5
Sites  1000*
Size   2500*

* The larger the size and sites, the more accurate the model. 1000 and 2500 were chosen respectively as 'large enough' values.

When we test our algorithm using the above training data parameters, we obtain the confusion matrix in Figure 10.

6.2 Sub-Optimal Conditions

Using an accuracy of 84% as a baseline of success, let us examine what happens as we vary parameters away from the optimal training data. The graphs that follow are the results of individually changing the following parameters. Keep in mind that at this point, the training data parameters still match the testing data parameters.

Figure 11a shows what happens when we change the mixing parameter (γ) from .5 to (.25, .75). We see a lot of confusion, and accuracy drops significantly. This is because the algorithm is no longer able to distinguish between composite trees and network structures. This is reasonable, as a network that consists of 75% of one tree will very much resemble that component tree.

Figure 10: The results from ideal training data: Accuracy = 84%, Variance = 1.35 · 10−5

Figure 11b shows what happens when the branch length parameter is broadened drastically to (.05, .4). Allowing it to cover a wider breadth, and to veer near 0, disturbs the algorithm's ability to draw conclusions about the overall structure, producing more confusion between topologies.

Figure 11c shows what happens when the sites are lowered from 1000 to 100. The sites act as a sort of resolution: the more sites, the more continuous our relative frequencies appear. When we lower the sites, we end up with much more binning of results and less distinction. This produces confusion across both topologies and substructures, and is why the results in the figure are so poor.

Figure 11d shows what happens when we decrease the training data from size 2500 per topology to 100 per topology. Now, instead of 2500 training samples from which to learn the shape of trees, 3-cycles, and 4-cycles, our algorithm and pre-trained SVM models only have 100 examples to work from. This results in a less fitted model. While this will prevent over-fitting, 100 is too low to accurately develop the picture needed to capture the intricacies of the data.


(a) Changing Gamma: acc = 72%, var = 8 · 10−4
(b) Differing branch lengths: acc = 62%, var = 2.17 · 10−5
(c) Lowered sites: acc = 52%, var = 1.8 · 10−5
(d) Lowered training data size: acc = 76%, var = 2.513 · 10−5

Figure 11: Results from the above test parameters.

6.3 Robustness

When the training data differs from the testing data, the results can quickly become nonsensical. The following figure shows what happens when we use the following robust training parameters:

Branch Lengths  (.1, .4)
Gamma           (.4, .6)
Sites*          1000
Size*           2500

* Reducing sites and size does not make sense for model robustness, as less training data that is less refined can only hurt the model. Thus, Gamma and Branch Lengths are the only parameters that contribute to robustness.


but with testing data that looks like so:

Branch Lengths (.05, .1)

Gamma (.25, .75)

Sites 1000

Figure 12: Accuracy and variance are near 0 when the training data differs from the testing data.

The results (Figure 12) are even worse when we try to use our 'optimal training data'. It is worth noting that, as far as errors go, mistaking networks for trees is preferable to the alternative in the phylogenetic world (see Discussion).

Similarly, if we allow the branch lengths in the testing set to vary away from the training data in the other direction, say like so:

Branch Lengths (.4, .45)

Gamma (.25, .75)

Sites 1000


Figure 13: Accuracy is around 50% and the variance is slightly above 0 when the training data differs from the testing data.

then we see errors form on the other side of the spectrum, where the algorithm labels everything as a 3-cycle or a 4-cycle and neglects the trees entirely (Figure 13).

Because we have to expect a fair amount of variability in the data, we program this more robust training set into our package. While it may slightly hurt accuracy when the data follows a perfect form (short branches inside our range and gamma near .5), it will greatly increase our accuracy when we face outliers and extreme values.

7 Discussions

Before we get into the analysis of the parameter adjustments, attention must be drawn to the two symmetric patterns of error that persist throughout the results (Figure 14). These errors are to be expected, as the networks being confused have heavy overlap. The blue highlighted portions, for example, display 4-cycle networks being confused with other 4-cycle networks, but these networks share a composite tree topology, and thus will naturally be close together in space. Mistaking a small number of them for one another is an almost unavoidable consequence of the data's structure.

Figure 14: The green highlights the 3-cycle errors, and the blue highlights the 4-cycle errors.

The same is true of the green highlighted portion of 3-cycles. Those two arches are 3-cycles that also share composite tree topologies, and thus represent a natural mistake that should be expected to some degree. Figure 6 highlights the relationships between networks and their composite tree structures.

Further notice should be paid to the fact that the tree space is almost always perfectly classified. Once our algorithm correctly decides that a data point is a tree, it very seldom makes a mistake as to which tree topology it is. Figure 15 shows that even when the tree has minuscule branch lengths, the model is still able to determine the correct topology, implying that tree distinction is its easiest task. It is also worth noting that our algorithm is hierarchical in the sense that once it decides on a certain structure, it cannot change its mind: if a tree is misclassified as a 3- or 4-cycle, there is no saving that point. Our early work did explore the use of a classifier that was not hierarchical, but empirical results quickly showed that such an algorithm lowers our overall accuracy. There is some mystery as to why that is; our best guess for now is that there is inherent information in considering strictly topologies first and then overall structures that is lost when combining the two. Perhaps there is some overlap that can be broken apart in the space of structures that cannot be distinguished in the strict topology space. This question is unanswered, however, and will be a focus of future work.

During this project, we originally planned on a single method as our classifier, but throughout early testing we consistently obtained results suggesting that KNN on the q-vectors was the best method for obtaining the general classification of the data. There is some mystery as to why this is, but we have a few theories. First, we believe that the q-vectors are grouped in such a way that makes separating hyperplanes difficult; these results persisted even when we attempted kernel SVM, suggesting a very odd, and perhaps disjointed, shape. Because of KNN's simplicity, it may be able to answer these questions of membership more easily than other rule-based methods like SVM or Random Forest (a candidate that was determined to be inaccurate early on in our work). We obtained comparable results with KNN and SVM when it came to topology distinction, but because SVM is model-based and interpretable, we decided to use KNN only when it was clearly the better choice. Doing this reduces run-time dramatically, as there is no need for several consecutive nearest neighbor searches. As to why KNN with the q-vectors works best, that is still an unanswered question that will be a focus of future work.


Figure 15: In the above test, we ran sequences generated from tree 1 through our SVM model for trees while varying its branch lengths. It only did poorly when either both control parameters were near 0, or when the branch length pertaining to the root was near 0.

Another question might arise about why we need the invariant scores at all. While the q-vectors are standard practice in phylogenetics, there is strong theoretical evidence that invariant scores should behave in a way that makes classification easy [6]. The fact that they should evaluate to zero given the proper topology means that they should arrange themselves in a manner that makes a linear separation boundary completely feasible. While these scores do slightly outperform the q-vectors in terms of distinct topology classification, they do not outperform them by as much as we would have thought. There are many reasons why this might be, but they are outside the scope of this paper. As new methods of calculating and treating these invariant scores appear, our algorithm will be updated to reflect those advances.

Finally, the decision of how to set the parameters for the embedded training data was purely empirical. The above results show that we have established the optimal range for both the training and testing set, but since we will not be able to pick the testing set, we structured our parameters so that they still return reasonable results even when the testing data differs by a reasonable amount. This is an attempt to make the algorithm as robust as possible, but there will still be areas where it cannot perform well. Luckily, those areas are the complete extremes. For example, our algorithm will not do well if the branch lengths of a given set of trees are extremely small. But with extremely small branch lengths, the data must very much resemble a point in space rather than a tree, and will not be given much opportunity to mutate; obtaining a phylogenetic tree with such a structure is highly unlikely. The algorithm will also perform poorly under extremely biased mixing parameters. This is less of a problem than it seems, however, because our networks are mixtures of tree models. If a network model has extremely biased mixing parameters, then it will behave very much like a tree structure, and thus can be reasonably approximated by the tree with the higher contributory role. Thus an improper classification will not likely lead to a bad model approximation in terms of practical use.

I would like to note here that our choice of sites was a matter of computational efficiency. If we were able to use a high-powered computer consistently, and were not constrained by time, we would use many more than 1000 sites. We chose that value because it is the smallest number of sites that does not cause a drastic drop-off in accuracy (Figure 16).


8 Conclusions

Figure 16: The accuracy of our tests where the training and test data follow our 'optimal parameters' and the number of sites is increased from 100 to 10000.

Our algorithm is capable of classifying aligned DNA sequences into one of 21 different categories spread over 3 different network topologies. It does so with two different machine learning techniques (KNN and SVM) and comes with a built-in training data generator, so that no labeled training data is needed. Our built-in generator's parameters are chosen with care so that it is robust to outliers and extreme values. In the future, we would like to improve this algorithm by allowing for the classification of networks with multiple reticulation vertices in addition to our current models. We would also like the algorithm not to depend on KNN, as this method is expensive in terms of both computation and storage. We would also like to dive deeper into our algorithm's behavior and understand why certain attributes behave contrary to how we would expect. Finally, we would like to create an R package on CRAN so that biologists and phylogeneticists alike are able to use it with ease.

References

[1] P. J. Keeling and J. D. Palmer, "Horizontal gene transfer in eukaryotic evolution," Nature Reviews Genetics, vol. 9, p. 605, Aug. 2008.

[2] V. Kubyshkin, C. G. Acevedo-Rocha, and N. Budisa, "On universal coding events in protein biogenesis," Biosystems, vol. 164, pp. 16–25, 2018.

[3] C. L. Hubbs, "Hybridization between fish species in nature," Systematic Zoology, vol. 4, no. 1, pp. 1–20, 1955.

[4] N. R. Pace, J. Sapp, and N. Goldenfeld, "Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life," Proceedings of the National Academy of Sciences, vol. 109, no. 4, pp. 1011–1018, 2012.

[5] MIT OpenCourseWare, "16. Learning: Support vector machines," Jan. 2014.

[6] E. Gross and C. Long, "Distinguishing phylogenetic networks," SIAM Journal on Applied Algebra and Geometry, vol. 2, no. 1, pp. 72–93, 2018.

[7] E. S. Allman and J. A. Rhodes, "Phylogenetic invariants," 2007.

[8] E. S. Allman and J. A. Rhodes, "Molecular phylogenetics from an algebraic viewpoint," Statistica Sinica, pp. 1299–1316, 2007.

[9] J. P. Huelsenbeck, "Performance of phylogenetic methods in simulation," Systematic Biology, vol. 44, no. 1, pp. 17–48, 1995.

[10] W.-C. Chen, Overlapping Codon Model, Phylogenetic Clustering, and Alternative Partial Expectation Conditional Maximization Algorithm, 2011.

[11] J. A. Lake, "A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony," Molecular Biology and Evolution, vol. 4, no. 2, pp. 167–191, 1987.

[12] J. A. Cavender and J. Felsenstein, "Invariants of phylogenies in a simple case with discrete states," Journal of Classification, vol. 4, no. 1, pp. 57–71, 1987.

[13] D. R. Grayson and M. E. Stillman, "Macaulay2, a software system for research in algebraic geometry." Available at http://www.math.uiuc.edu/Macaulay2/.

[14] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015.

[15] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.

[16] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168, ACM, 2006.

[17] G. Chen, "Lecture slides in Math 250: Classification," February 2018.

[18] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[19] K. L. Clarkson, "Nearest-neighbor searching and metric space dimensions," in Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pp. 15–59, 2006.

[20] A. Beygelzimer, S. Kakadet, J. Langford, S. Arya, D. Mount, and S. Li, "FNN: Fast nearest neighbor search algorithms and applications," 2013.

[21] MIT OpenCourseWare, "16. Learning: Support vector machines," Jan. 2014.

[22] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2019. R package version 1.7-0.1.
