

Chapter 8: Theoretical part

Phylogenetic analyses using protein sequences.

Fred R. Opperdoes
Research Unit for Tropical Diseases, C. de Duve Institute of Cellular Pathology and Université catholique de Louvain, Avenue Hippocrate, 75, B-1200 Brussels
Email: [email protected]

8.1. Introduction

Phylogenetic analyses using ribosomal RNA (rRNA) sequences, as initiated by Woese and collaborators (Woese and Fox, 1977), suggest that the living world is divided into three domains: Eukaryota, Archaebacteria and Eubacteria (Fig. VIII.1). According to this so-called "universal tree of life" and based on morphological and biochemical evidence it was originally inferred that the earliest eukaryotic cells would have been Archaezoa, i.e. amitochondriate organisms adapted to an anaerobic life style, such as the extant diplomonads (e.g. Giardia), Parabasalia (e.g. Trichomonas) and the Microsporidia (Cavalier-Smith, 1993). Mitochondria then would have been acquired at a later stage from a bacterial endosymbiont belonging to the group of the α-proteobacteria. Also according to this tree the Euglenozoa, comprising trypanosomatids, bodonids and euglenoids, were the first group to have acquired a mitochondrion via endosymbiosis.

[Figure VIII.1: tree diagram. Labels shown: Eubacteria, Archaebacteria, Eukaryota, Animals, Fungi, Plants, Algae, Ciliates, Protists, Microsporidia, Trypanosomatidae, Euglena, Diplomonads, Parabasalia.]

Fig. VIII.1. Tree of life based on 16-18S rRNA sequences. Thus, Archaea may have more similarity to Bacteria or Eucarya than both of them have to each other, in good agreement with the finding that Archaea exhibit a mixture of eucaryal and bacterial traits at the molecular level. This kind of tree often has been called "universal tree of life". (modified from Patterson and Sogin, 1993)


However, a major problem associated with this "tree of life" is that the trunk of early eukaryotic evolution, as well as many of the protist branches, are much longer than any other part of the tree. This can be explained by a sudden increase in size of the rRNA from the 16S bacterial type to the 18S eukaryotic type in order to accommodate additional proteins when the large ribosomal subunit increased in size from 50S to 60S. It is well known that all tree construction methods are sensitive to the so-called long-branch attraction phenomenon, which places long branches preferentially together with the outgroup towards the bottom of the tree. This has led to the suspicion that the part of the rRNA tree representing protist evolution may not reflect the true evolutionary events. Indeed, during the last several years phylogenetic methods using protein-coding genes have clearly demonstrated that molecular phylogeny based on rRNA often does not delineate phylogenetic relationships between domains or between major lineages of these domains. Especially in the case of the protists this would render the use of rRNA sequences highly unsuitable.

In recent years an increasing number of protein sequences from archaebacteria, eubacteria and eukaryotes have become available and it seems that the so-called housekeeping proteins of the eukaryotes have not undergone such a sudden evolutionary drift. Analyses based on these protein sequences suggest a massive and rapid radiation of protists, algae, fungi and animals almost simultaneously (Keeling and Doolittle, 1996; Philippe and Adoutte, 1998). Much of the new information obtained with these protein sequences also contradicts the original idea that the amitochondriate Archaezoa were the first eukaryotes on Earth (Germot et al., 1996, 1997; Germot and Philippe, 1999; Philippe and Germot, 2000).

Although rRNAs have been very useful in unravelling the phylogenetic relationships of organisms, many questions remain unsolved. Therefore a valuable alternative for the construction of phylogenetic trees is the use of protein-coding sequences, and the above section has already illustrated that interesting answers can be obtained from proteins as well. In fact, this chapter demonstrates that protein sequences are equally powerful in unravelling the affiliations of organisms and can do so over very long time spans.

8.2. Arguments in favour of protein sequences rather than DNA sequences

A phylogenetic tree based on phosphoglycerate kinase sequences from all major kingdoms, i.e. Animalia, Fungi, Plantae, Protista, Eubacteria and Archaebacteria, illustrates that alternative "trees of life" can be created from protein sequences (Fig. VIII.2). In the following sections a number of arguments are developed as to why it may be advantageous to use protein-coding sequences rather than nucleic acid sequences for the construction of phylogenetic trees.


Fig. VIII.2. The tree was constructed from representative phosphoglycerate kinase protein sequences from all major kingdoms, i.e. Animalia, Fungi, Plantae, Protista, Eubacteria and Archaebacteria. Note that the wheat sequence represents the chloroplast PGK, which clusters with the bacteria.

8.2.1. Why protein sequences
Since it is the DNA that contains all the information to create functional proteins, it is often thought that the DNA should also be used in molecular evolution studies. However, there are many reasons why it may be more appropriate to use protein sequences for such analyses. The fundamental building blocks of life are proteins. The catalysts of virtually all of the chemical transformations in the cell are proteins. The functional properties of proteins are determined by the sequence of the 20 amino acids. In many cases proteins are largely self-folding. For protein-encoding genes, the object on which natural selection acts is the protein itself and not the DNA. The underlying DNA sequence reflects this process in combination with species-specific pressures on DNA sequence (like the need for thermophiles to have DNA that is resistant to melting, or a very high or low GC content). Thus if function demands that a protein maintains a specific sequence, there still is sufficient room for the DNA sequence to change.

8.2.1.1. The Genetic Code
The most important argument to use protein rather than DNA sequences is that the information present in the genes is interpreted only via the intermediate of the genetic code. Four different nucleotides taken three at a time can result in 64 different possible triplet codes; more than enough to encode 20 amino acids (Table VIII.1). One codon ATG, representing methionine and also serving as the initiation codon, represents the start of a protein; every other amino acid may be encoded by 1 to 6 different triplet codes, and finally 3 of the 64 codes, called stop (or termination) codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency depending on the nature of the gene and its level of expression, and this distribution of frequency is referred to as "codon usage". Codon usage varies widely between species.

Table VIII.1. The Genetic Code
The relationship between the codons of nucleic acids and the amino acids for which they code is embodied in the Genetic Code (slight variations of it are found in protists, in mitochondria and in chloroplasts). The 64 possible triplets of bases in a codon and the amino acids coded for are shown.

First                  Second position                  Third
position     U(T)      C         A         G            position

U(T)         Phe       Ser       Tyr       Cys          U(T)
             Phe       Ser       Tyr       Cys          C
             Leu       Ser       STOP      STOP         A
             Leu       Ser       STOP      Trp          G

C            Leu       Pro       His       Arg          U(T)
             Leu       Pro       His       Arg          C
             Leu       Pro       Gln       Arg          A
             Leu       Pro       Gln       Arg          G

A            Ile       Thr       Asn       Ser          U(T)
             Ile       Thr       Asn       Ser          C
             Ile       Thr       Lys       Arg          A
             Met       Thr       Lys       Arg          G

G            Val       Ala       Asp       Gly          U(T)
             Val       Ala       Asp       Gly          C
             Val       Ala       Glu       Gly          A
             Val       Ala       Glu       Gly          G
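As an illustration, the genetic code of Table VIII.1 can be written down as a small lookup table and the number of codons per amino acid counted. This is only a minimal sketch of the standard code; the names BASES, AMINO_ACIDS, CODON_TABLE and DEGENERACY are chosen here for illustration and do not come from any particular program.

```python
from collections import Counter
from itertools import product

# Standard genetic code, written in DNA letters. The 64 codons are generated
# in the order TTT, TTC, TTA, TTG, TCT, ... and paired with the amino acids
# in one-letter code ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of different codons that specify each amino acid (or stop).
DEGENERACY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    for aa, n in sorted(DEGENERACY.items(), key=lambda item: (-item[1], item[0])):
        print(aa, n)
```

Counting in this way reproduces the statement in the text: leucine, serine and arginine have six codons each, methionine and tryptophan only one, and three codons are stops.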

8.2.1.2. Codon bias
Amino-acid codons are degenerate, with wobble in the third position. For instance the amino acids leucine, serine and arginine are each encoded by 6 triplets. Valine, proline, threonine, alanine and glycine are each encoded by four triplets. Since the base composition of the DNA may vary between organisms, not all organisms have the same codon preference. Yeasts, protists, and animals all have different codon preferences, and as a consequence the same protein sequence would in each of these organisms result in differences in DNA sequence which are related to codon bias and not to evolution. Also, some protists use the codons TAA and TAG to encode glutamine, rather than STOP, and in many mitochondria the codon TGA encodes tryptophan, rather than STOP. The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is. Therefore, it may be advantageous to first translate a coding sequence or open reading frame (ORF) into its corresponding protein sequence. The result will be a peptide sequence in either the one- or three-letter code (Table VIII.2).

Table VIII.2. Single and 3-letter codes for amino acids.
All proteins are polymers of the 20 naturally occurring L-amino acids. They are listed here along with their abbreviations:

Alanine          Ala    A
Cysteine         Cys    C
Aspartic AciD    Asp    D
Glutamic Acid    Glu    E
Phenylalanine    Phe    F
Glycine          Gly    G
Histidine        His    H
Isoleucine       Ile    I
Lysine           Lys    K
Leucine          Leu    L
Methionine       Met    M
AsparagiNe       Asn    N
Proline          Pro    P
Glutamine        Gln    Q
ARginine         Arg    R
Serine           Ser    S
Threonine        Thr    T
Valine           Val    V
Tryptophan       Trp    W
TYrosine         Tyr    Y

In addition B may be used for Asx (Aspartate or Asparagine) and Z for Glx (Glutamate or Glutamine); X denotes an unknown residue. J, O and U are not used. The one-letter code is invariably used when comparing and aligning sequences of proteins. Most are easily remembered by their initial letters. Note that Cysteine and Methionine are the only two sulphur-containing amino acids.

8.2.1.3. Long Time Horizon
Homologous sequences that diverge with time tend to incorporate mutations more or less at random. This makes the two sequences increasingly different from each other as time passes. The chance that a certain position in the DNA incorporates a second mutation, so obscuring the first mutational event, or even a back mutation resulting in no observable difference, will increase with the total number of mutations that have been incorporated and thus will also increase with time. In protein-coding sequences the first and second position of each codon are less prone to the incorporation of mutations, because this will almost always lead to a change in amino acid in the corresponding position of the protein. The third position, also called the wobble position (see above), in most cases may be mutated without directly affecting the protein. When comparing protein-coding sequences that have diverged for possibly hundreds of millions of years, it is very likely that the wobble bases in the codons will have become randomised. By excluding the wobble bases, i.e. by removing every third nucleotide from the protein-coding sequence (a general technique in phylogenetic analyses), one is in effect looking at amino-acid sequences. It may be cumbersome to remove the wobble bases from the DNA sequences, while it is often much easier to simply translate an open reading frame into its corresponding protein sequence. There are several more reasons why it may be advantageous to translate an open reading frame into protein before carrying out phylogenetic analyses.
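The two options discussed here, removing the wobble bases or translating the ORF, can be sketched in a few lines. The example below is only an illustration: the two short "genes" are invented for the purpose and simply use different synonymous codons for the same peptide, the standard genetic code is assumed, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in DNA letters ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an open reading frame codon by codon (standard code assumed)."""
    orf = orf.upper().replace("U", "T")
    if len(orf) % 3:
        raise ValueError("ORF length must be a multiple of three")
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))


def drop_third_positions(orf):
    """Remove the wobble base, i.e. every third nucleotide, from an ORF."""
    return "".join(base for i, base in enumerate(orf) if i % 3 != 2)


def percent_identity(a, b):
    """Percentage of identical positions in two equally long, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented ORFs that encode the same peptide with different codons.
gene_a = "ATGCTGTCCCGTGGC"
gene_b = "ATGTTATCTAGAGGA"

if __name__ == "__main__":
    print("DNA identity     :", percent_identity(gene_a, gene_b))
    print("protein identity :", percent_identity(translate(gene_a), translate(gene_b)))
    print("without wobble   :", drop_third_positions(gene_a), drop_third_positions(gene_b))
```

In this toy case the two DNA sequences differ at several positions although the encoded peptides are identical, which is the effect of codon preference described above. Note that leucine and arginine codons can also differ in their first position, so removing third positions is only an approximation to working with the protein sequence.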

8.2.1.4. Advantages of the translation of DNA into protein
DNA is composed of only four kinds of unit: A, G, C and T. If gaps are not allowed, on average 25% of residues in two randomly chosen aligned sequences will be identical. However, as soon as gaps are allowed, as much as 50% of residues in two randomly chosen aligned sequences can be identical. Such a situation may obscure any genuine relationship that may exist between two gene sequences, especially when comparing distantly related or rapidly evolving gene sequences. By contrast the alignment of proteins with their 20 amino acids is less cumbersome. On average 5% of residues in two randomly chosen and aligned sequences will be identical. Even after the introduction of gaps, still only 10-15% of residues in two randomly chosen aligned sequences are identical. Thus, once gene sequences have been translated into their corresponding protein sequences, they are much easier to align. Translation of DNA into 21 different types of codon (20 amino acids and a terminator) sharpens the information considerably and significantly improves the signal-to-noise ratio.
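The figures of 25% and 5% for ungapped random alignments follow from a simple calculation, sketched below under the assumption that residues occur independently and, for those two figures, with equal frequency. The function name is hypothetical.

```python
def expected_identity(frequencies):
    """Chance that two independently drawn residues are identical.

    For residue frequencies p_i this is the sum of p_i squared; the input
    may be given as counts or proportions and is normalised first.
    """
    total = sum(frequencies)
    return sum((f / total) ** 2 for f in frequencies)


dna_uniform = expected_identity([1] * 4)       # four equally frequent bases
protein_uniform = expected_identity([1] * 20)  # twenty equally frequent amino acids
dna_skewed = expected_identity([4, 1, 1, 4])   # an invented, strongly biased composition

if __name__ == "__main__":
    print(f"DNA, uniform composition     : {dna_uniform:.2%}")
    print(f"protein, uniform composition : {protein_uniform:.2%}")
    print(f"DNA, skewed composition      : {dna_skewed:.2%}")
```

A biased base composition raises the chance identity of DNA sequences further, which adds to the argument for comparing proteins. The higher values quoted for gapped alignments are not covered by this simple formula.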

8.2.1.5. Nature of sequence divergence in proteins (the PAM)
Mutations in the DNA of protein-encoding genes are transmitted to their corresponding proteins. With time more and more mutations will be incorporated and as a consequence the two descendants of one ancestral protein will diverge. The observed sequence difference between two proteins that incorporate mutations is, however, not linear with time but takes the course of a negative exponential (Fig. VIII.3). This is the result not only of the fact that each position is subject to the incorporation of mutations, but also of reverse changes ("back mutations") and multiple hits. Such events increase in number as the evolutionary distance between two homologous proteins increases. This leads to an underestimation of evolutionary distances between two homologous proteins and as a consequence the observed percentage of difference between two protein sequences is not proportional to their actual evolutionary distance.

[Figure VIII.3: plot relating the PAM value (0 to 400) to the observed distance (5 to 85%); the "twilight zone" is marked.]

Fig. VIII.3. When the PAM distance value between two distantly related proteins nears the value 250 it becomes difficult to tell whether the two proteins are homologous, or whether they are two randomly taken proteins that can be aligned by chance. In that case we speak of the 'twilight zone'. (modified from Doolittle, 1987).

A measure that is proportional to the true evolutionary distance between two proteins is the PAM value (Dayhoff, 1978). PAM is the number of Accepted Point Mutations per 100 amino acids. Examples of some PAM values and their corresponding observed distances are given in Table VIII.3. Two homologous proteins which have had a common ancestor and have a PAM distance of 250-300 (80-85% distance) or more cannot be distinguished from two randomly chosen and aligned proteins of similar length. Therefore, in general, phylogenetic analyses using protein sequences must be limited to proteins which are less than 80-85% different. Proteins with functions that are less essential to the organism evolve rapidly. As a result evolutionary information is quickly erased and thus such proteins can only be used for the study of closely related organisms. However, house-keeping proteins, such as histones, enzymes of core metabolism and proteins of the cytoskeleton, evolve slowly and incorporate between 1 and 10 mutations per 100 residues per 100 million years (Table VIII.4). Therefore, it takes a considerable time before these proteins have incorporated so many substitutions that all evolutionary information has been erased. Because of this slow rate of evolution house-keeping proteins are excellent tools to trace evolutionary relationships over long periods of time. For instance the slow mutation rate of the enzyme glutamate dehydrogenase provides us with a theoretical look-back window of several times the age of our solar system. Thus, proteins can be excellent tools to study the evolutionary relationships of both closely and distantly related taxa.
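The correspondence listed in Table VIII.3 can be used as a simple lookup to convert an observed percentage difference into a PAM distance. The sketch below copies the table values and interpolates linearly between neighbouring entries; the linear interpolation is an assumption made here for convenience, not Dayhoff's procedure, and the function name is hypothetical.

```python
from bisect import bisect_left

# (observed % difference, evolutionary distance in PAM), from Table VIII.3.
TABLE_VIII_3 = [
    (1, 1), (5, 5), (10, 11), (15, 17), (20, 23), (25, 30),
    (30, 38), (35, 47), (40, 56), (45, 67), (50, 80), (55, 94),
    (60, 112), (65, 133), (70, 159), (75, 195), (80, 246), (85, 328),
]


def pam_from_observed(percent_difference):
    """Estimate the PAM distance for an observed % difference.

    Values between table entries are linearly interpolated (an assumption);
    values outside the tabulated range of 1 to 85% are rejected.
    """
    observed = [x for x, _ in TABLE_VIII_3]
    if not observed[0] <= percent_difference <= observed[-1]:
        raise ValueError("observed difference outside the range of Table VIII.3")
    i = bisect_left(observed, percent_difference)
    x1, y1 = TABLE_VIII_3[i]
    if x1 == percent_difference:
        return float(y1)
    x0, y0 = TABLE_VIII_3[i - 1]
    return y0 + (y1 - y0) * (percent_difference - x0) / (x1 - x0)


if __name__ == "__main__":
    for observed_difference in (10, 50, 52.5, 80):
        print(observed_difference, pam_from_observed(observed_difference))
```

The rapidly growing gap between the two columns at high percentages shows why observed differences beyond roughly 80% carry little usable information.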

Table VIII.3. Correspondence between observed differences between proteins and their evolutionary distance

Observed percentage    Evolutionary distance
difference             in PAM

 1                       1
 5                       5
10                      11
15                      17
20                      23
25                      30
30                      38
35                      47
40                      56
45                      67
50                      80
55                      94
60                     112
65                     133
70                     159
75                     195
80                     246
85                     328    <- Twilight zone

As the evolutionary distance increases, the probability of superimposed mutations becomes greater, resulting in a lower observed percent difference. (Adapted from Table 23 in Dayhoff, 1978).

Table VIII.4. The highly different rates at which proteins evolve.

Type of protein                    Rate of change        Theoretical lookback
                                   (PAMs / 100 myrs)     time (myrs)

Pseudogenes                        400                      45
Fibrinopeptides                     90                     200
Ig lambda chain C region            27                     670
Somatotropin                        25                     800
Ribonucleases                       21                     850
Haemoglobin alpha chain             12                    1500
Acid proteases                       8                    2300
Cytochrome c                         4                    5000
Adenylate kinase                     3.2                  6000
Glyceraldehyde-P dehydrogenase       2                    9000
Glutamate dehydrogenase              0.9                 18000

Useful lookback time = 360 PAMs (Adapted from Table 1 in Dayhoff, 1978)

8.2.1.6. Introns and non-coding DNA
When confronted with a DNA sequence, a biologist needs to figure out where the code for a protein starts and stops. This problem is even more difficult because a eukaryotic genome contains much more DNA than is needed to encode proteins; the sequence of a random piece of DNA is likely to encode no protein whatsoever. The DNA which encodes proteins is often not continuous, but rather is frequently scattered in separate blocks called exons (Fig. VIII.4). Many of these problems can be reduced by sequencing of RNA (via cDNA) rather than DNA itself, because the cDNA contains much less extraneous material, and because the separate exons have been joined in one continuous stretch in the RNA. Hundreds of thousands of these (partial) sequences are now available in the so-called EST (expressed sequence tag) databases that are being compiled for many organisms.

[Figure VIII.4: gene diagram drawn 5' to 3'. Labels shown: Flanking region, TATA box, Transcription initiation, Initiation codon, Exon 1, Intron I, Exon 2, Intron II, Exon 3, Stop codon, AATAA, Poly(A) addition site, Flanking region.]

Fig. VIII.4. Structure of a typical eukaryotic gene with introns and exons.

Eukaryotic genes in general have been fragmented into exons and interspersing introns. Due to differences in evolutionary pressure on exons and introns the rate of incorporation of base substitutions in these two elements of eukaryotic genes may be dramatically different. Therefore, a study of the evolution of a protein using its DNA sequence should only include coding sequences. This requires that in every DNA sequence all the introns be edited out, which may be cumbersome and time consuming. Therefore, it may be easier to translate a cDNA into its corresponding protein, rather than using the genomic DNA sequences.

8.2.1.7. Multigene families
Organisms may contain many highly similar genes, while only one peptide sequence can be identified (e.g. histones, tubulins and glyceraldehyde-phosphate dehydrogenase in humans). Using these DNA sequences, it would be difficult to decide which genes are expressed and which are not and thus to decide which genes to include in the analysis.

8.2.1.8. RNA editing
The DNA sequence does not always translate directly into the amino-acid sequence. The pre-mRNA may require alteration of its coding sequence before it can be translated into a functional protein. This is called post-transcriptional editing. Different mechanisms of post-transcriptional editing are known. These are:

8.2.1.8.1. RNA editing in the Kinetoplastida.
This involves the insertion or deletion of one or more uridines in the pre-mRNA, using guide RNAs as templates. This may lead to frame shifts. Even worse, non-coded initiation codons or entire codons for amino acids may be added or removed during the editing process (Fig. VIII.5). This may lead to major differences between the DNA and the resulting mature mRNA sequence (Arts and Benne, 1996). In some extreme cases, such as in Trypanosoma brucei mitochondrial DNA, more than 50% of a gene is edited (such genes are called pan-edited genes). In these extreme cases DNA and mature mRNA are no longer able to hybridise.


8.2.1.8.2. Chemical modification of nucleotides.
Here specific nucleotides of the transcribed RNA species may be altered by chemical or enzymatic modification. Well-known examples are the modifications of transfer RNA molecules.

8.2.1.9. Some good advice
It is recommended to analyse a data set both ways (DNA and protein). Keep in mind the following. For a group of species or taxa that are relatively close in time or that are closely related (like viral proteins or vertebrate enzymes) DNA-based analysis is probably a good way to go, since you avoid such problems as differences in codon bias or saturation of the third position of codons. It is nevertheless strongly recommended to carry out an analysis on the protein sequence data as well. In case there is ambiguity in the alignment of gene sequences, it is recommended to translate the sequences to their corresponding protein sequences first, then align these and determine the position of gaps in the DNA sequences according to the protein alignment.

[Figure VIII.5: linear gene map with a 1000 bp scale bar. Gene labels shown: 12S, 9S, ND7, CYb, A6, CR3, COII, MURF1, COI, CR4, CR5, ND4, S12, ND5, ND8, CR2, COIII, MURF2, ND1, and VR (twice).]

Fig. VIII.5. Linear map of the 22-kb maxicircle of T. brucei. The genes above the line are transcribed from left to right, whereas the genes beneath the line are transcribed from right to left. The 9S and 12S rRNA genes have added uridines at their 3' termini (black boxes). Transcripts from Cyb, COII and MURF2 have limited internal editing (black boxes). Shaded boxes indicate genes that are extensively or pan-edited. The variable region of the maxicircle is indicated by VR (modified from Arts and Benne, 1996).

Multigene families (for instance genes coding for different, but similar, isoenzymes) may cause problems and one has to be careful in deciding to exclude or include such sequences (including them may result in paralogous sequences in the data set and peculiar-looking phylogenetic trees).
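The advice to align at the protein level first and then to place the gaps in the DNA according to the protein alignment can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function name and the short example sequence are invented, and the sketch assumes that the DNA is an intron-free coding sequence supplying exactly one codon per residue, without checking that the codons really encode the residues.

```python
def thread_gaps(aligned_protein, dna, gap="-"):
    """Copy the gaps of an aligned protein sequence onto its coding DNA.

    Every residue in the aligned protein consumes the next codon of the
    DNA; every gap character becomes a gap of three positions.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(ch != gap for ch in aligned_protein)
    if len(dna) % 3 or n_residues != len(codons):
        raise ValueError("DNA must supply exactly one codon per aligned residue")
    remaining = iter(codons)
    return "".join(gap * 3 if ch == gap else next(remaining)
                   for ch in aligned_protein)


# Invented example: the peptide CDELD, aligned with one gap, and its coding DNA.
aligned_peptide = "CDE-LD"
coding_dna = "TGTGATGAACTTGAT"
aligned_dna = thread_gaps(aligned_peptide, coding_dna)

if __name__ == "__main__":
    print(aligned_peptide)
    print(aligned_dna)
```

Because gaps are always inserted as whole codons, the reading frame of the DNA alignment is preserved.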

8.3. Construction of phylogenetic trees

8.3.1. Alignment of two protein sequences
In order to study molecular evolution of proteins one has to compare the sequences of homologous proteins. However, such a comparison is not easy to make because homologous proteins are never identical, but only similar. In addition to having substitutions, there will be insertions and deletions in one sequence relative to the other. Moreover, phylogenetic analyses should be carried out on homologous residues only (i.e. those residues in each of the sequences that originate from a common ancestral residue). Thus it is essential that two or more sequences be properly aligned relative to each other. Several excellent programs have been developed for the alignment of protein sequences and are available in the public domain (Table VIII.5). You can download the source code and compile it, or download the executables, and run the programs directly on your computer. There are also many places on the Internet where alignment programs for two or more sequences can be accessed directly and multiple alignments created (see the practical part of this chapter).

Table VIII.5. Programs for the multiple alignment of protein sequences on the Web

ClustalW or ClustalX    http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Darwin                  http://cbrg.inf.ethz.ch/section3_3.html#secDarwin
ProteinPredict          http://www.embl-heidelberg.de/predictprotein/predictprotein.html
Match-Box               http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.html

8.3.1.1. The visual method
How do such programs work? The easiest way to understand this is to have a look at a visual type of alignment of two sequences such as is carried out by the "dot-matrix" method (Fig. VIII.6). In this method the two sequences to be aligned are written out as column and row headings of a matrix. Dots are put in the matrix wherever the residues in the two sequences are identical. If the two sequences are identical there will be dots in all the diagonal elements of the matrix. If the two sequences are different, but can be aligned without gaps, there will be dots in most of the diagonal elements. If a gap occurred in one of the sequences, the alignment diagonal will be shifted vertically or horizontally.

Fig. VIII.7 shows a dot-matrix for two highly homologous trypanosome phosphoglycerate kinases and two more distantly related phosphoglycerate kinase sequences. It is obvious that there is no problem in aligning two sequences as long as they are of similar length and have more than 50% identity.
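The dot-matrix method is easily sketched in code. The example below uses the short peptide pairs of panels A, B and C of Fig. VIII.6; the function names are hypothetical, and identities are simply marked with an asterisk.

```python
def dot_matrix(row_seq, col_seq):
    """Matrix of booleans: True wherever the row and column residues are identical."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Text picture of the dot matrix, one row per residue of row_seq."""
    lines = ["  " + " ".join(col_seq)]
    for residue, row in zip(row_seq, dot_matrix(row_seq, col_seq)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


def diagonal_identities(row_seq, col_seq, offset=0):
    """Number of dots on the diagonal shifted 'offset' columns to the right."""
    return sum(
        1
        for i, residue in enumerate(row_seq)
        if 0 <= i + offset < len(col_seq) and residue == col_seq[i + offset]
    )


# Sequence pairs of panels A, B and C of Fig. VIII.6 (rows first, columns second).
PANEL_A = ("CDEGLDPGSERK", "CDEGLDPGSERK")
PANEL_B = ("CDEPLDPGSQRK", "CDEGLDPGSERK")
PANEL_C = ("CDELDPGSQRK", "CDEGLDPGSERK")

if __name__ == "__main__":
    print(render(*PANEL_C))
    print("main diagonal   :", diagonal_identities(*PANEL_C))
    print("shifted diagonal:", diagonal_identities(*PANEL_C, offset=1))
```

For panel C the identities are split between the main diagonal and the diagonal shifted by one position, which is how a single-residue gap shows up in the matrix.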

[Figure VIII.6: four dot matrices. Panel A: CDEGLDPGSERK against CDEGLDPGSERK. Panel B: CDEPLDPGSQRK against CDEGLDPGSERK. Panel C: CDELDPGSQRK against CDEGLDPGSERK. Panel D: CDEDGLSQLK against CDEGLDPLSERK.]

Fig. VIII.6. Dot matrices for sets of more or less related sequence pairs. Each identity between two positions leads to a positive score of value 1. A mismatch gives no score. Gaps in the alignment, indicated as discontinuities along the diagonal, are scored as penalties or negative scores. A, two identical sequences, identity score = 12; B, two sequences with two mismatches, score = 10; C, two sequences with a gap of one position plus a mismatch, score = 10, gap penalty = -1; D, two sequences with both gaps and mismatches. In A, B and C there is no ambiguity as to how to align the two sequences. In example D there are two possible ways to align the two sequences. For the upper diagonal the identity score = 6 with one gap and for the lower diagonal the score = 7 with two gaps.

8.3.1.2. Computer algorithms
Computer algorithms that have been developed for the alignment of two homologous sequences in principle use the same procedure as the dot-matrix method. Each residue of one sequence is compared with each residue of the other and when there is an identity a certain value is given to that position in the matrix. Then a diagonal line is drawn connecting the points with the highest score. Horizontal or vertical shifts from the diagonal due to the presence of gaps are given a penalty.

The choice of the value for each positive score relative to that of the gap penalty will of course strongly influence the quality of the resulting alignment. Too high a gap penalty suppresses gaps even where insertions or deletions have really occurred, so that non-homologous residues or regions are forced into alignment with each other. Too low a gap penalty lets the program open gaps freely, so that dissimilar regions are not aligned with each other at all, but with gap regions, and chance identities are over-interpreted. Most programs allow the user to select an appropriate weight matrix for the scoring of either identities or similarities of amino acids, and to adjust the gap penalty value and the size of the window that scans the diagonal.
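The scoring scheme described above (a positive score for each identity, nothing for a mismatch, a penalty for each gap position) can be turned into a small dynamic-programming routine in the style of the Needleman-Wunsch algorithm. The sketch below returns only the best global score, uses a simple linear gap penalty and identity scoring, and is not the implementation of any of the programs mentioned in this chapter; the function name and default values are assumptions chosen to match the scores of Fig. VIII.6.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of sequences a and b.

    Needleman-Wunsch style recursion with a linear gap penalty: every
    gap position, including gaps at the ends, costs 'gap'. Only the
    score is computed; no traceback of the alignment itself is done.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b needs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            gap_in_b = previous[j] + gap      # x aligned against a gap
            gap_in_a = current[j - 1] + gap   # y aligned against a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


# Sequence pairs of panels A, B and C of Fig. VIII.6.
PANEL_A = ("CDEGLDPGSERK", "CDEGLDPGSERK")
PANEL_B = ("CDEPLDPGSQRK", "CDEGLDPGSERK")
PANEL_C = ("CDELDPGSQRK", "CDEGLDPGSERK")

if __name__ == "__main__":
    for name, pair in (("A", PANEL_A), ("B", PANEL_B), ("C", PANEL_C)):
        print(name, global_alignment_score(*pair))
```

For panel C the routine finds the ten identities of the figure and subtracts the penalty for the single gap position. Raising the gap penalty lowers the score of every gapped solution, which is how the balance between the two values steers the resulting alignment.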

[Figure VIII.7: two dot-matrix plots, panels A and B.]

Fig. VIII.7. A. Dot matrix of two homologous phosphoglycerate kinase sequences from Trypanosoma brucei and from T. congolense. The two sequences are 81% identical. B. Dot matrix of two less homologous (50% identical) phosphoglycerate kinase sequences from Euglena and from Trypanosoma. The latter has an 80 amino-acid long insertion with respect to a partial Euglena sequence.

Many different weight matrices have been developed for use with sequence alignment programs to reflect some observed rules of mutation. Some of these are shown in Table VIII.6.

Table VIII.6. Matrices used in protein sequence comparisons

Identity matrix
Mutation-cost matrix
Hydrophobicity matrix
Log odds matrix
PAM250 matrix
BLOSUM62 (Block Substitution) matrix
JTT matrix
Gonnet matrix

It is up to the scientific judgement of the user, and depends on the dataset that is being analysed, what kind of weight matrix will be chosen. The identity matrix contains only the value one for identities and zero in the case of a mutation. The mutation-cost matrix scores the minimum number of base changes required to convert one amino acid into another. It contains the values 0, 1, 2 and 3 only. The hydrophobicity matrix takes into account the physicochemical properties of the amino acids. It assumes that the replacement of a hydrophobic amino acid by another hydrophobic amino acid is a more likely event than replacement by a hydrophilic one. Log odds matrices use the log odds ratio $S_{ij}$ of the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance: $S_{ij} = \log\left(q_{ij} / (p_i\,p_j)\right)$, where $q_{ij}$ is the frequency with which residues i and j are observed to align in sequences known to be related, and $p_i$ and $p_j$ are the frequencies of occurrence of residues i and j in the set of sequences. The PAM250 (Dayhoff, 1978) and BLOSUM62 (Henikoff and Henikoff, 1992) matrices are log odds matrices, where the two probabilities have been determined from, respectively, a limited (PAM250) and a very large (BLOSUM) database of aligned homologous protein sequences. The latter two are implemented in the ClustalW program. The JTT matrix (Jones et al., 1992) is an update to the PAM matrix. The Gonnet matrix (Gonnet et al., 1992) is a scoring matrix based on the alignment of the entire SwissProt database, where 1.7 x 10^6 matches were used from sequences differing by 6.4 to 100 PAM. It is implemented in the Darwin programme by the same authors. The PAM and BLOSUM matrices are now widely used in alignment programs.
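Two of the matrix types described above are simple enough to compute directly. The sketch below derives mutation-cost values from the standard genetic code and evaluates the log odds formula for a single pair of residues. The frequencies in the example are invented, the natural logarithm is an assumption (the text does not specify a base, and published matrices are usually scaled and rounded), and the function names are hypothetical.

```python
import math
from itertools import product

# Codons of the standard genetic code, grouped by amino acid ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append(codon)


def mutation_cost(aa1, aa2):
    """Minimum number of base changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(codon1, codon2))
        for codon1 in CODONS_FOR[aa1]
        for codon2 in CODONS_FOR[aa2]
    )


def log_odds(q_ij, p_i, p_j):
    """Log odds score: log of observed pair frequency over chance expectation."""
    return math.log(q_ij / (p_i * p_j))


if __name__ == "__main__":
    print("Met -> Ile:", mutation_cost("M", "I"))
    print("Met -> Trp:", mutation_cost("M", "W"))
    print("Met -> Tyr:", mutation_cost("M", "Y"))
    # Invented frequencies: a pair seen four times more often than expected by chance.
    print("log odds  :", log_odds(0.04, 0.1, 0.1))
```

A positive log odds score means that the pair is aligned more often in related sequences than chance alone would predict; a negative score means the substitution is disfavoured.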

Several algorithms for the alignment of protein sequences have been developed. Some of the better known are the Pearson-Lipman algorithm (Pearson and Lipman, 1988), used in Pearson's well-known FASTA program, the Needleman-Wunsch (1970), the Smith-Waterman (1981) and the BLAST (Altschul et al., 1990) algorithms. They are used in sequence-comparison programs that search for homologous sequences in large sequence databases and in programs for multiple sequence alignment.
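To give an idea of how such an algorithm works, here is a minimal sketch of the dynamic programming behind a global (Needleman-Wunsch type) alignment. It returns only the optimal score, not the alignment itself. The scoring scheme is an assumption made for brevity: an identity matrix (1 for a match, 0 for a mismatch) and a fixed penalty of -1 per gap position. Real programs use a matrix such as those of Table VIII.6 and usually more elaborate gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score of strings a and b."""
    # prev[j] holds the best score for aligning the processed part of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                      # a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                # gap in b
            left = curr[j - 1] + gap          # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Aligning "ACD" with "AD", for instance, gives two matches and one gap position under this scheme.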

8.3.1.3. The retrieval of sequences and multiple sequence alignment

In order to create a multiple alignment of homologous protein sequences you have to collect all related sequences from a database. Various databases are available for this purpose. First of all, the SwissProt database (http://www.expasy.ch/) is the most reliable source for the collection of sequences. In this database each sequence has been checked by experts and extensively annotated. Moreover, homologous sequences from different organisms have highly similar locus names, which facilitates the recognition and retrieval of all related sequences. The PIR (Protein Identification Resource, http://www-nbrf.georgetown.edu/) database contains many more protein sequences, but they have not all been checked and the database is redundant. This is also the case with the GenPept (translated GenBank, http://www.ncbi.nlm.nih.gov/) and the TrEMBL (translated EMBL sequences, http://www.ebi.ac.uk/) databases. However, if you do not want to miss any sequence homologous to your protein, even an incomplete sequence or a pseudogene, these are the databases to include in your search. The Enzyme or EC database (http://www.expasy.ch/enzyme/) is a useful tool to retrieve all homologous sequences of one specific enzyme at once.

For the construction of reliable phylogenetic trees the quality of the multiple alignment of the protein sequences is of the utmost importance. There are many programs available for the multiple alignment of proteins (Table VIII.5). Most programs quickly align pairs of sequences first and roughly determine the degree of identity between each pair. Then the sequences are aligned more precisely in a progressive way, using an appropriate mutation matrix, starting with the two most closely related sequences.

PredictProtein (Table VIII.5; Rost, 1996) is a program for the prediction of the secondary structure of a protein. It makes use of the information on the mutability of each residue in a multiple sequence alignment. Therefore, it first aligns the query sequence with all homologous sequences available in the SwissProt database. So this is an easy way to collect all homologous sequences in the database and to create a multiple sequence alignment that includes your own query sequence as well.

The algorithm of the Pileup program of the GCG (Genetics Computer Group) package resembles that of Clustal and both programs give more or less the same result. GCG is a commercial package for DNA and protein analysis. It can be accessed via various servers, provided that you or your laboratory has an account on that server. The Darwin server (Table VIII.5) also allows you to align your sequence with homologous sequences in the SwissProt database.

The algorithm of Match-Box (Table VIII.5; Depiereux and Feytmans, 1992; Depiereux et al., 1997) allows the simultaneous alignment of several protein sequences, where each aligned position is weighted by a reliability score. The Match-Box software aligns sequences on strict statistical criteria. The method circumvents the gap penalty requirement: gaps are the result of the alignment and not a governing parameter of the matching procedure. A reliability score is provided below each aligned position. The Match-Box program is particularly suitable for finding and aligning conserved structural motifs within distantly related proteins.


It is advised to try more than one program and to keep in mind that most multiple sequence alignment programs work best with sequences of similar length.

8.3.1.3.1. Prodom, Pfam and Blocks databases

Another way to create your alignment is to first obtain multiple sequence alignments from a database, to which your own sequence can be added. There exist databases of prealigned sequences that share domain structures or homologous blocks of sequences. These databases are Prodom, Pfam and Blocks, and all are accessible via the Internet. They have been compiled by comparing and aligning all homologous sequences of the SwissProt database. The Prodom database (http://www.toulouse.inra.fr/prodom.html) consists of an automatic compilation of homologous protein domains. The Pfam database (http://www.sanger.ac.uk/Software/Pfam/) is actually formed in two separate ways. Pfam-A consists of accurate, human-crafted multiple alignments, whereas Pfam-B is an automatic clustering of the rest of SwissProt and TrEMBL. The Blocks database (http://www.blocks.fhcrc.org/), which has been compiled using the Blast algorithm, contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Therefore, in general the compiled sequences in Blocks are shorter than in the other two databases. Prodom, Pfam and Blocks alignments serve as an excellent basis to start a multiple sequence alignment and/or phylogeny project for your protein.

8.3.1.4. Manual adjustment of a protein alignment

An automatically produced multiple sequence alignment often needs manual adjustment to improve its quality, especially at the position of gaps. Such improvement can be obtained by using all the knowledge that is available about a protein. Information about active site residues and elements of secondary structure, such as α-helices, β-strands and loops, may be of great help here. While manually adjusting multiple alignments one should have knowledge of the physicochemical properties of the 20 amino acids and keep in mind a number of rules of thumb for the mutability of the various amino acids. In a folded protein the residues D, R, E, N and K (cf. Table VIII.2) are preferably mutated to residues of similar properties. Since they are polar or charged they are mainly found on the surface of the folded protein. Moreover, since they play a lesser role in protein folding they mutate rather easily. Hydrophobic residues (F, A, M, I, L, Y, V and W) are preferentially replaced by other hydrophobic ones. These residues are mainly internal and determine the folding of the protein. They thus mutate rather slowly. The residues C, H, Q, S and T are generally indifferent and may be replaced with any other type of residue. The residues D, R, E, N, K, C, H, Q, S and T, when conserved throughout the alignment, are very likely residues that are involved in the active site of an enzyme. So the multiple alignment should be adjusted in such a way as to keep these residues aligned. Periodicity of charged residues may provide information about the presence of elements of secondary structure such as α-helices and β-strands. α-Helices have a repetition of 3.6 residues per turn. Stretches of more than 12 amino acids with a charged amino acid every 3rd or 4th position in the sequence may be indicative of the presence of an α-helix. Short stretches with a repetition of charged amino acids every 2nd residue may be indicative of a β-strand structure. Gaps are almost never found in elements of secondary structure but only in regions with loops. Moreover, the residues P and G interfere with secondary structure elements and thus have a preference for loop regions. Since loops easily acquire or lose residues you should always try to align gaps together with P and G residues. Hydrophobicity (or hydropathy) profiles according to Kyte and Doolittle (1982) of two homologous proteins are in general strikingly similar (Fig. VIII.8) and may also be of help in manually adjusting the alignment.

Fig. VIII.8. Hydrophobicity (or hydropathy) profiles according to Kyte and Doolittle (1982) of homologous proteins are in general strikingly similar and may provide a tool in the alignment of two or more proteins. Panels A, B and C plot hydrophobicity against residue number. The three phosphoglycerate kinase sequences have respectively 81% (T. brucei (A) vs. T. congolense (B)) and 50% (E. gracilis (C) vs. T. brucei (A)) identity to each other.
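A hydropathy profile of the kind mentioned above is simply a sliding-window average of per-residue hydropathy values. The sketch below uses the Kyte and Doolittle scale. The window size of 9 residues is an assumption chosen for illustration, and the function name is our own. Residues outside the 20 standard amino acids (e.g. X) are not handled and will raise an error.

```python
# Kyte and Doolittle (1982) hydropathy values for the 20 amino acids.
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}

def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every window of the given size."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]

# First 20 residues of the Leishmania GPD sequence used in the practical.
profile = hydropathy_profile("MSTKQHSAKDELLYLNKAVV", window=9)
```

Plotting such profiles for two homologous sequences one above the other, as in Fig. VIII.8, makes it easy to spot regions where the alignment is shifted.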


Fig. VIII.9. Terminology used together with phylogenetic trees. [Tree diagram with a root, external nodes A-E and internal nodes F-I.] A-E are external nodes, the extant (existing) OTUs; F-I are internal nodes, which represent ancestral units. OTUs are operational taxonomic units; they can be species, populations, individuals, genes or proteins. Topology: the order of the nodes on the tree.

A very useful tool for the manual alignment of proteins, in case there exists a crystal structure of at least one of the enzymes in the alignment, is the NRL-3D database (http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html). This database contains protein sequences plus secondary structure information. It tells you which residues of a protein sequence belong to conserved areas of secondary structure such as α-helices and β-strands. This information, if available, is also provided by the SwissProt database.

8.4. Methods for protein phylogeny

Once a multiple sequence alignment has been prepared, such an alignment may serve further evolutionary analyses. The final goal of such an analysis is to prepare an evolutionary tree describing the relationship of the various taxa with respect to each other. In order to understand the terminology used in phylogeny, study the hypothetical tree shown in Fig. VIII.9.

There exist various methods for the preparation of evolutionary trees. These are "Distance Methods", based on a matrix containing pairwise distance values between all sequences in the alignment, and "Character-Based Methods", which carry out calculations on each of the individual residues of the sequences. In general, distance methods are fast, while character-based methods are much slower, because they are CPU (central processing unit) intensive. Table VIII.7 gives an overview of the available tree construction methods.

8.4.1. Distance matrix methods

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. This method of tree construction is very sensitive to differences in branch length or unequal rates of evolution. Therefore, it should only be used with closely related OTUs, or when there is constancy of evolutionary rate. The method is often used in combination with isoenzyme or restriction site data or with morphological criteria.

Table VIII.7. Publicly available tree construction methods for proteins

Distance methods (in the public domain)
  UPGMA
  Least squares methods
    Fitch and Kitsch of the Phylip package (Felsenstein, 1993; Fitch and Margoliash, 1967)
  Neighbor-joining methods
    Neighbor of the Phylip package (Felsenstein, 1993)
    ClustalW and X (Thompson et al., 1994)
    Distnj in the Molphy package (Adachi and Hasegawa, 1992)
  Darwin (Gonnet et al., 1992)
Character-based methods
  Protein maximum likelihood
    Protml (Adachi and Hasegawa, 1992; very CPU intensive)
    Puzzle (Strimmer and von Haeseler, 1997; a heuristic method much faster than Protml)
  Protein maximum parsimony
    Protpars (Felsenstein, 1993)
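The sequential clustering used by UPGMA (section 8.4.1) can be sketched as follows. The function name, the nested-tuple representation of the tree and the example distances are our own choices for illustration; this is not the code of any of the programs listed in Table VIII.7. At each step the two closest clusters are joined, the new node is placed at half their distance, and distances to the new cluster are averaged over all its members.

```python
def upgma(names, matrix):
    """Cluster OTUs by UPGMA; return (tree as nested tuples, root height)."""
    # Each cluster is (subtree, number of OTUs in it, height of its node).
    clusters = [(name, 1, 0.0) for name in names]
    d = [row[:] for row in matrix]            # working copy of the distances
    while len(clusters) > 1:
        n = len(clusters)
        # Find the pair of clusters with the smallest distance.
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: d[ij[0]][ij[1]])
        tree_i, size_i, _ = clusters[i]
        tree_j, size_j, _ = clusters[j]
        merged = ((tree_i, tree_j), size_i + size_j, d[i][j] / 2.0)
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance to the merged cluster: mean over all member OTUs.
        new_row = [(d[i][k] * size_i + d[j][k] * size_j) / (size_i + size_j)
                   for k in keep]
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [merged]
    return clusters[0][0], clusters[0][2]

# Hypothetical distances: A and B are close, C is equally far from both.
tree, height = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

Because every node is placed at half the distance between the clusters it joins, all OTUs end up at the same distance from the root. This is why the method fails when rates of evolution differ between lineages.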

8.4.1.1. Transformed distance methods

Corrections of the observed distance values, according to an evolutionary model, may be introduced to obtain trees with true evolutionary distances (PAM values, Kimura's formula (Kimura, 1983)), or corrections are carried out with reference to an outgroup. This method is recommended when evolutionarily distant organisms are included in the data set.
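As an illustration of such a correction, the sketch below first computes the observed fraction of differing residues p between two aligned sequences and then converts it into an estimated number of substitutions per site. The chapter cites Kimura (1983) without giving the formula; the code assumes the commonly quoted form d = -ln(1 - p - 0.2 p^2). Skipping positions that contain a gap is also an assumption, and the function names are our own.

```python
import math

def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over the gap-free aligned positions."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no gap-free positions to compare")
    return sum(1 for a, b in pairs if a != b) / len(pairs)

def kimura_distance(p):
    """Corrected distance, assuming d = -ln(1 - p - 0.2 * p**2)."""
    x = 1.0 - p - 0.2 * p * p
    if x <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(x)
```

For closely related sequences the corrected and observed values are almost equal, whereas at 50% observed difference the corrected distance is already about 0.8 substitutions per site. This shows why uncorrected distances underestimate the divergence of distant organisms.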

8.4.1.1.1. Neighbors relation methods

Fitch (Fitch, 1981) and Neighbor (Felsenstein, 1993), a neighbor-joining method (Saitou and Nei, 1987), should be used with corrected distance matrices. These methods, which assume an evolutionary model for the transformation of the observed distances into evolutionary distances (e.g. PAM or Blosum matrices, or Kimura's formula), result in only one best possible tree. The program Kitsch (Felsenstein, 1993) implements a method that assumes a molecular clock.

8.4.2. Character-based methods

8.4.2.1. Maximum parsimony

This method uses the sequence information itself rather than distance information. The information content used by this method is not necessarily larger than for the distance matrix methods, since there is only a limited number of informative sites. (For more information on what is an informative site, see the chapter by Yves van de Peer......). It searches among all possible trees for the tree that requires the minimum number of substitutions at each informative site. The maximum parsimony method does not assume an evolutionary model, and therefore branch lengths should not be calculated. The method may find many equally parsimonious trees. Programs available for maximum parsimony analysis are Protpars (Felsenstein, 1993) and Paup (by David Swofford).
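The following sketch counts the informative sites of an alignment. The chapter refers to another chapter for the definition, so the code assumes the usual one: a column is informative when at least two different residues each occur in at least two sequences. Ignoring gap characters is a further assumption, and the example alignment is invented.

```python
from collections import Counter

def informative_sites(alignment, gap="-"):
    """Return the 0-based indices of parsimony-informative columns."""
    sites = []
    for column in range(len(alignment[0])):
        counts = Counter(seq[column] for seq in alignment if seq[column] != gap)
        # At least two residue types must each be shared by two or more OTUs.
        shared = sum(1 for count in counts.values() if count >= 2)
        if shared >= 2:
            sites.append(column)
    return sites

# Invented four-sequence alignment: only columns 1 and 2 are informative.
example = ["AAGA", "AAGC", "ATCT", "ATCG"]
sites = informative_sites(example)
```

Columns that are constant, or in which every variant occurs in a single sequence only, fit all tree topologies equally well and therefore do not help parsimony to choose between trees.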

8.4.2.2. Maximum likelihood

This is a method that evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesised history would give rise to the observed data set. The supposition is that a history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood. The available programs to analyse protein sequences using maximum likelihood are ProtML of the Molphy package (Adachi and Hasegawa, 1992) and Puzzle (Strimmer and von Haeseler, 1997). The latter program applies a heuristic method and is much faster than ProtML, but is not guaranteed to find the best tree.

8.4.3. Limitations of the various methods

Distance approaches (UPGMA, corrected distances and neighbor-joining) do not use the original (sequence) data, but derived distance information. Some information is said to be lost. Character-state approaches (maximum parsimony, maximum likelihood) are said to be more powerful than distance methods because they use the raw data. However, this is usually a small fraction of the data. Maximum parsimony uses only the informative sites. So when the number of informative sites is not large, this method is often less efficient than distance methods (Saitou and Nei, 1986). Maximum parsimony is notorious for its sensitivity to codon bias and unequal rates of evolution. None of the methods is reliable when OTUs with highly unequal evolutionary separation are included in the data set.

8.5. Trees

8.5.1. The rooting of trees

Most methods for the inference of phylogeny yield trees that are unrooted. Thus from a tree by itself it is impossible to tell which of the OTUs branched off before all the others. To root a tree one should add an outgroup to the data set. An outgroup is an OTU for which external information (e.g. paleontological information) is available that indicates that the outgroup branched off before all other taxa.

When you try to root a tree do not choose an outgroup that is very distantly related to your taxa. This may result in serious topological errors, because sites may have become saturated with multiple mutations, by which information may have been erased. Do not choose an outgroup that is too closely related to the taxa in question either. In this case it may not be a true outgroup. The use of more than one outgroup generally improves the estimate of tree topology. In the absence of a good outgroup the root may be positioned by assuming approximately equal evolutionary rates over all the branches. In this way the root is put at the midpoint of the longest pathway between two OTUs. This way of rooting is called mid-point rooting.
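The idea of mid-point rooting can be sketched as follows, assuming that you already have the path lengths between all pairs of OTUs as measured along the branches of the unrooted tree. The function name and the example values are hypothetical. The sketch only locates the root: it reports the two OTUs at the ends of the longest pathway and the distance of the root from each of them.

```python
def midpoint_root(names, path_length):
    """Return (OTU 1, OTU 2, distance of the root from either OTU)."""
    n = len(names)
    # The longest pathway between any two OTUs in the tree.
    i, j = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: path_length[ij[0]][ij[1]])
    return names[i], names[j], path_length[i][j] / 2.0

# Hypothetical path lengths measured along the branches of a 3-OTU tree.
root = midpoint_root(["A", "B", "C"], [[0, 3, 7], [3, 0, 8], [7, 8, 0]])
```

In this example the root is placed on the pathway between B and C, 4 units from either OTU.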

An interesting way to root a tree is to exploit a gene duplication event that took place before speciation and that has led to the formation of isoenzymes or homologous proteins, which can then be used as an outgroup. An example is the gene duplication leading to the various vacuolar ATPases, present in archaebacteria, eubacteria and eukaryotes, which allowed the rooting of the “tree of life” (Gogarten et al., 1989).

Tree topologies may strongly depend on the following: the distance or parsimony method applied; the number of OTUs included in the alignment; the order of the OTUs in the alignment; the selection of an appropriate outgroup; and finally, the presence of widely varying branch lengths. It should, therefore, be kept in mind that none of the methods can guarantee the one tree with the correct topology. To get an idea of the reliability of the topology of the resulting tree, one should do one or all of the following:

Apply more than one method (distance, maximum parsimony, maximum likelihood) to the data set.
Vary the parameters used by the different programs, such as the seed value and jumble factor for the order of OTU addition, or change the order of the sequences manually.
When in doubt, apply various evolutionary models for matrix construction.
Add or remove one or more OTUs and see how this influences the tree topology.
Try to include an outgroup that may serve as a root for your tree.
Be aware of taxa with very long branches that may be subject to the so-called long branch attraction effect, which may lead to an anomalous positioning of that taxon.
Apply "bootstrap" or "jackknife" analyses to your data set and prepare a consensus tree of 100 - 1000 replicas (depending on the size of the data set and on computer power). Keep in mind that in the case of bootstrap analysis only nodes that occur in more than 95% of the cases are reliable.

Only when widely different methods provide you with similar or identical tree topologies, and such topologies are supported by good bootstrap values (>95%), can the trees be considered robust and thus reliable.
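The resampling step of a bootstrap analysis can be sketched as follows. Each replica is a new alignment of the same length, made by drawing columns from the original alignment at random with replacement. The function name is our own, and the random number generator is passed in so that a run can be repeated. The sketch stops at the resampling: a tree still has to be inferred from every replica with the method of your choice, and the consensus tree built from these trees.

```python
import random

def bootstrap_replica(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column choice is applied to every sequence, so that
    # homologous positions stay together.
    return ["".join(seq[c] for c in columns) for seq in alignment]

rng = random.Random(1)                      # fixed seed for repeatability
alignment = ["ACDE", "ACDF", "GCDF"]        # invented example
replicas = [bootstrap_replica(alignment, rng) for _ in range(100)]
```

A node that is recovered in, say, 97 of the 100 trees built from such replicas has a bootstrap value of 97%.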

8.5.2. Erroneous trees due to paralogous genes

The presence of more than one homologue of a certain gene, or of different members of a gene family, in one and the same organism, as is especially the case when isoenzymes are present, may considerably complicate phylogenetic analyses. When such a situation is encountered one speaks of the presence of paralogous genes. Two genes are said to be paralogous if they diverged after a duplication event, whereas they are said to be orthologous if they diverged after a speciation event. Let's take the example of the mammalian lactate dehydrogenase isoenzymes M and L (Fig. VIII.10). In the case of mouse and rat the isoenzymes are the result of a gene duplication that took place well before the separation of these two species. Here one says that the LDH M gene family is paralogous to the LDH L gene family. If one is not aware of the presence of paralogous genes and isoenzymes in the organisms, because for instance only one sequence for each organism (e.g. LDH L for mouse and LDH M for rat) is available and isoenzyme data are missing, then the resulting phylogeny would suggest a much earlier separation of mouse and rat. This will inevitably lead to erroneous phylogenetic trees.

Fig. VIII.10. The presence of paralogous and orthologous genes within a phylogenetic analysis. [Tree with two clades: LDH L (Rat LDH L, Mouse LDH L) and LDH M (Rat LDH M, Mouse LDH M).] () indicates speciation and X indicates gene duplication. The mouse isoenzymes L and M are said to be paralogous, whereas the rat and mouse isoenzymes L are orthologous.

8.5.3. How to draw a high quality tree picture

A high quality picture of the output tree is most conveniently made by using the TreeView program by Page (1992), which is available free of charge and runs on personal computers (Macintosh and MS-Windows). It can be retrieved from <http://taxonomy.zoology.gla.ac.uk/rod/treeview.html>. TreeView understands the ClustalW treefile conventions, reads multifurcating trees and is able to simultaneously display branch lengths and support values for each branch. On a workstation under Unix you can use the TreeTool program to display and manipulate trees (ftp://rdp.life.uiuc.edu/rdp/programs/TreeTool/). The Phylip package by Felsenstein (1993) also comes with facilities for the drawing and modification of high quality trees, such as Retree, Drawtree and Drawgram.


Chapter 8: Practical 1

A phylogenetic analysis of the Leishmanial GPD gene carried out via the Internet.

Here follows a practical exercise for the creation of a phylogenetic tree carried out entirely via the World-Wide Web. The analysis deals with an open reading frame (ORF) coding for the Leishmania mexicana NAD-dependent glycerol-3-phosphate dehydrogenase (GPD, EC 1.1.1.8), an enzyme which in the haemoflagellated protozoan parasite L. mexicana is associated with microbodies (Kohl et al., 1995), and for which the crystal structure was recently solved (Suresh et al., 2000). The analysis is carried out with no other tools than a computer with access to the World Wide Web and a word processing program such as MS Word, which allows you to cut and paste text files. Keep in mind that all files from a word processor should be saved as “text only” files.

The following analyses are carried out:

Consult the Enzyme database and collect all available protein sequences of a specific enzyme in one step
Search the Brookhaven Protein database for a 3-D structure
Scan a DNA sequence against GenBank using the Blast algorithm
Scan a DNA sequence for the presence of open reading frames
Translate an open reading frame into a protein sequence
Do a Blast homology search using a protein sequence
Reformat sequences with the ReadSeq utility
Create a multiple alignment of homologous protein sequences
Create a publishable alignment figure
Create a phylogenetic tree using the maximum likelihood program Puzzle

Before starting the project we collect some general information about the enzyme we are going to study. First we connect to the Enzyme, or EC database, available at the Expasy server in Switzerland (http://www.expasy.ch/enzyme/), which holds general information about enzymes, their official names and their EC numbers, the reactions they catalyse and the pathways they are involved in. It also gives access to all protein sequences available in the SwissProt database. Once you are in the Enzyme database select the enzyme by its EC number, 1.1.1.8, and get the requested information. Check out all the information available in the database. Try also to find publications in the Medline database that deal with the Leishmania enzyme. Now retrieve all the SwissProt entries referenced on this page by FTP (file transfer protocol) for future use. Give this file the name “gpd_sw.pep”.

You should have found, using the information available in the above database, that by now at least one crystal structure of GPD, that of the Leishmania enzyme, has been solved (Suresh et al., 2000). If a three-dimensional structure of GPD is already available, it can be found in the Brookhaven Protein Database or PDB (http://www.rcsb.org/pdb/) under the accession codes 1EVY and 1EVZ. If the information is not yet in the database, it should become available very soon.

The gene for the Leishmania GPD and its flanking nucleotides are available from the GenBank database at the NCBI (National Center for Biotechnology Information at the NIH in Bethesda, USA, http://www.ncbi.nlm.nih.gov) under the accession number X89739. Retrieve the sequence and save it on the disk of your computer. The open reading frame is also available from our web site as “gpd_orf.nuc”.

Now you are going to scan the DNA sequence for the presence of open reading frames. Submit the entire nucleotide sequence to the NCBI open reading frame scanner (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) to find all possible coding regions in all 6 reading frames. The longest open reading frame is found in frame +1, from position 643 to 1743, with a length of 1101 nucleotides or 366 amino acids. The sequence ends with -SKL, a typical microbody targeting sequence.
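What such an ORF scanner does can be sketched in a few lines. The code below is not the algorithm of the NCBI server but a simple illustration under stated assumptions: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon (TAA, TAG or TGA), and both strands are searched. Coordinates are 0-based and refer to the strand on which the ORF is found. Note that the 1101 nucleotides quoted above correspond to 367 codons, i.e. 366 amino acids plus the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_strand(dna):
    """Yield (start, end) of every ATG...stop ORF in the three frames."""
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                    # first start codon in this ORF
            elif start is not None and codon in STOP_CODONS:
                yield start, pos + 3           # end includes the stop codon
                start = None

def longest_orf(dna):
    """Return (strand, start, end) of the longest ORF, or None."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    found = [("+", s, e) for s, e in orfs_in_strand(dna)]
    found += [("-", s, e) for s, e in orfs_in_strand(reverse)]
    if not found:
        return None
    return max(found, key=lambda orf: orf[2] - orf[1])

# Toy sequence containing a single short ORF: ATG AAA TTT TAG.
orf = longest_orf("CCATGAAATTTTAGCC")
```

For the toy sequence the ORF is found on the forward strand at positions 2 to 14 (0-based, end exclusive), i.e. 12 nucleotides including the stop codon.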

To find out whether the Leishmania ORF has any homology to other nucleic acid sequences in the GenBank database, we now perform a BLAST (Basic Local Alignment Search Tool) search using the server of the NCBI (http://www.ncbi.nlm.nih.gov/BLAST/). Paste the entire sequence of the ORF into the Blast sequence submission window. Use all standard settings and select the complete GenBank nucleotide database (nr). Submit the sequence and wait for the result to arrive. This may take a few minutes depending on the load on the server.

From the information that is obtained we may conclude that the Leishmania ORF indeed codes for a genuine GPD, but it is intriguing that the highest degree of identity is reported with bacterial sequences, rather than with eukaryotic sequences.

To improve the sensitivity of the Blast search we shall now translate the longest ORF into its corresponding protein sequence. For the translation we access the Protein Machine utility at the EBI (European Bioinformatics Institute, http://www.ebi.ac.uk/translate/). Paste the nucleotide sequence comprising the ORF into the sequence window of the Protein Machine and then submit it.

After a few seconds the server returns the following output:

MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKRENVLFLKGVQLASNITFTSDVEKAYNGAEIILFVIPTQFLRGFFEKSGGNLIAYAKEKQVPVLVCTKGIERSTLKFPAEIIGEFLPSPLLSVLAGPSFAIEVATGVFTCVSIASADINVARRLQRIMSTGDRSFVCWATTDTVGCEVASAVKNVLAIGSGVANGLGMGLNARAALIMRGLLEIRDLTAALGGDGSAVFGLAGLGDLQLTCSSELSRNFTVGKKLGKGLPIEEIQRTSKAVAEGVATADPLMRLAKQLKVKMPLCHQIYEIVYKKKNPRDALADLLSCGLQDEGLPPLFKRSASTPSKL


A BLAST search of a protein sequence has a much better signal-to-noise ratio than a corresponding search of a nucleic acid. Therefore we now use the Leishmania protein as query sequence in a protein Blast search. This way we hope to find all glycerol-3-phosphate dehydrogenase sequences in the protein databank. Although we could use the translated GenPept database of GenBank, we prefer to search the non-redundant SwissProt database. Although this is a derived database, it has been thoroughly checked by protein scientists for redundancy and correctness of the included sequence information. Moreover, this database is extensively annotated. For our search we will use the BLAST server at the NCBI again. Paste your sequence into the sequence submission window and then select the options "Blastp" and the "SwissProt" database. The Blast output should be returned within a few seconds. Because of the higher signal-to-noise ratio of a protein search, only glycerol-3-phosphate dehydrogenases are reported at the top of the output.

We are puzzled by the fact that in both BLAST searches (in SwissProt as well as in GenBank) bacterial sequences score much better than the eukaryotic GPD sequences, even though we use the eukaryotic protist GPD sequence of Leishmania as the query sequence. Therefore, we decide to create a multiple alignment of all the available sequences and to study the evolutionary relationship of the Leishmanial enzyme with the other glycerol-3-phosphate dehydrogenases.

The file “gpd_sw.pep” we obtained from the Expasy server by FTP (see above) contains all the GPD sequences also reported in our last Blast search. However, each entry in the file contains a lot of text information not required for our analyses. Moreover, before one can use this file the sequences have to be transformed into an appropriate format. One of the problems of the use of freely available software is that each program and each database uses a different file format. The multiple alignment program ClustalW recognises several file formats, but the server that we are going to use accepts only the Pearson/Fasta format. Thus we need to reformat all our sequences. Removal of text and reformatting of the sequences can easily be done manually, because the Pearson/Fasta format is the simplest format available. However, there exists a very useful format conversion utility on the web called Readseq, which can do this automatically. To quickly reformat our sequences we now connect to the Readseq server at the NIH (http://bimas.dcrt.nih.gov/molbio/readseq/) and select the Pearson/Fasta format as output. The entire content of the “gpd_sw.pep” file is pasted into the sequence window and then submitted. The result, which is returned immediately, is copied into a text document and saved on disk under the name “gpd.fasta”. At the top of this file we now add our own Leishmania GPD protein sequence in exactly the same format as used for the other sequences and we save the file again under the same name. This file will be used as our input file for the construction of a multiple sequence alignment.
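If you prefer to do the reformatting on your own computer, the conversion is simple enough to sketch. The code below is not Readseq, only an illustration of what the conversion involves. It assumes the standard SwissProt flat-file layout, in which an entry starts with an "ID" line, the sequence follows the "SQ" line and the entry is closed by "//". The entry used in the example is invented.

```python
def swissprot_to_fasta(text, width=60):
    """Convert SwissProt flat-file entries to Pearson/Fasta format."""
    records = []
    name, residues, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("ID "):
            name = line.split()[1]             # entry name follows "ID"
        elif line.startswith("SQ "):
            in_sequence = True                 # sequence lines come next
        elif line.startswith("//"):            # end of the entry
            if name is not None:
                seq = "".join(residues)
                body = "\n".join(seq[i:i + width]
                                 for i in range(0, len(seq), width))
                records.append(">" + name + "\n" + body + "\n")
            name, residues, in_sequence = None, [], False
        elif in_sequence:
            residues.append("".join(line.split()))   # drop the blanks
    return "".join(records)

# Invented entry, for illustration only.
entry = """ID   TEST_ENTRY   STANDARD;   PRT;   12 AA.
DE   Hypothetical example entry.
SQ   SEQUENCE   12 AA;
     MSTKQHSAKD EL
//
"""
fasta = swissprot_to_fasta(entry)
```

All annotation lines are dropped and only the entry name and the sequence are kept, which is all that the alignment server needs.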


We are now contacting the ClustalW server at the Baylor College inHouston, USA (http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html) and paste the content of the “gpd.fasta” file in Pearson/Fastaformat into the sequence window. Then we submit the data to the serverusing all default settings. After a few minutes the result is returned. Itexists in two parts. The first part contains an alignment in GCG/MSFformat, while the second part contains the same alignment inPearson/Fasta format. This second alignment is now copied and pastedinto a text file and saved to disk as “gpd_aligned.fasta”.

Encouraged by the rapid progress you may decide (albeit a bit prematurely)to already prepare a figure of publishable quality of this alignment. To doso, you go to the BOXSHADE server at the Swiss EMBL node(http://www.ch.embnet.org/software/BOX_form.html) where you maysubmit the sequence alignment in Pearson/Fasta format. The serverallows you to print the result in various ways depending on yourcomputer and your requirements.

Now we have to convert the ClustalW alignment (a Pearson/Fasta formatwas attached to the end of the ClustalW output) to a GCG/MSF formatwhich is very well suited for manual editing in a word processor. This isdone again with the Readseq server at the NIH. The alignment in MSFformat is now saved to disk as “gpd_aligned.msf“. (By the way, Readseqalso allows the creation of Pretty Print alignments ready for publication).(Try with PrettyPrint as output which kind of presentation of thealignment you would prefer).

If you have a sequence editor on your computer, the following steps can be carried out using this editor. If not, the newly created MSF file is now opened in MS Word on your computer and the font is changed to a non-proportional font such as Courier, 9 point. If you have installed on your computer a special font that displays the letters in different colours for use with molecular biology applications, such a font is highly recommended. All positions with gaps are now removed from the alignment, since it is difficult to align homologous residues in these areas. (Highlight columns by moving the cursor to one corner of the block and then clicking the opposite corner while holding down the alt and shift keys, or else by moving the mouse while holding down the above keys and the mouse button. The highlighted block can then be cut and pasted as usual.) This is also the moment to adjust the alignment manually if you have observed any obvious alignment errors. If a 3-D structure is already available, the structural information can also be used to improve the alignment. The edited alignment is now saved to disk as a “text only” file under the name “gpd_corr.msf” and pasted into Readseq again for reformatting to the Phylip 3.2 format, a format which is recognised by most phylogeny inference programs. Save the result in a file named “gpd_corr.phy”.
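The removal of all gapped positions can also be done programmatically rather than by column selection in a word processor. The sketch below is one possible way to do it in Python. The function name strip_gap_columns and the set of gap characters ("-", "." and "~") are assumptions, and manual correction of alignment errors still has to be done by eye.

```python
def strip_gap_columns(alignment, gap_chars="-.~"):
    """Return a copy of the alignment without any column containing a gap.

    alignment: dict mapping sequence name -> aligned sequence string.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep only those columns in which no sequence has a gap character.
    keep = [i for i in range(length)
            if not any(s[i] in gap_chars for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}
```

Every sequence in the result has the same, shorter length, which is what the phylogeny programs expect.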


For the creation of our phylogenetic tree we are going to use the Puzzle program available on the server of the French “Institut Pasteur” (http://bioweb.pasteur.fr/seqanal/interfaces/Puzzle-simple.html). Submit the alignment in Phylip format to the server and wait for the results to arrive by email. This may take a few minutes, since the maximum likelihood method used by Puzzle, even though a heuristic one, is very CPU demanding. All results are returned to you by email, including the web address where you can access them directly on the server. This URL also allows you to run additional analyses on your intermediate data, such as a neighbor-joining analysis on the ML distance matrix, or the preparation of a nicely presented figure from the tree file. Create the figure of your tree using the program Drawtree on the server and print the result. It should look like Fig. VIII.11. Drawtree cannot show internal branch point labels containing puzzle frequencies or bootstrap values. This information is, however, provided in both the outfile and the treefile of Puzzle and can be used by other tree drawing programs such as Treeview.
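Since Drawtree does not display the support values, it can be handy to pull them out of the tree file directly. The following sketch assumes that the tree file is in Newick format and that the support values are written as integer labels immediately after the closing parenthesis of each internal node. Both are assumptions about the file at hand and should be checked against the actual treefile. The function name internal_support_values is hypothetical.

```python
import re

def internal_support_values(newick):
    """Return the integer labels that directly follow a ')' in a Newick string."""
    # A label counts only if it is followed by ':', ',', ')' or ';'.
    return [int(value) for value in re.findall(r"\)(\d+)(?=[:,);])", newick)]
```

The values are returned in the order in which the internal nodes are closed in the string, so they still have to be matched to the branch points of the drawn tree.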

From the topology of the tree it is immediately obvious that the GPD of the protist Leishmania clusters with that of Trypanosoma brucei (GPD_TRYBR), another member of the family Trypanosomatidae. Apparently their GPDs are more closely related to the bacterial GPDs than to any of their eukaryotic homologues. The clustering of the trypanosomatids with the bacteria is robust, with a puzzle frequency of 96%. A careful inspection of the position of gaps in the Leishmania and Trypanosoma sequences in the multiple alignment confirms this view. Apparently we have encountered an event of horizontal gene transfer, a discovery which took us less than 2 hours after the completion of the DNA sequence. To find the explanation for this exciting observation and to write the discussion for a future paper, we have to await the availability of more bacterial sequences. This will certainly take longer.

[Figure VIII.11: tree with the leaves GPDA CAEEL, GPDA CUPLA, GPDA SCHPO, GPD1 YARLY, GPD2 YEAST, GPD1 YEAST, GPDA RABBIT, GPDA HUMAN, GPDA MOUSE, GPDA RAT, GPDA FUGRU, GPDA DROME, GPDA DROPS, GPDA DROVI, GPDA DROAE, GPDA DROEZ, GPDA TREPA, GPDA SYNY3, GPDA HAEIN, GPDA COLI, GPD1 LEISH and GPDA TRYBR.]

Fig. VIII.11. Maximum likelihood tree of glycerol-3-phosphate dehydrogenase sequences available in the SwissProt database. The tree was created entirely using publicly available web tools. Puzzle frequencies at the branch points have been added manually.


Chapter 8 Practical 2

A comparison of the trypanosomatid phylogeny based on glyceraldehyde-phosphate dehydrogenase trees from gene and protein sequences.

This exercise deals with the phylogenetic relationships of the Trypanosomatidae, parasitic flagellated protists. As long as only a limited number of taxa were included in the 18S rRNA tree, the members of the genus Trypanosoma behaved as a paraphyletic group of organisms that seemed to have separated much earlier from the main tree of trypanosomatid evolution than the other members of the same family (Fernandes et al., 1993). When more and more Trypanosoma rRNA sequences became available and were subsequently added to the tree, the trypanosomes became a monophyletic group of taxa (Stevens and Gibson, 1999) and joined the other Trypanosomatidae, as had already been predicted from protein data (Wiemer et al., 1995; Hannaert et al., 1998). The paraphyly of trypanosomes as originally reported was apparently the result of the long branch attraction phenomenon. When the long branches were broken up into shorter ones by the addition of more taxa, this artefact disappeared. Thus, even within groups of related organisms, differences in the evolutionary rate of rRNA may strongly influence tree topologies.

In this exercise we are going to compare the phylogeny of the Trypanosomatidae as inferred from both the gene sequence and the corresponding protein sequence of the enzyme glyceraldehyde-phosphate dehydrogenase, and we shall see that essentially identical monophyletic trees are obtained.

All nucleotide and protein sequences, either partial cDNAs or complete gene sequences, are available in the GenBank database under the following accession numbers:

Trypanosoma brucei gambiense    AF047499
Trypanosoma congolense          AF047498
Trypanosoma vivax               AF047500
Leishmania major                AF047497
Phytomonas sp.                  AF047496
Herpetomonas samuelpessoai      AF047494
Leptomonas seymouri             AF047495
Crithidia fasciculata           AF047493
Trypanosoma brucei              X59955
Trypanosoma cruzi               X52898
Leishmania mexicana             X65226
Trypanoplasma borreli           X74535
Euglena gracilis                L39772


Create in your word processor two new files entitled “gapdh.pep” and “gapdh.nuc”. Contact the NCBI server at <http://www.ncbi.nlm.nih.gov/> and enter, one by one, each of the accession numbers into the search field using GenBank as the database. (A file with all the sequences is also available from our web site.) Then, for each of the entries, copy both the peptide and the nucleic acid sequences and paste them into the respective word processor files, using the GenBank format (see example below).
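As an alternative to entering the accession numbers one by one, the entries could in principle be retrieved through the NCBI E-utilities "efetch" service. This is not the procedure described in this exercise, which uses the web search form. The sketch below only illustrates the idea, and the function names efetch_url and fetch_entry are hypothetical. The choice of the "nuccore" database and the "gb" return type is an assumption suited to nucleotide entries.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build the efetch URL for one accession number."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": rettype, "retmode": "text"})
    return EFETCH + "?" + query

def fetch_entry(accession, db="nuccore", rettype="gb"):
    """Download one entry as text (requires network access)."""
    with urlopen(efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")
```

For example, fetch_entry("AF047499") would request the Trypanosoma brucei gambiense entry in GenBank format. When fetching many entries, NCBI's usage guidelines on request frequency should be respected.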

LOCUS       E_gracilis     353 bp
DEFINITION  E_gracilis, 353 bases, A1E40336 checksum.
ORIGIN
        1 MAPVKIGING FGRIGRMVFQ ALCDQGLLGT TFDVVGVVDM ATDADYFAYQ
       51 MKYDSVHGKF KHTVSTKKSD ANLAEADIIV VNGHEIKCIM ATRNPEDLPW
      101 GKLGVEYVVE STGLFTEADK ARGHLKAGAK KVIISAPGKG DLKTIVMGVN
      151 HTEYQASMDV VSNASCTTNC LAPLVHVLLK EGVGVEKGLM TTIHAYTATQ
      201 KTVDGPSKKD WRGGRAAAIN IIPSTTGAAK AVGEVLPAVK GKLTGMAFRV
      251 PTPDVSVVDL TFLAEKDTSI KEIDSLLKKA SQTYLKGILG FTDEELVSTD
      301 FVHDNRSSIY DSLATLQNNL PGEKRLFKVV SWYDNEWGYS NRVVDLLKHM
      351 SGN
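To check that an entry was pasted correctly, the bare sequence can be recovered from such a GenBank-style record by discarding the position numbers and spaces that follow the ORIGIN keyword. The sketch below assumes that ORIGIN stands on its own line and that an optional "//" line ends the entry. The function name sequence_from_origin is hypothetical.

```python
def sequence_from_origin(entry_text):
    """Extract the sequence that follows the ORIGIN line of one entry."""
    parts = []
    in_origin = False
    for line in entry_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Drop position numbers and spaces, keep the residue letters.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts).upper()
```

The length of the returned string can then be compared with the length stated on the LOCUS line (353 in the example above).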

When all sequences have been added, save the file under the same name, contact the ClustalW server at the Baylor College in Houston, USA (http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html) and paste the content of the “gapdh.pep” file in GenBank format into the sequence window. Then submit the data to the server using all default settings. After a few minutes the result is returned. It consists of two parts. The first part is an alignment in GCG/MSF format, while the second part is the same alignment in Pearson/Fasta format. This second alignment is now copied and pasted into a text file and saved to disk as “gapdh_aligned.fasta”.

Now we have to convert the ClustalW alignment to the GCG/MSF format, which is very well suited for manual editing in a word processor. This is again done with the Readseq server at the NIH. The alignment in MSF format is now saved to disk as “gapdh_aligned.msf”.

If you have a sequence editor on your computer, the following steps can best be carried out using this editor. If not, the newly created MSF file is now opened in MS Word on your computer and the font is changed to a non-proportional font such as Courier, 9 point. Since the alignment contains complete protein sequences as well as partial peptide sequences, we have to remove all positions with gaps from the alignment (in MS Word vertical columns can easily be selected and deleted at once using the following key combination: alt-shift-mouse button). The edited alignment is now saved to disk as a “text only” file under the name “gapdh_corr.msf” and pasted into Readseq again for reformatting to the Phylip 3.2 format, which is a format recognised by most phylogeny inference programs. Save the result in a file named “gapdh_corr.phy”.
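To make clear what a Phylip-style file contains, the sketch below writes an alignment in the common sequential PHYLIP layout: a first line with the number of sequences and the number of sites, followed by one line per sequence with the name padded to ten characters. The exact layout that Readseq emits for its "Phylip 3.2" option may differ in detail, so this is an illustration under that assumption. The function name to_phylip is hypothetical.

```python
def to_phylip(alignment, name_width=10):
    """Format a gap-free or gapped alignment as sequential PHYLIP text.

    alignment: dict mapping sequence name -> aligned sequence string.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Names are truncated or padded to a fixed width, as PHYLIP expects.
    labels = [name[:name_width].ljust(name_width) for name in alignment]
    if len(set(labels)) != len(labels):
        raise ValueError("names are not unique after truncation")
    n_sites = lengths.pop()
    lines = [" {} {}".format(len(alignment), n_sites)]
    for label, seq in zip(labels, alignment.values()):
        lines.append(label + seq)
    return "\n".join(lines) + "\n"
```

Because names are cut to ten characters, labels such as "T_brucei" and "T_brucei_gambiense" remain distinct, but longer names sharing their first ten characters would collide and are rejected.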


For the nucleic acid sequences the same steps are carried out as for the peptide sequences. The final file is saved as “gapdh_nuc_corr.phy”. This file is also available from our web site.

The two files in Phylip format with the peptide and nucleic acid sequence alignments, respectively, are now used for phylogenetic analyses on the French phylogeny server at the Pasteur Institute. All the phylogeny inference programs which we are going to use can be accessed from the directory http://bioweb.pasteur.fr/seqanal/interfaces. We first carry out a bootstrapped neighbor-joining analysis of the peptide sequences. Once we are in the directory we select the document “protdist.html” and paste the content of the file “gapdh_corr.phy” into the sequence window, or browse to locate the file on our disk. As the evolutionary model we choose Kimura for a speedy analysis, and we check the “multiple datasets” box and ask for 100 analyses. Then we launch the Protdist program. Protdist now calculates the pairwise distances between the protein sequences and creates a PAM distance matrix. This matrix is now used by the program Neighbor to create a neighbor-joining tree. Don’t forget that you have to select 100 multiple data sets again. The 100 different tree files so obtained are fed into the program Consense, which creates a consensus tree and calculates the bootstrap values for each bifurcation in the tree. The final tree is drawn using the program Drawtree. Don’t forget to select an output format appropriate for your computer or printer. Print the tree and compare it with Fig. VIII.12A.
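The Kimura option is fast because the distance depends only on the fraction p of differing residues between two aligned sequences, through Kimura's (1983) approximation D = -ln(1 - p - 0.2 p^2). The sketch below illustrates this calculation in Python. The function names are hypothetical, positions with a gap in either sequence are skipped (an assumption about gap handling), and the sketch is not Protdist's actual code.

```python
import math

def kimura_protein_distance(seq_a, seq_b, gap_chars="-."):
    """Kimura's approximate protein distance, D = -ln(1 - p - 0.2 * p**2)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only positions where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    argument = 1.0 - p - 0.2 * p * p
    if argument <= 0.0:
        raise ValueError("sequences too divergent for this formula")
    return -math.log(argument)

def distance_matrix(alignment):
    """Return nested dicts with the pairwise distance for every pair of names."""
    return {a: {b: kimura_protein_distance(alignment[a], alignment[b])
                for b in alignment}
            for a in alignment}
```

For two sequences differing at half of their positions (p = 0.5) the formula gives a distance of about 0.80, which shows how the correction inflates the raw fraction to account for multiple substitutions at the same site.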

Now create a bootstrapped neighbor-joining tree for the nucleic acid data by repeating the whole procedure described in the above paragraph, but use the programs DNAdist and Neighbor and the file “gapdh_nuc_corr.phy” instead. Print the tree and compare it with Fig. VIII.12B.
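The "multiple datasets" used in a bootstrapped analysis are resampled versions of the alignment. As a minimal sketch, assuming the standard bootstrap in which alignment columns are drawn at random with replacement until the original length is reached, the replicates could be generated as follows. The function name bootstrap_replicates and its parameters are hypothetical, and the sketch works for protein and nucleic acid alignments alike.

```python
import random

def bootstrap_replicates(alignment, n_replicates=100, seed=None):
    """Return a list of resampled alignments (dicts of name -> sequence)."""
    if not alignment:
        raise ValueError("alignment is empty")
    rng = random.Random(seed)
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    replicates = []
    for _ in range(n_replicates):
        # Draw column indices with replacement; the same columns are used
        # for every sequence so that homologous positions stay together.
        columns = [rng.randrange(length) for _ in range(length)]
        replicates.append({name: "".join(alignment[name][i] for i in columns)
                           for name in names})
    return replicates
```

Each replicate is then analysed in the same way as the original alignment, and the fraction of replicate trees containing a given grouping is reported as its bootstrap value.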

T. borreli and E. gracilis do not belong to the Trypanosomatidae, but they belong to the Euglenozoa, to which the Trypanosomatidae also belong. They have been selected as outgroups in order to root the trees. The two trees obtained by neighbor-joining analysis are very similar, as are the two ML trees shown in Fig. VIII.12. When you use the same datasets to carry out an MP analysis as well, you’ll find that all methods generate essentially the same result. The only difference between the two ML trees of Fig. VIII.12 is that H. samuelpessoai is paraphyletic in the nucleic acid tree, while it forms a monophyletic group with Phytomonas in the protein tree. However, this part of the tree is not well supported by high bootstrap values. The analyses show that, firstly, protein and nucleic acid based trees give the same or very similar results and, secondly, all Trypanosomatidae form a monophyletic group. Also, within this group the genus Trypanosoma is by itself monophyletic. Whereas rRNA trees suffer from unequal rates of evolution resulting in the long branch attraction artefact, trees based on housekeeping proteins, whether made with protein sequences or nucleic acid sequences, suffer less or not at all from this problem.


[Figure VIII.12, panels A and B: each tree has the leaves L. mexicana, L. major, C. fasciculata, L. seymouri, Phytomonas sp., H. samuelpessoai, T. brucei, T. gambiense, T. congolense, T. vivax, T. cruzi, T. borreli 1, T. borreli 2 and E. gracilis, with a scale bar of 0.1.]

Fig. VIII.12. Comparison of maximum likelihood trees of glyceraldehyde-3-phosphate dehydrogenase sequences made from partial protein sequences (A) and partial gene sequences (B). The trees were created entirely using publicly available web tools. Puzzle frequencies at the branch points have been added manually. The horizontal bar represents 10 mutations per 100 residues.


References

Adachi, J and Hasegawa, M (1992). Amino acid substitution of proteins coded for in mitochondrial DNA during mammalian evolution. Japanese Journal of Genetics, 67, 187-97.

Altschul, S F, Gish, W, Miller, W, Myers, E W and Lipman, D J (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-10.

Arts, G J and Benne, R (1996). Mechanism and evolution of RNA editing in kinetoplastida. Biochimica Biophysica Acta, 1307, 39-54.

Cavalier-Smith, T (1993). Kingdom protozoa and its 18 phyla. Microbiology Reviews, 57, 953-94.

Jones, D T, Taylor, W R and Thornton, J M (1992). The rapid generation of mutation data matrices from protein sequences. CABIOS, 8, 275-82.

Dayhoff, M O (ed.) (1978). Atlas of protein sequence and structure, Suppl 3, National Biomedical Research Foundation, Silver Spring, MD.

Dayhoff, M O (ed.) (1978). Atlas of protein sequence and structure, Vol. 5, National Biomedical Research Foundation, Silver Spring, MD.

Dayhoff, M O, Schwartz, R M and Orcutt, B C (1978). A model for evolutionary change. In Atlas of protein sequence and structure (Dayhoff, M O, ed.), vol. 5, suppl. 3, pp. 345-358, National Biomedical Research Foundation, Washington, D.C.

Depiereux, E and Feytmans, E (1992). MATCH-BOX - A fundamentally new algorithm for the simultaneous alignment of several protein sequences. Computer Applications in the Biosciences, 8, 501-9.

Depiereux, E, Baudoux, G, Briffeuil, P, Reginster, I, De Bolle, X, Vinals, C and Feytmans, E (1997). Match-Box_server: a multiple sequence alignment tool placing emphasis on reliability. Computer Applications in the Biosciences, 13, 249-56.

Doolittle, R F (1987). Of URFs and ORFs. University Science Books.

Felsenstein, J (1993). PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.

Fernandes, A P, Nelson, K and Beverley, S M (1993). Evolution of nuclear ribosomal RNAs in kinetoplastid protozoa: perspectives on the age and origins of parasitism. Proceedings of the National Academy of Sciences U S A, 90, 11608-12.

Fitch, W (1981). A non-sequential method for constructing trees and hierarchical classifications. Journal of Molecular Evolution, 18, 30-7.

Fitch, W M and Margoliash, E (1967). Construction of phylogenetic trees. Science, 155, 279-84.

Germot, A, Philippe, H and Le Guyader, H (1997). Evidence for loss of mitochondria in Microsporidia from a mitochondrial-type HSP70 in Nosema locustae. Molecular and Biochemical Parasitology, 87, 159-68.


Germot, A, Philippe, H and Le Guyader, H (1996). Presence of a mitochondrial-type 70-kDa heat shock protein in Trichomonas vaginalis suggests a very early mitochondrial endosymbiosis in eukaryotes. Proceedings of the National Academy of Sciences U S A, 93, 14614-7.

Germot, A and Philippe, H (1999). Critical analysis of eukaryotic phylogeny: a case study based on the HSP70 family. Journal of Eukaryotic Microbiology, 46, 116-24.

Gogarten, J P, Kibak, H, Dittrich, P, Taiz, L, Bowman, E J, Bowman, B J, Manolson, M F, Poole, R J, Date, T, Oshima, T, et al. (1989). Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes. Proceedings of the National Academy of Sciences U S A, 86, 6661-5.

Gonnet, G H, Cohen, M A and Benner, S A (1992). Exhaustive Matching of the Entire Protein Sequence Database. Science, 256, 1443-5.

Hannaert, V, Opperdoes, F R and Michels, P A M (1998). Comparison and evolutionary analysis of the glycosomal glyceraldehyde-3-phosphate dehydrogenase from different Kinetoplastida. Journal of Molecular Evolution, 47, 728-38.

Henikoff, S and Henikoff, J G (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences U S A, 89, 10915-19.

Keeling, P J and Doolittle, W F (1996). Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Molecular Biology and Evolution, 13, 1297-305.

Kimura, M (1983). The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.

Kohl, L, Drmota, T, Do Thi, C D, Callens, M, Van Beeumen, J, Opperdoes, F R and Michels, P A M (1996). Cloning and characterization of the NAD-linked glycerol-3-phosphate dehydrogenases of Trypanosoma brucei brucei and Leishmania mexicana mexicana and expression of the trypanosome enzyme in Escherichia coli. Molecular and Biochemical Parasitology, 76, 159-73.

Kyte, J and Doolittle, R F (1982). A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157, 105-32.

Needleman, S B and Wunsch, C D (1970). A general method applicable to the search of similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48, 443-53.

Page, R D (1996). TreeView: an application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences, 12, 357-8.

Patterson, D J and Sogin, M L (1992). Eukaryotic origins and protistan diversity. In The origin and evolution of the cell (Hartmann, H and Matsuno, K, eds) pp 13-47. World Scientific Publishing Co., River Edge, NJ.

Pearson, W R and Lipman, D J (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences U S A, 85, 2444-8.


Philippe, H and Adoutte, A (1998). The molecular phylogeny of Eukaryota: solid facts and uncertainties. In Evolutionary Relationships amongst Protozoa (The Systematics Association Special Volume Series 56, Coombs, G H, Vickerman, K, Sleigh, M A and Warren, A, eds) pp 25-56. Kluwer Academic Publications, Dordrecht, The Netherlands.

Philippe, H and Germot, A (2000). Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Molecular Biology and Evolution, 17, 830-4.

Rost, B (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology, 266, 525-39.

Saitou, N and Nei, M (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406-25.

Smith, T F and Waterman, M S (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195-7.

Stevens, J R and Gibson, W (1999). The molecular evolution of trypanosomes. Parasitology Today, 15 (11), 432-7.

Strimmer, K and von Haeseler, A (1997). Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proceedings of the National Academy of Sciences U S A, 94, 6815-9.

Suresh, S, Turley, S, Opperdoes, F R, Michels, P A M and Hol, W G J (2000). A potential target enzyme for trypanocidal drugs revealed by the crystal structure of NAD-dependent glycerol-3-phosphate dehydrogenase from Leishmania mexicana. Structure, 8, 541-52.

Swofford, D. PAUP (Phylogenetic Analysis Using Parsimony). Smithsonian Institution, Washington. Version 4.0beta, Sinauer Associates, Sunderland, Massachusetts.

Thompson, J D, Higgins, D G and Gibson, T J (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-80.

Wiemer, E A, Hannaert, V, van den IJssel, P R, Van Roy, J, Opperdoes, F R and Michels, P A M (1995). Molecular analysis of glyceraldehyde-3-phosphate dehydrogenase in Trypanoplasma borelli: an evolutionary scenario of subcellular compartmentation in kinetoplastida. Journal of Molecular Evolution, 40, 443-54.

Woese, C R and Fox, G E (1977). Phylogenetic structure of the prokaryotic domain. The primary kingdoms. Proceedings of the National Academy of Sciences U S A, 74, 5088-90.