functional divergence and genome evolution of vertebrate protein
TRANSCRIPT
Iowa State UniversityDigital Repository @ Iowa State University
Retrospective Theses and Dissertations
2003
Functional divergence and genome evolution ofvertebrate protein kinases (kinome)Jianying GuIowa State University
Follow this and additional works at: http://lib.dr.iastate.edu/rtd
Part of the Computational Biology Commons, Genetics Commons, and the Molecular BiologyCommons
This Dissertation is brought to you for free and open access by Digital Repository @ Iowa State University. It has been accepted for inclusion inRetrospective Theses and Dissertations by an authorized administrator of Digital Repository @ Iowa State University. For more information, pleasecontact [email protected].
Recommended CitationGu, Jianying, "Functional divergence and genome evolution of vertebrate protein kinases (kinome)" (2003). Retrospective Theses andDissertations. Paper 1434.
Functional divergence and genome evolution of vertebrate
protein kinases (kinome)
by
Jianying Gu
A dissertation submitted to the graduate faculty
in partial fulfillment of the requirement for the degree of
IX)CTOR OF PHILOSOPHY
Major: Bioinformatics and Computational Biology
Program of Study Committee: Xun Gu, Co-major Professor
Dan Nettleton, Co-major Professor Linda Ambrosio
Gavin Naylor Daniel Voytas
Iowa State University
Ames, Iowa
2003
Copyright © Jianying Gu, 2003. All rights reserved.
UMI Number: 3105077
UMI UMI Microform 3105077
Copyright 2003 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 300 North Zeeb Road
P.O. Box 1346 Ann Arbor, Ml 48106-1346
ii
Graduate College
Iowa State University
This is to certify that the doctoral dissertation of
Jianying Gu
has met the dissertation requirements of Iowa State University
Co-major rofessor
Co-major Professor
For the Major Program
Signature was redacted for privacy.
Signature was redacted for privacy.
Signature was redacted for privacy.
iii
TABLE OF CONTENTS
CHAPTER I: GENERAL INTRODUCTION 1
Introduction 1
Dissertation Organization 3
References 5
CHAPTER H: NATURAL HISTORY AND FUNCTIONAL DIVERGENCE IN
PROTEIN TYROSINE KINASES 8
Abstract 8
Introduction 8
Methods 9
Results and Discussion 11
Conclusions 23
Acknowledgement 23
References 23
CHAPTER HI: EVOLUTIONARY PERSPECTIVES FOR FUNCTIONAL
DIVERGENCE OF JAK PROTEIN KINASE DOMAINS AND TISSUE-
SPECIFIC GENES 27
Abstract 27
Introduction 28
Methods 29
Results 31
Discussion 40
Acknowledgements 43
References 43
CHAPTER IV: EVOLUTIONARY ANALYSIS OF FUNCTIONAL
DIVERGENCE IN TGF-g SIGNALING PATHWAY 47
Abstract 47
iv
Introduction 47
Methods 48
Results and Discussion 49
Conclusions 55
Acknowledgement 56
References 56
CHAPTER V: EVOLUTIONARY PATTERN OF PROTEIN KINASE
EXPANSION IN HUMAN GENOME 59
Abstract 59
Introduction 59
Data and Methods 60
Results 64
Discussion 74
Acknowledgements 75
References 75
CHAPTER VI: GENERAL CONCLUSIONS 78
Summary of Findings 78
Ongoing and Future Work 80
ACKNOWLEDGEMENTS 81
VITA 82
1
CHAPTER I: GENERAL INTRODUCTION
Introduction
Over the past two decades, advances in biology have increased exponentially, leading
to an enormous wealth of genomic (the quantitative study of genes, regulatory and noncoding
sequences), transcriptomic (RNA and gene expression), and proteomic (protein expression)
information. Novel ways of using this information are emerging, with the hope that many
unsolved problems in biology and medicine can be more fully explored. Evidently, an
integrative approach that combines computational, quantitative, and experimental analysis
brings new knowledge, novel methods, and innovative technologies to engender a better
understanding of complex biological systems and processes. Many of these post-genomic
attempts are centered at elucidating the function of gene products and functional interacting
networks by unveiling the evolutionary forces that have shaped the network organization.
Gene/domain duplication, functional divergence, site-specific rate shift, and age
distribution of vertebrate gene families
A central component in vertebrate cellular and developmental networks is gene
families, in which a group of genes with common ancestry display sequence similarity but
exert divergent functions (Lundin 1993; Hughes 1994; Spring 1997; Lynch and Conery
2000). It has long been thought that duplication of individual genes, chromosomal segments,
or entire genomes serves as a primary resource for the function novelties in a vast number of
gene families such as HOX (Ohno, 1970; Sidow 1996; Li et al. 2001). Domain
duplication/shuffling has also been postulated to play additional role in increasing the
functional complexity in multi-domain gene-family networks (Doolittle 1995; Henikoff et al.
1997). Yet still under debate, the functional divergence observed in gene families largely
relies on the diverging fates of duplicated genes: (1) non-functionalization, in which one
copy retains the original function and the other copy accumulates deleterious mutations and
becomes a functionless pseudogene; (2) neo-functionalization, in which one of the duplicates
retains its original function while the other is under relaxed selection constraint and may
acquire novel function (Li, 1983); (3) Alternatively, sub-functionalization, in which each of
2
duplicates retains complementary part of the ancestral functions (Force et al. 1999; Lynch
and Force 2000).
Provided the importance of gene/domain duplication in gene family innovation, it is
intriguing to quantitatively trace the changes of evolutionary constraints after a duplication
event. There is a general belief that not all, but only a few amino acid residues contribute to
functional divergence after gene/domain duplication (Golding and Dean 1998). How to
identify these residues from the background evolutionary noise is a challenge. Initially,
sequence similarity score was used as an index for the level of functional divergence, with a
caveat of incompetence to detect the true functional divergence, because most amino acid
changes are neutral or nearly neutral which are not directly related to functionality (Eisen
1998). Instead of sequence similarity, pattern of amino acid replacement, including changes
in the rate of replacement, appears to be a better indicator for changes of functional/selective
constraint, which is likely to be inversely related to the evolutionary rate of amino acid
replacement (Li and Graur 1991). Apparently, algorithms based on empirical rules may not
be sufficient to capture the complex stochastic nature of evolution (Fetrow and Skolnick
1998; Montelione and Anderson 1999; Irving et al. 2001). Hence, a statistical framework
that maps the association between inferred site-specific rate shifts and changes in function
after gene/domain duplication is important (Gu 1999; Gaucher et al. 2001; Gu 2001). In
addition to the statistical modeling of the functional divergence after gene/domain
duplication, inferring the age-distribution of gene families that arose from duplication events
will provide a deeper insight in the evolution of fine-tuned hierarchical cell network. The
observation of up to four paralogs of many Drosophila genes in the human genome provides
indirect evidence for the classical (two-round) hypothesis of vertebrate genome duplication
(Ohno 1970). However, over the decades, the distinct modes of large-scale genome
duplication vs. continuous small-scale gene duplication are hotly debated (Martin 2001;
Friedman and Hughes 2001; Spring 1997; Wolfe 2001; McLysaght et al. 2002; Gu et al.
2002). Our recent study of vertebrate gene families (Gu et al. 2002) revealed that both large-
and small-scale gene duplications have significant contribution to build the current hierarchy
of the human proteome. Also recent segmental duplications have been shown to be involved
in building up the extreme dynamism and genomic fluidity over very short periods of
3
evolutionary time (Eichler 2001). This series of studies have shed significant light on
complicated mechanisms underlying the current diversity of vertebrate genomes.
Kinome: key machinery in cell network
Among thousands of gene families, kinase superfamiiies that catalyze the reversible
protein phosphorylation play a central role in regulating basic functions including DNA
replication, cell cycle control, gene transcription, protein translation, and metabolism.
Protein phosphorylation is also required for more advanced functions in higher eukaryotes
including cell, organ, and limb differentiation, cell survival, synaptic transmission, cell-
substratum and cell-cell communication, and to mediate complex interactions with the
external environment. Moreover, aberrant protein phosphorylation is commonly related with
cancer and other human diseases. The completion of human genome sequence provides us
an opportunity to decipher the molecular nature of kinase-mediated signal transduction
machinery (Venter et al. 2001). A complement of 518 kinases, defined as kinome (Manning
et al. 2002), was identified in the human genome, constituting about 1.7% of all human
genes. All of the known bona fide protein kinases share a conserved catalytic kinase domain
of -270 amino acids. Based on their characteristic kinase domain, eukaryotic Protein
Kinases (ePKs) can be divided into two distinct families, protein tyrosine kinases (PTKs) and
protein serine/threonine kinases (PSKs). Tyrosine protein kinase, one major animal-specific
subclass, is likely monophyletic, derived from a precursor of protein serine-threonine kinase
during the course of animal evolution, while the major types of PSKs had existed in the early
stage of eukaryotes (Hanks et al. 1988; Hanks and Hunter, 1995; Hunter 1995). In addition
to ePKs, a group of atypical Protein Kinases (aPKs) possess biochemical kinase activity,
despite the lack of sequence similarity to the ePK domain. Interestingly, the structural
similarity of aPKs to the authentic ePK domains may account for their enzymatic activity
(Yamaguchi et al. 2001).
Dissertation Organization
The objective of this study is to explore the functional divergence and evolutionary
pattern in the vertebrate kinome, using a combinatorial bioinformatic and evolutionary
approach. This dissertation is composed of three sections: a general introduction, four
4
chapters of research reports, each of which is in journal manuscript format, and a general
conclusion. The core of the five interrelated reports includes:
Chapter II summarizes a comprehensive phylogenetic analysis on major PTK gene
families. We found that the expanding of each PTK family was likely caused by gene or
genome duplication event(s) that occurred before the emergence of bony fishes but after the
vertebrate-amphioxus split. Our further investigation on the evolutionary pattern revealed
that site-specific shifted evolutionary rate (altered functional constraint) is a common pattern
in PTK gene family evolution (Gu 1999).
Chapter III is a case study in Jak kinase family. The study was focused on the
statistical modeling and detection of functional divergence among duplicate genes. Our
result showed that shifted selective constraints (or shifted evolutionary rates) between
homologous JH1 and JH2 domains arc statistically significant. Moreover, we found that
after the (first) gene duplication, site-specific rate shifts between member genes Jak2/Jak3
and Jakl/Tyk are significant, indicating a consequence of functional divergence among these
member genes.
Chapter IV extends the study to kinase-mediate network: TGF-f3 signal transduction
pathway which is triggered by receptor serine/threonine kinases at the cell surface. We have
statistically tested the functional divergence in major pathway components, targeted the
critical amino acid residues responsible for functional divergence. Interestingly, altered
functional constraints appeared to be a common pattern in the evolution of TGF-(3 pathway.
Moreover, we have found the correlation between structural divergence and functional
divergence in the extracellular ligand-binding domain of Type II receptor kinases.
Chapter V extends the study to the impact of evolutionary forces that drive the
vertebrate kinome organization. The important evolutionary events including gene
duplication, domain duplication/shuffling, gene loss, and gene nonfunctionalization, which
might have contribution to the existing kinome, have been identified. The age distribution of
human kinome shows that the major kinase mediated signaling is generated by domain
shuffling/duplications in the early evolution of eukaryotes, and subsequent large scale
gen(om)e duplication is involved in the emergence of tissue-specific or developmental stage-
specific kinase isoforms in the early stage of vertebrate. Our study of kinase pseudogenes
5
further demonstrated the fates of duplicate gene: gain new functionality or loss original
function to turn into pseudogenes.
References:
Doolittle RF. (1995) The origins and evolution of eukaryotic proteins. fMor Thrnj j'oc
lowfggfo/ Scz. 349:235-240.
Eichler EE. (2001) Recent duplication, domain accretion and the dynamic mutation of the
human genome. Trends Genet. 17:661-669.
Eisen JA. (1998) Phylogenomics: improving functional predictions for uncharacterized
genes by evolutionary analysis. Genome Res. 8:163-167.
Fetrow JS, Skolnick J. (1998) Method for prediction of protein function from sequence using
the sequencc-to-structure-to-function paradigm with application to lutaredoxins/thioredoxins
and T1 ribonucleases. J Mol Biol. 281:949-968.
Force A, Lynch M, Pickett FB, A mores A, Van YL, Postlethwait J. (1999) Preservation of
duplicate genes by complementary, degenerative mutations. Genetics. 151:1531-1545.
Friedman R, Hughes AL. (2001) Gene duplication and the structure of eukaryotic genomes.
Genome Res. 11:373-381.
Golding GB, Dean AM. (1998) The structural basis of molecular adaptation. Mol Biol Evol.
15:355-369.
Gaucher EA, Miyamoto MM, Benner SA. (2001) Function-structure analysis of proteins
using covarion-based evolutionary approaches: Elongation factors. Proc Natl Acad Sci US A.
98:548-552.
Gu X. (1999) Statistical methods for testing functional divergence after gene duplication.
Mol Biol Evol. 16:1664-1674.
Gu X. (2001) Maximum-likelihood approach for gene family evolution under functional
divergence. Mol Biol £Vo/. 18:453-464.
Gu X, Wang Y, Gu J. (2002) Age distribution of human gene families shows significant roles
of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 31:205-209.
Hanks SK, Quinn AM, Hunter T. (1988) The protein kinase family: conserved features and
deduced phytogeny of the catalytic domains. Science. 241:42-52.
6
Hanks SK, Hunter T. (1995) Protein kinases 6. The eukaryotic protein kinase superfamily:
kinase (catalytic) domain structure and classification. FASEB J. 9:576-596.
Henikoff S, Greene EA, Pietrokovski S, Bork P, Alt wood TK, Hood L. (1997) Gene families:
the taxonomy of protein paralogs and chimeras. Science. 278:609-614.
Hughes AL. (1994) The evolution of functionally novel proteins after gene duplication.
Proc R Soc Lond B Biol Sci. 256:119-124.
Hunter T. (1995) Protein kinases and phosphatases: the yin and yang of protein
phosphorylation and signaling. Cell. 80:225-236.
Irving J A, Whisstock JC, Lesk AM. (2001) Protein structural alignments and functional
genomics. Profema. 42:378-382.
Li WH. (1983) Evolution of duplicate genes and pseudogenes. In:Nei M and Keohn RK (eds)
Evolution of Genes and Proteins. Sinauer Associ ates, Sunderland, M A, pp. 14-37
Li WH, Graur D. (1991) Fundamentals of Molecular Evolution. Sinauer, Sunderland, MA,
USA.
Li WH, Gu Z, Wang H, Nekrutenko A. (2001) Evolutionary analyses of the human genome.
Nature. 409:847-849.
Lundin LG. (1993) Evolution of the vertebrate genome as reflected in paralogous
chromosomal regions in man and the house mouse. Genomics. 16:1-19.
Lynch M, Conery JS. (2000) The evolutionary fate and consequences of duplicate genes.
Sconce. 290:1151-1155.
Lynch M, Force A. (2000) The probability of duplicate gene preservation by
subfunctionalization. Genetics. 154:459-473.
Manning G, Why te DB, Martinez R, Hunter T, Sudarsanam S. (2002) The protein kinase
complement of the human genome. Science. 298:1912-1934.
Martin A. (2001) Is tetralogy true? Lack of support for the "one-to-four rule". Mol Biol Evol.
18:89-93.
McLysaght A, Hokamp K, Wolfe KH. (2002) Extensive genomic duplication during early
chordate evolution. Nat Genet. 31:200-204.
Montelione GT, Anderson S. (1999) Structural genomics: keystone for a Human Proteome
Project. Nat Struct Biol. 6:11-12.
Ohno S (1970) Evolution by gene duplication. Springer, New York.
7
Sidow A. (1996) Gen(om)e duplications in the evolution of early vertebrates. Curr Opin
Genef Dev. 6:715-722.
Spring J. (1997) Vertebrate evolution by interspecific hybridisation—are we polyploid?
FE&SIgff. 400:2-8.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M,
Evans CA, Holt RA, et al. (2001) The sequence of the human genome. Science. 291:1304-
1351.
Yamaguchi H, Matsushita M, Nairn AC, Kuriyan J. (2001) Crystal structure of the atypical
protein kinase domain of a TRP channel with phosphotransferase activity. Mol Cell. 7:1047-
1057.
Wolfe KH. (2001) Yesterday's polyploids and the mystery of diploidization.
Mzf ikv Gengf. 2:333-341.
8
CHAPTER H: NATURAL HISTORY AND FUNCTIONAL DIVERGENCE IN
PROTEIN TYROSINE KINASES
A paper accepted by Gene1
Jianying Gu 2, Xun Gu 2'3
Abstract Cellular signaling is the core of many biological processes including growth,
differentiation, adhesion, motility and apoptosis. The protein tyrosine kinase (PTK)
supergene family is the key mediator in cellular signaling in metazoans, directly associated
with a variety of human diseases. All PTKs contain a highly conserved catalytic kinase
domain, in spite of variable multi-domain structures. Within each PTK gene family, members
exhibit functional divergence in substrate-specificity or temporal/tissue-specific expression,
although their primary function is conserved. After conducting phylogenetic analysis on
major PTK gene families, we found that the expanding of each PTK family was likely caused
by gene or genome duplication event(s) that occurred before the emergence of teleosts but
after the vertebrate-amphioxus split. We further investigated the evolutionary pattern of
functional divergence after gene duplication in those gene families. Our results show that
site-specific shifted evolutionary rate (altered functional constraint) is a common pattern in
PTK gene family evolution.
1. Introduction Protein kinases catalyze the transfer of a phosphate group from its donor to an
acceptor. They can be divided into two major groups: protein tyrosine kinases (PTKs) and
serine/threonine kinases (PSKs) (Hubbard and Till, 2000). Since PTKs play important roles
in regulating intracellular signaling pathways and are responsible for the development of
many cancers, they have served as drug targets for many different disease therapies (Blume-
1 Accepted by Gene, 2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University. 3 Author for correspondence
9
Jensen and Hunter, 2001; Sridhar et al., 2000). Therefore, there are general interests to
explore how much functional-related information can be obtained from molecular
evolutionary analysis.
PTKs can be further divided into receptor tyrosine kinases (RTKs) and non-receptor
(cytosolic) tyrosine kinases. Typically, a cascade of signals is initiated from an RTK after
ligand binding, which leads to receptor dimerization, kinase activation, and
autophosphorylation of tyrosine residues. These phosphorylated tyrosines then serve as
docking sites for recruiting downstream signaling molecules, including non-receptor tyrosine
kinases that can trigger a variety of cell responses. The extensive cross talk between PTK-
triggered pathways increases the complexity of the signaling process (see Schlessinger, 2000
for a review).
The human genome project has opened new opportunity in the study of PTK-
mediated signaling. The first draft of complete human sequence predicted a total of 106
PTKs, 58 are receptor kinases and 48 are non-receptor kinases (Venter et al., 2001). Despite
that different PTK subfamilies have different domain structures and functions, all PTKs share
a conserved kinase domain that is responsible for their catalytic functions and structures. To
unveil the intrinsic functional diversity among PTKs, natural history of gene family
expansion, and the underlying evolutionary mechanism, we performed extensive
phylogenetic-based analysis on 27 PTK families. Our study may provide some new insights
for understanding the complexity of signal-transduction networks in the animal kingdom.
2. Materials and Methods 2.7 cofkcdof*
The full-length sequences of vertebrate PTK gene families used in this study were
obtained from the Hovergen database (http.7/pbil.univ-lvon 1 .fr/). with reference from
literatures (Robinson et al., 2000; Blume-Jensen and Hunter, 2001). Exhaustive PSI-BLAST
search against Non-redundant Protein databases at NCBI was performed to find homologous
sequences in invertebrates such as D. Melanogaster and C. elegans, which were used as
outgroups. The complete alignment of kinase domain of PTKs was obtained from the Pfam
(http://pfam.wustl.edu/). The chromosome synteny of individual PTKs in human genome
10
was obtained by the map viewer at NCBI (http://www.ncbi.nlm.nih.gov/cgi-
bin/Entrez/map search/). The final data set including the following PTK families:
(I) Receptor tyrosine kinases: Axl, discoidin domain receptor (DDR), epidermal
growth factor receptor (EGFR), ephrin receptor (EphR), Fibroblast growth factor
receptor (FGFR), hepatocyte growth factor receptor (HGFR), insulin receptor
(InsR), Leukocyte tyrosine kinase (Ltk/Alk), muscle-specific kinase (Musk),
platelet-derived growth factor receptor (PDGFR), Ret, receptor orphan (Ror),
Ros, Ryk, Tie, tropomyosin-related kinase (Trk), and vascular endothelial growth
factor receptor (VEGFR);
(II) Non-receptor tyrosine kinases: Abelson tyrosine kinase (Abl), acetate kinase
(Ack), C-terminal src kinase (Csk), focal adhesion kinase (Fak), fps/fes related
kinase (Fer/Fes), fyn-related kinase (Frk), Janus kinase (Jak), Src, Spleen tyrosine
kinase (Syk), and Tec.
2.2 AfwAipk a/wf owxfyg»
The multiple alignment of each gene family was obtained by ClustalX program
(http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html), followed by manual adjustment.
Local alignments were reconciled with the kinase domain defined in the Pfam database by
Hidden Markov Models (HMMs). Phylogenetic trees were inferred by the Neighbor-joining
(NJ) method using MEGA2.0 (http://megasoftware.net/) with Poisson distance. The
robustness of tree topology to the tree making method was examined by PAUP4.0
(http://paup.csit.fsu.edu/) and PHYLIP (http://evolution.genetics.washington.edu
/phvlip.html/). Bootstrap analysis was carried out to assess support for individual node.
2.3 Tïnw q/geme «fMpKcafio»
A linearized neighbor-joining tree (Takezaki et al., 1995) was used to convert the
(average) Poisson distance of protein sequences to the molecular time scale by using
software package MEGA2.0. In this study, the divergence time of primate-rodent (80 million
year ago, my a), mammal-bird (310 my a), mammal-amphibian (350 mya), and tetrapod-
teleost (430 mya) and vertebrate-Drosophila split (830 mya) were employed as calibrations
(Kumar and Hedges, 1998; Gu, 1998; Wang and Gu, 2000).
11
2.4 Type / /zmcfioMa/ dfvefyence amifyiw
Type I functional divergence refers to the evolutionary process that results in altered
functional constraints (or site-specific evolutionary rate shift) between two duplicate genes,
regardless of the underlying evolutionary mechanisms (Gu, 1999). A statistical framework
modeling the functional divergence was used to estimate the coefficient of functional
divergence (0), an indicator of the level of Type I functional divergence of two homologous
gene clusters (Gu, 1999; Wang and Gu, 2001; Gu et al., 2002a). Rejection of null hypothesis
Ho: 0 = 0 provides statistical evidence for shifts in inferred evolutionary rate (or altered
functional constraints) in duplicates.
Moreover, the sites with critical contribution to the overall functional divergence can
be predicted by the posterior analysis. Let Q (k) = P% (St|X) be the posterior probability of a
site k being Si (functional divergence related status) when the amino acid configuration (X)
is observed. Since the alternative status So (functional divergence unrelated status), with
posterior probability P% (Sq|X) = 1- P% (S,|X), means no altered functional constraint, the
predicted residues are only meaningful when Q (k) > 0.5, in which case the posterior odds
ratio R (S,/So) = P (S,|X)/ P (Sq|X) > I. A more stringent cut-off may be Q (k) > 0.67, or
R(S,/So) > 2. The computational software DIVERGE for Type I functional divergence
analysis and prediction is available at http://xgul.zool.iastate.edu/.
3. Results and Discussions 3.1 The natural history of protein tyrosine kinase gene families
PTKs form a large but diverse superfamily with the characteristic catalytic domain
conserved across different gene families. The classification of these gene families is based
upon ligand specificity, biological function and primary structure. Within each gene family,
the member genes are conserved in the sequence as well as the domain structures, but distinct
in tissue or developmental-stage or ligand specificity.
Gene (domain) duplication plays an important role in eukaryote evolution by
providing the primary source for the evolutionary novelty (Ohno, 1970). Recently we (Gu et
al., 2002b) have demonstrated that both large-scale (genome-wide) and small-scale
duplications have significant contributions on the presence of numerous multigene families.
12
The phylogenetic tree based on the alignment of kinase domain sequences suggested that the
whole set of PTKs was generated from a series of domain duplications in the early stage of
animal evolution, as marked by the "ancient component" in Gu et al. (2002b). These gene
(domain) duplications have given rise to the current complex network of signal-transducting
in animals (Hanks et al., 1995; Suga et al., 1997).
Furthermore, phylogenetic tree for each gene family in the PTK superfamily has
revealed the impact of genome-wide duplication(s) in the early stage of vertebrates (Gu et al.,
2002b). Moreover, we found no e vidence for the contribution of recently gene duplication
(around or after mammalian radiation) to the origin of tissue-specific PTKs. Because of
space limitation, in the following we only discuss two typical cases.
EphR gene family
To date, 14 highly related ephrin receptors have been identified in vertebrates. They
can be divided into two classes according to the ligand-binding preference: EphA receptors
(EphAl-AS) recognize glycosyl-phosphatidyl-inositol (GPI)-anchored ephrin-A ligands (Al-
A5), whereas EphB receptors (EphBl-B6) preferentially bind to ephrin-B ligands (B1-B3)
that span the membrane via a transmembrane domain (Gale et al., 1996). One exception is
EphA4, a receptor that can bind and respond to B as well as A-class ephrins.
The phylogenetic tree of ephrin receptors was inferred based on the whole length
amino acid sequences, while its root was tentatively determined by the phytogeny of PTK
superfamily. It shows that the A1/A2 ephrin receptor group first branched-off, probably after
the split between vertebrates and Drosophila (Fig.l). The subsequent gene duplication has
given rise to the two distinct classes of ephrin receptors; the rest of Class A and all the Class
B ephrin receptors fall into separate clades with 93% bootstrap value. In each class more
tissue-specific isoforms were generated by substantial gene duplications. Overall, the
evolution and diversity of ephrin receptors was driven by both small-scale and large-scale
gene duplications in the early stage of vertebrates. Of course, more vertebrate homologous
genes (e.g., fishes) are needed to determine the phylogenetic interval of some duplication
events. Nevertheless, molecular-clock analysis (Wang and Gu, 2000; Gu et al., 2002b)
generally supports our notion (Fig.l).
13
X'
s,
93
X 67
N 53
99j _ EPA5 HUMAN H- EPA5 RAT si- EPA5 MOUSE
ffd
EPHA5 B»A5 CHICKEN
B»A3 HUMAN B»A3 MOUSE
98 i— EPA3 RAT • EPA3 CHICKEN
r EPA6 HUMAN HtrEPAB MOUSE
94L EPA6 RAT
B4IA3
PHA6
BPHA7 56, EPA7 HUMAN
• EPA7 CHICKEN • EPA7 RAT •BW MOUSE • EPA3 BRARE
r- EPA8 HUMAN iôôi EPA8 MOUSE
5»HA8 EPA4 HUMAN B»A4 MOUSE
EPA4 CHICKBI EPA4 FROG
- EPA4 ZEBRAFiSH
EPH A4
100 r- EPB3 HUMAN —| r EPB3 RAT
97LB=B3 MOUSE — EPB3 CHICKBI
EPB3FROG EPB3 ZEBRAFISH _
EPB4 ZEBRA FISH2 B=B4 ZEBRA F1SH1
EPHB3
82
95 82
100
95 r I
É' 98 f
1É
— BPB4 FROG r EPB4 MOUSE 1— EPB4 HUMAN 100
EPB1 HUMAN B»B1 RAT EPB1 CHICKBI EPB1 FROG
EPB2 HUMAN EPB2 MOUSE EPB2 QUAIL EPK CHICKBI
EPB2FROG
B>HB4
EPHB1
EPHB2
- EPB6 HUMAN -B>B6 MOUSE |B»HB8
H
EPB5 CHICK i, EPA2 HUMAN
EPA2 MOUSE EPA2 FROG
BWA2
• EPA1 HUMAN " •EPA1 MOUSE
• C.elegans • D.melanogaater
• EPA1 CHICKBI B»HA1
Fig. 1. Phylogenetic tree of the Eph receptor gene family inferred by the neighbor joining
method with poisson correction. Eph receptor family contains 14 highly related member
genes which can be divided into two classes Eph A receptors and EphB receptors with
bootstrap value equal to 93%. Several duplication time points were indicated: (A) 403.1
mya; (B) 481.7 mya; (C) 526.0 mya; (D) 435.3 mya; (E) 477.7 mya; (F) 322.5 mya; (G)
596.6 mya; (H) 747.7 mya; (I) 784.0 mya.
14
If the rooting of EphR tree is largely correct, the evolution of ligand-binding
preferences can be inferred from the phylogenetic analysis. That is, the ancestor of EphR
may bind only ephrin-A ligands, while the function related to ephrin B ligands binding was
derived relatively recently. This pattern is consistent with the classic theory of function
innovation after gene duplication: one copy kept the original function (ephrin-A ligands) and
the other one acquired novel function (ephrin-B ligands) through accumulation of amino acid
change (Li, 1983). Interestingly, a similar pattern has also been observed within Class B
ephrin receptors: Eph B2 and Eph B3 have been demonstrated to possess partial functional
redundancy in midline guidance of nerve system, implying their common feature in
preserving ancestral functions of Class B receptors (Cowan et al., 2000). Noticeably Eph B6
shows remarkably long branch length. One possibility is that it has undergone loss-of-
function divergence due to the loss of crucial sites for tyrosine kinase activity and the relaxed
functional constraints (Gurniak et al., 1996).
Src ggf#/wmfy
Src cytoplasmic tyrosine kinase family plays crucial roles in a variety of cellular
processes, such as cell cycle control, cell adhesion, cell motility, cell proliferation and cell
differentiation (Thomas et al., 1997). Extensive studies indicate that the complexity of
functional roles of Src kinases mainly comes from their capability of communicating with a
large number of upstream receptors and downstream effectors. The NJ tree shows that
member genes of the Src family can be divided into two major distinct groups: (A) Src, Yes,
Fyn, Pgr and Yrk; and (B) Lyn, Hck, Lck and Blk (Fig. 2). The three nearest neighbors to
the Src family, tyrosine kinases Frk, Brk and Srm, were used to root the NJ tree.
There exists distinct differences between group A and B at gene expression level.
The three group A member genes (Src/Fyn/Yes) are ubiquitously expressed; whereas all the
group B member genes (Lck/Hck/Lyn/Blk) and one group A member gene (Fgr) have more
tissue-restricted expression, mainly in hematopoietic cells. Research also showed significant
structural difference between group A and B Src genes. The resolved crystal structures of
human Src (Group A) and human Hck (Group B) reveal the significant geometrical
difference between these two groups of proteins (Williams et al., 1998). Furthermore, the
observation of swapping the SH2 and SH3 domains of Lck (group B) with the corresponding
15
HUM AN
9 9 s r c42
FLY HUMAN
M O U S E HUM AN
M OUSE
HUMAN RAT M OUSE CHICKEN FROG
X.x lph ld lum (F ISH) HUM AN
DOG M OUSE HICKEN
FROG X. he l le r ! (F ISH)
HUM AN M OUSE C H I C K E N
HUMAN _P RAT
991- FROG X. he l le r l (F ISH) CHICKEN
HUMAN M OUSE RAT
9 9 i H U M A N M O U S E r H U M A N
SS_n HUMAN - M OUSE CHICKEN
FROG la loo HUMAN
OUSE 9 9 ' RAT
9 , HUMAN a I HUMAN b
9 , MOUSE a M OUSE b RAT a RAT b FROG
src64B
Src (20ql2-ql3)
Yes (18pll.3)
Fyn (6q21)
Yrk (?)
Fgr (Ip36.2-p36.1)
BIk (8p23-p22)
Lck/Tkl (Ip35-p34.3)
Hck (20qllql2)
Lyn (8ql3-qter)
Frk (?)
Brk (?)
Srm (?) FLY shark
0.10
Fig. 2. Phylogenetic structure of the Src gene family. The Src family can be divided into
two major distinct groups: group A including Src, Yes, Fyn, Fgr and Yrk; group B including
Lyn, Hck, Lck and BIk. Kinases Frk, Brk and Srm are used as outgroups to root the NJ tree.
Identified chromosomal locations of member genes are listed according to human genome
map.
16
domains of Src (group A) abolished the regulation of Src activities (Gonfloni et al., 1997)
confirmed the non-replaceable functions. It is noteworthy that the N-terminus SH4 domain is
very short and ubiquitously conserved for all Src member genes, indicating its common
functional roles in the long run of evolution.
3.2 Evidence of gene duplication in PTK
The recent mounting evidence suggested that, in addition to continuous flux of small-
scale gene duplications, there was at least one genome duplication event that occurred during
early chordate evolution (Gu et al., 2002b; McLysaght et al., 2002). It is of particular interest
to examine whether PTK gene families were generated through such gen(om)e
duplication(s).
Age distribution of gene duplication in PTK families
Using the same approach as described by Gu et al. (2002b), we have dated the time of
gene duplication events that gave rise to the PTK families. Among a total of 56 investigated
duplication events, 38 occurred during the short time window 430-750 Mya which falls
between the emergence of teleosts and the split of vertebrate- amph i ox us (Fig. 3).
The estimated time of gene duplications In PTK families
301- 351- 401- 451- 501- 551- 601- 651- 701- 751- 801- 851- 901-350 400 450 500 550 600 650 700 750 800 850 900 950
time (mya)
Fig. 3. Age distribution of PTK gene duplication events. The molecular time scale is
measured as Myr ago. Each bin for the histogram is 50 Myr.
17
Gu et al. (2002b) showed that both large- and small- scale duplications contributed to
the current hierarchy of vertebrate genome: While a continuous mode of gene duplication is
exhibited since the vertebrate-fruitfly split, a rapid increase in the number of paralogous
genes was observed during the time period 430-750 Myr ago. The age distribution of PTK
genes suggests the big-band explosion in the early vertebrate stage is a dominant force for
shaping the diversity of functional specificity, where a continuous mode (constant rate of
gene duplication) plays secondary role (chi-squared test with p<O.OOO5). Using a different
approach, a similar conclusion was obtained by Miyata and his coworkers (Iwabe et al.,
1996: see Miyata and Suga, 2001 for review).
Chromosome distribution of paralogous genes
Chromosome mapping of paralogous genes may provide useful information for the
occurrence of genome duplication. We are aware that due to the frequent reshuffling and
natural selection after chromosome duplications, it is difficult to reconstruct the real history
of chromosome duplication events. Nevertheless, the present chromosome synteny still
preserves some trace of the physical re-organization after gene duplication. For instance, a
number of paralogous chromosomal regions (or paralogons) have been successfully
identified recently (Popovici et al., 2001; McLysaght et al., 2002). Such paralogon in PTKs
was reported as well: all the member genes of Fibroblast growth factor receptor (FGFR)
family and seven other families were located in the same vicinity of chromosomal regions:
4p l6, 5q33-35, 8p 12-21, and 10q21 -26 (Pebusque et al., 1998), indicating the possible
occurrence of block duplication.
PDGFR/VEGFR gene family
The PDGFR/VEGFR family comprises by eight member genes: Pdgfr-a and Pdgfr-(3,
colony-stimulating factor 1 receptor (CSF1-R), stem cell factor receptor (SCFR, commonly
known as c-Kit), Fit-3 (growth factor receptor for early hematopoietic progenitors), Vegfr-1,
Vegfr-2, and Vegfr-3. This gene family seems to be vertebrate-specific since homologous
sequence has not been found in invertebrates such as Drosophila and C. elegans. The
analysis of phylogenetic hierarchy along with chromosome synteny provides striking
evidence for the occurrence of the duplication of chromosome blocks during PDGFR family
evolution. These eight member genes appear to have been arisen from a common ancestor,
18
based on their sequence similarity and their common structural features (e.g. high hydrophilic
insertion into the catalytic kinase domain). Their di versification may be generated by a
couple of gene duplication events. As shown in Figure 4, the first major duplication event
took place in the early stage of vertebrates, at least prior to the divergence between tetrapods
and teleosts. It involves the tetraploidization of three chromosome regions 13ql 2, 4ql2, and
5q33-35. Consequently, one of the copies evolves to the present VEGFR isoforms 1, 2, and
3. Analogously, Flt3 may represent the other descendant in the region of 13ql2. In contrast,
complex changes have occurred in the close proximities of 4ql2 (Kit and Pdgfr-a and
counterparts) and 5q33-35 (CSF1R and Pdgfr-(3) due to another round of gene duplication
(Rousset et al., 1995). It is striking that the tandemly linked chromosome synteny is largely
preserved over more than 400 million years. One plausible explanation is the crucial
functional role of this group of receptor kinases decreases the possibility of radical
chromosomal re arrangement within the neighborhood.
Src /o/mZy
In the Src gene family, the chromosome synteny also shows an interesting pattern.
As shown in Figure 2, Src and Hck may be the descendant of a common ancestor located
nearby human chromosome 20q, and Fgr and Lck may have arisen from the common
ancestor in the proximity of chromosome lp35-36. Though the entire puzzle of Src
evolution remains unresolved, those observations lead us to hypothesize gene duplication
(chromosome tetraploidization) as an important evolutionary mechanism underlying the
organization of present-day Src genes.
3.3 Site-specific evolutionary rate shift in PTK gene families
We have conducted pair-wise functional divergence analysis between paralogous
genes for each gene families. Gene clusters with less than four sequences were excluded
from analysis due to insufficient information. Table 1 shows the coefficient of functional
divergence (0) of pair-wise comparisons of PTK superfamily. Forty-two (42) out of 43
comparisons showed 0 > 0 with p < 0.05, suggesting that site-specific rate shift after gene
duplication is a common phenomenon in PTKs evolution.
We have noticed an association between predicted functional divergence (via site-
specific rate-shift) and observed (experimental or phenotypic) trait. For example, in the Src
19
Chromosome synteny
99
99
67
99
99
99
99
99
23
99
HUMAN 99 , BOVINE
!9_T i GOAT I— PIG
CAT DOG
99 1 DOG HORSE
i MOUSE 99 RAT
POSSUM CHICKEN FROG
•ZEBRAFISH — HUMAN - CAT
r - M OUSE 99 L RAT
99 0 99 CE
99
• FUGU HUMAN MOUSE RAT
CHICKEN FROG — ZEBRAFISH
HUMAN MOUSE FUGU HUMAN
MOUSE
99 HUMAN
99 99
MOUSE - RAT
HUMAN r - M OUSE
n—
99
99 •— RAT CHICKEN
99 | HUMAN I MOUSE
CHICKEN ZEBRAFISH
KIT
CSF1R
4ql2
5q33.2-33.3
PDGFR a 4ql2
| PDGFRp 5q31-32
I FLT3 13ql2
VEGFR1 13ql2
VEGFR2 4ql2
VEGFR 3 5q35.3
Fig. 4. Phylogenetic tree of the PDGFR gene family. Phylogenetic hierarchy along with
chromosome synteny provides evidence for the occurrence of the duplication of chromosome
blocks during the PDGFR family evolution. The chromosome region 13ql2 (Flt3 and
VEGFR), 4ql2 (Kit, PDGFR-a and VEGFR2), and 5q33-35 (CSF1-R, PDGFR-p and
VEGFR3) might be generated from a common ancestral chromosomal block following a
series of chromosome duplications.
20
Table 1: The coefficient of functional divergence (0) of pairwise comparisons of tissue-
specific genes of protein tyrosine kinase gene families.
Gene family name Pairwise comparison" 6 ± se
Receptor Kinase
Epidermal Growth Factor Receptor (EGFR)
Egfr (5) ErbB2 (4) 0.306db0.084
Receptor Kinase
Epidermal Growth Factor Receptor (EGFR) ERfr(5) ErbB3/4 (6) 0.184+0.046
Receptor Kinase
Epidermal Growth Factor Receptor (EGFR)
ErbB2 (4) ErbB3/4 (6) 0.132±0.103
Receptor Kinase
Insulin receptor (InsR/Igr-lR/Irr) InsR (7) Igr-1R (8) 0.137±0.041
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR)
Kit (14) Fms (5) 0.215+0.032
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR) Kit (14) Pdgfr (9) 0.161+0.026
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR)
Kit (14) Vegfr(ll) 0.287±0.031
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR)
Fms (5) Pdgfr (9) 0.098±0.035
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR)
Fms (5) VeKfr(ll) 0.342±0.043
Receptor Kinase
Platelet-derived growth factor receptor (PDGFR)
PdK& (9) Vegfr (11) 0.233±0.032
Receptor Kinase
Fibroblast growth factor receptors (FGFR)
F*#l (7) Ff#2(9) 0.275±0.075
Receptor Kinase
Fibroblast growth factor receptors (FGFR) F#1 (7) FK&3 (9) 0.407±0.066 Receptor
Kinase
Fibroblast growth factor receptors (FGFR)
Fgfrl (7) Fgfr4(8) 0.322±0.061 Receptor Kinase
Fibroblast growth factor receptors (FGFR)
Fgfr2 (9) Fgfr3 (7) 0.210±0.074
Receptor Kinase
Fibroblast growth factor receptors (FGFR)
Fgfr2 (9) Fgfr4 (8) 0.342±0.075
Receptor Kinase
Fibroblast growth factor receptors (FGFR)
Fgfr3 (7) Fgfr4 (8) 0.338±0.060
Receptor Kinase
Trk TrkB (5) TrkC (5) 0.527±0.091
Receptor Kinase
Trk TrkB (5) Trk-related (4) 0.754±0.063
Receptor Kinase
Trk
TrkC (5) Trk-related (4) 0.782±0.077
Receptor Kinase
Hepatocyte growth factor receptor Met (6) Ron (6) 0.222±0.037
Receptor Kinase
Ephrin receptor EphA (23) EphB (23) 0.159+0.031"
Receptor Kinase
Tie/Ret Tie (8) Ret (6) 0.444+0.056
Receptor Kinase
Axl Mer (4) Tyro3 (5) 0.298±0.058
Receptor Kinase
Axl Mer (4) Rorl/2 (4) 0.763±0.092
Receptor Kinase
Axl
Tyro3 (5) Rorl/2 (4) 0.796±0.094
Nonreceptor kinase
Src Src (6) Yes (6) 0.283±0.095
Nonreceptor kinase
Src Src (6) Fyn (8) 0.362+0.073
Nonreceptor kinase
Src
Src (6) Lck (4) 0.728±0.094
Nonreceptor kinase
Src
Src (6) Lyn (7) 0.728±0.098
Nonreceptor kinase
Src
Src (6) Brk/Srm (4) 0.466+0.163
Nonreceptor kinase
Src
Yes (6) Fyn (8) 0.424±0.098
Nonreceptor kinase
Src
Yes (6) Lck (4) 0.871±0.112
Nonreceptor kinase
Src
Yes (6) Lyn (7) 0.661+0.111 Nonreceptor kinase
Src
Yes (6) Brk/Srm (4) 0.431±0.222 Nonreceptor kinase
Src
Fyn (8) Lck (4) 0.642±0.081
Nonreceptor kinase
Src
Fyn (8) Lyn (7) 0.768±0.094
Nonreceptor kinase
Src
Fyn (8) Brk/Srm (4) 0.566+0.088
Nonreceptor kinase
Src
Lck (4) Lyn (7) 0.582±0.102
Nonreceptor kinase
Src
Lck (4) Brk/Srm (4) 0.315+0.127
Nonreceptor kinase
Src
Lyn (7) Brk/Srm (4) 0.246±0.179
Nonreceptor kinase
Janus kinase Jakl(7) Jak2 (6) 0.351+0.041
Nonreceptor kinase
Janus kinase Jakl(7) Jak3 (6) 0.280±0.040
Nonreceptor kinase
Janus kinase
Jak2(6) Iak3 (6) 0.361+0.034 a) The number inside the parentheses represents the number of sequences for that
member gene. b) Eph A including Eph A3, Eph A4, EphA5, EphA6, EphA7 and EphA8; EphB including
EphB 1, EphB2, EphB3, EphB4, EphB5 and EphB6.
21
family, 0 between two major groups A and B is estimated to be 0.290 ± 0.045, suggesting
substantial alterations in their site-specific selective constraints, reflected by the difference of
expression, structure and functional role between two groups. In particular, in group A
genes, 0 between Fyn and Src is estimated to be 0.362 ± 0.073. This functional divergence is
well demonstrated by the knock-out assay:fyn mice have alterations in a specific region of
brain, the hippocampus, where some layers have more neurons than wild type mice and cause
the respective layers to undulate; in contrast, src mice resembled wild type controls (Grant et
al., 1992).
Amino acid residues responsible for such functional divergence after gene duplication
can be predicted based on a site-specific profile, which represents the posterior probability of
a site being functional divergence related status given the amino acid configuration. By
choosing a suitable cut-off value, we will obtain a list of candidate sites, which could be
mapped onto the secondary or domain structures. For example, in the Trk family, the
posterior analysis predicts 27 amino acid sites critical for functional divergence between
TrkB and TrkC, given the cutoff value Q (k) > 0.7 (Note: 0 between TrkB and TrkC is
estimated to be 0.527 ± 0.091). Three of predicted sites are located in the Leucine-rich
domain, two in the first Ig-like domain, and two in the kinase domain. Interestingly, another
ten sites reside inside the second Immunoglobulin (Ig)-like domain, which has been shown
the dominant element for neurotrophin-binding specificity in the crystallographic studies
(Uifer et al., 1998; Wiesmann et al., 1999). The spatial distribution of these predicted sites
indicates that they are not equally dispersed through the whole coding region. Instead, the
majority (17/27) are located inside the functional domains, indicating the importance of
domain structure in the functional innovation after gene duplication.
The predicted sites also provide hint for the association between site-specific rate
shift and molecular phenotypes, for example, as shown in hepatocyte growth factor receptor
(HGFR) family. The HGFR family induces mitogenic, motogenic and morphogenic cellular
response, as well as tumorgenesis by triggering a multi-step genetic program called 'invasive
growth' including cell-dissociation, invasion of extracellular matrices and growth. Two
member genes (Met and Ron/Sea) have been identified through human to Fugu, a teleost fish
(Fig. 5a). The functional divergence analysis shows significant changes in the functional
22
(a)
100
100
HUMAN
•MOUSE
•CHICKEN
100
100
99
100
100
•FROG
• FUGU3
FUGU2
HUMAN
•MOUSE
RAT i_r" 100 I I
Ron
•CHICKEN
FROG
FUGU1
Met
0.1
(b) (c)
Site-specific profile for functional divergence in HGFR fam ily
Alignment position
M et 101 201 301 401 501 601 701 801 901 1001 1101 1201
site k
Category I II
1 4 4 5 5 7 8 4 6 8 8 2 0 2 1 0 4 7
1111 2 1 1 1 4 9 0 0 9 6 4 4 9 0 6
RON_HUMAN LPCAMG TNLRV RON_MOU S E LPCAMG TTIQA RON_CHICKEN LPCAMG DHEEG RON_FROG LPCAMG LLRDQ PRGFR3_FUGU LPCAVG GQSST PRGFR2_FUGU LPCAMG STAVS MET_MOUSE MFSAIE DBASE MET_RAT MFSAIE DBASE MET_HUMAN IFRPIG DBASE MET_CHICKEN MLDAVG DDASE MET_FROG LIISVQ DDASE PRGFR1_FUGU FYESTK DDASE
Fig. 5. Functional divergence analysis of HGFR gene family. The overall coefficient of
functional divergence between Met and Ron is 0 ± se = 0.222 ± 0.037. (a) Phylogenetic tree
of HGFR gene family, (b) The site-specific profile of posterior probability of functional
divergence between Met and Ron. (c) Amino acid configuration of sites with Q (k) > 0.6.
Predicted sites can be divided into two categories: category I contains sites conserved in Ron
but variable in Met; and category II includes sites conserved in Met but variable in Ron.
23
constraints between Met and Ron (8 = 0.222 ± 0.037). Furthermore, posterior analysis
implies strong variation in evolutionary rate shift of each site. As shown in figure 5b, the
baseline of posterior probability profile is between 0.2-0.3, suggesting that most amino acid
residues do not have significant effects on the overall functional divergence. Only a small
portion of residues has high posterior probability to be responsible for the functional
divergence between Met and Ron. With the cut-off value Q (k) > 0.6, 11 sites were
predicted, which can be divided into two categories: category I contains sites conserved in
Ron but variable in Met; and vice versa, category II includes sites conserved in Met but
variable in Ron (Fig. 5c). Given the observation that the knock-out mice lacking met have
placental-lethal defect, but ron embryos are viable through the blastcyst stage of
development (Uehara et al., 1995; Muraoka et al., 1999), we speculate that Met has
indispensable functions and is likely under more stringent functional constraint than Ron.
Indeed, the highly conservation of these category II sites in Met may be very important to
maintain the specific Met function.
4. Conclusions In conclusion, our comprehensive evolutionary analysis on PTKs reveals that (1) both
large-scale and small-scale gene duplications are the major evolutionary force for generating
the contemporary PTK superfamily. (2) Substantial functional divergence occurred after
gene duplication(s), characterized by the significant shift in evolutionary rates. (3)
Evolutionary functional divergence is correlated with the phenotypic functional divergence in
paralogous genes. These results not only shed light on the role of gene duplication in the
development of hierarchical PTK-mediated network, but also provide impetus for a new
approach to predict functionary divergence from evolutionary changes.
Acknowledgement We thank Yufeng Wang for helpful discussion. This study is supported by the NIH grant
ROI GM62118toX. G.
References
24
Blume-Jensen, P., Hunter, T., 2001. Oncogenic kinase signalling. Nature 411(6835),
355-365.
Cowan, C.A., Yokoyama, N., Bianchi, L.M., Henkemeyer, M., Fritzsch, B., 2000. EphB2
guides axons at the midline and is necessary for normal vestibular function. Neuron 26(2),
417^30.
Gale, N.W., Holland, S J., Valenzuela, D.M., Flenniken, A., Pan, L„ Ryan, T.E.,
Henkemeyer, ML, Strebhardt, K., Hirai, H., Wilkinson, D.G., Pawson, T., Davis, S.,
Yancopoulos, G.D., 1996. Eph receptors and ligands comprise two major specificity
subclasses and are reciprocally compartmentalized during embryogenesis. Neuron 17, 9-19.
Gonfloni, S., Williams, J.C., Hattula, K., Weijland, A., Wierenga, R.K., Superti-Furga, G.,
1997. The role of the linker between the SH2 domain and catalytic domain in the regulation
and function of Src. EMBO J. 16, 7261-7271.
Grant, S.G., O'Dell, T.J., Karl, K.A., Stein, P.L., Soriano, P., Kandel, E.R., J992. Impaired
long-term potentiation, spatial learning, and hippocampal development in fyn mutant mice.
Science 258,1903-1910.
Gu, J., Wang, Y., Gu, X., 2002a. Evolutionary Analysis for functional divergence of Jak
protein kinase domains and tissue-specific genes. J. Mol. Evol. 54, 725-733.
Gu, X., 1998. Early Metazoan divergence was about 830 million years ago. J. Mol. Evol. 47,
369-371.
Gu, X., 1999. Statistical methods for testing functional divergence after gene duplication.
Mol. Biol. Evol. 16, 1664-1674.
Gu, X., Wang Y., Gu, J., 2002b. Age distribution of human gene families shows significant
roles of both large- and small-scale duplications in vertebrate evolution. Nature Genetics 31,
205-209.
Gurniak, C.B., Berg, L.J., 1996. A new member of the Eph family of receptors that lacks
protein tyrosine kinase activity. Oncogene 13(4), 777-786.
Hanks, S.K., Hunter, T., 1995. The eukaryotic protein kinase superfamily: kinase (catalytic)
domain structure and classification. FASEB J. 9, 576-596.
Hubbard, S R., Till, J.H., 2000. Protein tyrosine kinase structure and function.
Annu. Rev. Biochem. 69, 373-398.
25
Iwabe, N., Kuma, K., Miyata, T., 1996. Evolution of gene families and relationship with
organismal evolution: rapid divergence of tissue-specific genes in the early evolution of
chordates. Mol. Biol. Evol. 13, 483-493.
Kumar, S., Hedges, S.B., 1998. A molecular timescale for vertebrate evolution. Nature 392,
917-920.
Li, W.H., 1983. Evolution of duplicate genes and pseudogenes. In: Nei M, Keohn RK (eds)
Evolution of genes and proteins. Sinauer Associates, Sunderland, MA, pp 14-37.
McLysaght, A., Hokamp, K., Wolfe, K.H., 2002. Extensive genomic duplication during early
chordate evolution. Nature Genetics 31, 200-204.
Miyata, T., Suga, H., 2001. Divergence pattern of animal gene families and relationship with
the Cambrian explosion. BioEssays 23, 1018-1027.
Muraoka, R.S., Sun, W.Y., Colbert, M.C., Waltz, S.E., Witte, D P., Degen, J.L., Fiiezner
Degen, S.J., 1999. The Ron/STK receptor tyrosine kinase is essential for peri-implantation
development in the mouse. J. Clin. Invest. 103(9), 1277-1285.
Ohno, S., 1970. Evolution by gene duplication. Springer-Verlag, Berlin.
Pebusque, M.J., Coulier, P., Birnbaum, D., Pontarotti, P., 1998. Ancient large-scale genome
duplications: phylogenetic and linkage analyses shed light on chordate genome evolution.
Mol Biol Evol. 15(9), 1145-1159.
Popovici, C„ Leveuglc, M., Birnbaum, D., Coulier, P., 2001. Homeobox gene clusters and
the human paralogy map. FEB S Lett. 491(3), 237-242.
Robinson, D., Wu, Y., Lin, S., 2000. The protein tyrosine family of the human genome.
Oncogene 19,5548-5557.
Rousset, D., Agnes, R., Lanchaume, P., Andre, C., Galiberl, P., 1995. Molecular evolution of
the genes encoding receptor tyrosine kinase with immunoglobulin like domains. J. Mol. Evol.
41,421-129.
Schlessinger, J., 2000. Cell signaling by receptor tyrosine kinases. Cell 103(2), 211-225.
Sridhar, R., Hanson-Painton, O., Cooper, D R., 2000. Protein kinases as therapeutic targets.
Pharm. Res. 17(11), 1345-1353.
Suga, H., Kuma, K., Iwabe, N., Nikoh, N., Ono, K., Koyanagi, M., Hoshiyama, D., Miyata,
T., Intermittent divergence of the protein tyrosine kinase family during animal evolution.
1997. FEES Lett. 412(3), 540-546.
26
Takezaki, N., Rzhetsky, A., Nei, M., 1995. Phylogenetic test of the molecular clock and
linearized trees. Mol. Biol. Evol. 12(5), 823-833.
Thomas, S.M., Brugge, J.S., 1997. Cellular functions regulated by Src family kinases. Annu
Rev Cell Dev Biol. 13, 513-609.
Uehara, Y., Minovva, O., Mori, C, Shiota, K., Kuno, J., Noda, T., Kitamura, N., 1995.
Placental defect and embryonic lethality in mice lacking hepatocyte growth factor/scatter
factor. Nature 373(6516), 702-705.
Urfer, R., Tsoulfas, P., O'Connell, L., Hongo, J.A., Zhao, W., Presta, L.G., 1998. High
resolution mapping of the binding site of TrkA for nerve growth factor and TrkC for
neurotrophin-3 on the second immunoglobulin-like domain of the Trk receptors. J. Biol.
Chem. 273(10), 5829-5240.
Venter, J.C., Adams, M.D., Myers, E.W. et al., 2001. The sequence of the human genome.
Science 291,1304-1351.
Wang, Y., Gu, X., 2000. Evolutionary patterns of gene families generated in the early stage
of vertebrates. J. Mol. Evol. 51, 88-96.
Wang, Y., Gu, X., 2001. Functional divergence in the caspase gene family and altered
functional constraints: statistical analysis and prediction. Genetics 158, 1311-1320.
Wiesmann, C, Ultsch, M.H., Bass, S.H., de Vos, A.M., 1999. Crystal structure of nerve
growth factor in complex with the ligand-binding domain of the TrkA receptor. Nature
401(6749), 184-188.
Williams, J.C., Wierenga, R.K., Saraste, M., 1998. Insights into Src kinase functions:
structural comparisons. Trends Biochem Sci. 23, 179-184.
27
CHAPTER HI: EVOLUTIONARY PERSPECTIVES FOR FUNCTIONAL
DIVERGENCE OF JAK PROTEIN KINASE DOMAINS AND TISSUE-SPECIFIC
GENES
A paper published in the Journal of Molecular Evolution1
Jianying Gu 2'3, Yufeng Wang2, Xun Gu 2'4
Abstract
Jak (Janus kinase) is a non-receptor tyrosine kinase, which plays important roles in
signal transduction pathways. The unique feature of Jak is that, in addition to a fully
functional tyrosine kinase domain (JHI), Jak possesses a pseudokinase domain (JH2).
Although JH2 lost its catalytic function, experimental evidence has shown that this domain
may have acquired some new but unknown functions. This apparent functional divergence
after the (internal) domain duplication may result in dramatic changes of selective constraints
at some sites. We have conducted a data analysis to test this hypothesis. Our result shows
that shifted selective constraints (or shifted evolutionary rates) between JH1 and JH2
domains are statistically significant. Predicted amino acid sites by posterior analysis can be
classified into two groups: very conserved in JH1 but highly variable in JH2, and vice versa.
Moreover, we have studied the evolutionary pattern of four tissue-specific genes, Jakl, Jak2,
Jak3 and Tyk2, which were generated in the early stages of vertebrates. We found that after
the (first) gene duplication, site-specific rate shifts between Jak2/Jak3and Jakl/Tyk are
significant, presumably as a consequence of functional divergence among these genes. The
implication of our study for functional genomics is discussed.
Introduction
1 Reprinted with permission of Journal of Molecular Evolution, 2002, 54, 725-733. 2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Program of Bioinformatics and Computational Biology (BCB), Iowa State University. 3 Primary researcher and author 4 Author for correspondence
28
Jak protein, a group of non-receptor tyrosine kinases, plays a crucial role in the
cytokine signaling (Darnell et al. 1994; Leonard and O'Shea 1998). The Jak family consists
of four mammalian members: Jakl, Jak2, Jak3, and Tyk2 (Wilks et al. 1991; Harpur et al.
1992; Rane and Reddy 1994). Homologous Jaks from aves and teleosts, as well as a
Drosophila Jak homologue, hopscotch, have also been sequenced (Binari and Perrimon 1994,
Conway et al. 1997). The distinguishing feature of the Jak protein family is the existence of
tandem kinase (JH1) and pseudokinase (JH2) domains. The tandem kinase domain has all the
functional features of a typical tyrosine kinase domain, while the pseudokinase domain has
lost the catalytic activity (Guschin et al. 1995; Feng et al. 1997; Zhou et al. 1997).
Among metazoan protein tyrosine kinases (PTKs), only Jak protein maintains a
pseudokinase domain (JH2). The reason for this phenomenon remains unclear. JH2 has all
the subdomains that are shared by typical tyrosine kinase domains, indicating a common
origin with other kinase domains, e.g., via internal domain duplication. Thus, these two
domains in Jak proteins provide an interesting case study of functional divergence in proteins
from an evolutionary perspective. It is believed that the lack of some typical motifs, shared
by functional kinase domains, in the pseudokinase domain rendered this domain catalytically
inactive (Wilks et al. 1991; Frank et al. 1995). On the other hand, knock-out evidence has
shown either abrogation or stimulation of kinase activity when the JH2 domain is deleted in
Jak2, Jak3 or Tyk2 (Velazquez et al. 1995; Candotti et al. 1997; Luo et al. 1997). These facts
imply that after the internal domain duplication, the pseudokinase domain has undergone
loss-of-function and gain-of-function simultaneously. An interesting question in molecular
evolution is whether functional alterations in the JH2 domain result in considerable changes
in selective constraints (i.e., different evolutionary rate) at those sites involved. As a
consequence, some well-conserved sites in JH1 domain become highly variable in the JH2
domain, and vice versa. Such a pattern has been described as "covarion behavior" (Gaucher
et. al 2001), or Type I functional divergence (Gu 1999).
In this paper, we use Gu's (1999) method to test the hypothesis that altered selective
constraints at some positions occurred between the JH1/JH2 domains. Then, we studied the
evolutionary pattern among vertebrate Jak tissue-specific genes, and compared this to the
JH1/JH2 domain evolution. The implication of our results for functional divergence and
specification in Jak/Stat pathway is discussed.
29
Methods
We searched the Pfam database (http://pfam.wustl.edu/) for sequences that have
tandem kinase or pseudokinase domains of Jak genes. For comparison, the kinase domain
sequences of FGFR (Fibroblast Growth Factor Receptor) and EGFR (Epidermal Growth
Factor Receptor) gene families were also downloaded.
To find all available sequences that belong to Jak gene family, the Gapped BLAST and PSI-
BLAST searches were performed in several protein databases by using the human Jakl gene
as a query sequence (Altschul et al. 1997). After the exhaustive search, partial sequences and
redundant sequences were removed from further analysis. The final data set includes 22
complete vertebrate Jak sequences and 1 Drosophila homologous gene Hopscotch (see figure
4 legend for accession number and species).
Mw&zipk Analysis
The multiple hidden Markov Models (HMM) alignment of tandem kinase and
pseudokinase domains was obtained from Pfam, followed by manual editing. The multiple
alignment of complete Jak sequences was obtained using the program Clustal X (Thompson
et al. 1997). These multiple alignments are available upon request. A phylogenetic tree was
inferred by the neighbor-joining method (Saitou and Nei 1987) using the software MEGA2.0
(http://www.megasoftware.net/) (Kumar et al. 1994). Parsimony (PAUP) and likelihood
methods (PHYLIP) were also used for phylogenetic reconstruction to examine sensitivity to
tree-making method.
The ratio of nonsynonymous rate to synonymous rate (dn/ds) was calculated by using
a modified version of Gojobori and Nei's method (in the software MEGA2.0). When
assuming that the synonymous substitution is virtually neutral, dn/ds > 1 indicates positive
selection, dn/ds < 1 indicates negative selection, and dn/ds = 1 indicates neutral evolution.
Tacfmg f /zmcfiomoZ (o&ereJ aefecfiyg cofwArawf)
Type 1 functional divergence refers to the evolutionary process that results in altered
selective constraints (different evolutionary rates) between two duplicate genes, regardless of
30
the underlying evolutionary mechanisms (Gu 1999, 2001). Its functional-structural basis was
well illustrated in the case study of Caspase family (Wang and Gu 2001). Consider two gene
clusters generated by duplication. Briefly speaking, in each cluster an amino acid site is
called an Fi-site (functional divergence-related) if its evolutionary rate differs from that in
the ancestral gene; otherwise it is called an F0-site (i.e., functional divergence-unrelated).
Consequently, in the case of two gene clusters, a site can be in either of two states: (1) So or
(2) Si, where So is defined as a site being Fq in both clusters and Si is defined as a site being
Fi in at least one cluster. The coefficient of Type 1 functional divergence, 8A,B, between two
gene clusters A and B is defined as the probability of a site being in the S, state. Thus, a null
hypothesis of 0 = 0 means that the evolutionary rate is virtually the same between duplicate
genes at each site. Note that many models for rate variation among sites are the special case
of 0 = 0. A probabilistic model has been developed (Gu 1999) to estimate 0 by establishing
the relationship between this type of functional divergence and the observed amino acid
configurations.
Rejection of the null hypothesis (i.e., 0 > 0) provides statistical evidence for altered
selective constraints of amino acid sites after gene duplication. Thus, we can use a site-
specific profile to identify responsible amino acid sites. Specifically we define Qk to be the
posterior probability that site k is in state S\ (0 < Qk < 1). Large Qk indicates a high possibility
that the functional constraint (or, the evolutionary rate) of a site is different between two
clusters.
The software DIVERGE (http://xgul.zool.iastate.edu) was used for the functional
divergence analysis.
FKMcfzbfwz/ dw&zMce omzfyg «
One limitation of the two-cluster analysis described above is that it cannot test
whether one duplicate gene has more shifted evolutionary rates than the other one. This issue
is addressed by a simple method described below that can be applied when more than two
homologous gene clusters are available (Wang and Gu 2001). First, the Type I functional
distance between any two clusters is defined as dF = -In (1-0). Under the assumption of
independence, Wang and Gu (2001) have shown that dp is additive, i.e., for two clusters A
and B, dF (A,B) = bF (A) + bF (B), where bF(x) is the functional branch length of a given gene
31
cluster x. Large bp value for a gene cluster indicates the evolutionary conservation may be
shifted at many sites.
The estimated coefficients of Type 1 functional divergence (0) for all the pairs of
clusters can be used to create a matrix of ^ values. Given this matrix, a standard least
squares method can be implemented based on the formula dp (A,B) = bp (A) + bp (B) to
estimate bp for each gene cluster. If bp ~ 0, it indicates that the evolutionary rate of each site
in this duplicate gene has remained nearly the same since the gene duplication event,
indicative that the derived state is more similar to the ancestral state for this particular cluster.
Results
Pattern of functional divergence between tandem kinase and pseudokinase domains
The kinase and pseudokinase domains in Jaks
In addition to the regular kinase domain (JH1), Jak proteins contain a pseudo-kinase
domain (JH2), with a functional role that has not been clearly determined (Aringer et al.
1999). To explore the evolutionary pattern of JH1 and JH2 domains, we reconstructed a
neighbor-joining (NJ) tree, including Jaks and two closely related protein tyrosine kinases,
FGFR and EGFR. The inferred phylogeny shows that tandem kinase (JH1) and pseudokinase
(JH2) domains are evolutionary y distinct (Figure 1). Indeed, tandem kinase domains (JH1) in
Jaks appear to be more closely related to the functional kinase domains of FGFRs and
EGFRs, while the pseudokinase domains (JH2) of Jaks form a unique clade. In fact, we have
found that the JH2 domain had been generated before the emergence of most member genes
of the Protein tyrosine kinase supergene family (data not shown).
It has been suggested that the pseudokinase domains (JH2) no long exhibit the
catalytic activity, but may have acquired some new functions (Aringer et al. 1999). Thus, it is
interesting to test whether this functional divergence resulted in shifted selective constraints
(different evolutionary rates) at some sites between tandem kinase (JH1) and pseudokinase
(JH2) domains. To this end, we estimated the coefficient of functional divergence (0)
between JH1, JH2, EGFR and FGFR domains, as shown in the upper diagonal of Table 1. All
of the estimates are significantly greater than 0 (p <0.01). In particular, 0 jhi,jh2 ± se = 0.412
± 0.049 provides statistical evidence for supporting the hypothesis of altered selective
constraints between tandem kinase (JHI) and pseudokinase (JH2) domains in Jak proteins.
32
92
99r 99£l
99
70
99
99 C,
99
99
80
96 99
JAK1_HUMAN JAK1_MOUSE JAK1_CHICKEN
|—JAK1_COMMON CARP 99I—JAK1_ZEBRAFISH
JAK1_PUFFERFISH TYK2_HUMAN
TYK2_MOUSE TYK2_PUFFERFISH
8 J JAK2_HUMAN (ljAK2_PUFFERFISH Y JAK2_MOUSE
JlJAK2_RAT 8SJAK2_PIG —JAK2_ZEBRAFISH JAK2_ZEBRAFISH
JAK3_HUMAN JAK3_MOUSE
99>-JAK3_RAT — JAK3_CHICKEN
JAK3_COMMON CAR F
JH1
991—J' L-T-QQ*—J
99L - JAK3_PUFFERFISH — JAK_DROSOPHILA
EGFR_HUMAN -FGFR_HUMAN
JAK1_HUMAN JAK1_PfG JAK1_MOUSE
85^JAK1_RAT JAK1_CHICKEN JAK1_PUFFERFISH
JAK1_COMMON CARP JAK1_ZEBRAFISH
TYK2_HUMAN TYK2_MOUSE
TYK2_PUFFERFISH
EGFR FGFR
99 92
99
97
93r JAK2_HUMAN 99 JAK2_PIG
jr JAK2__MOUSE 67- JAK2_RAT
JAK2_PUFFERFiSH JAK2_ZEBRAFISH 991 JAK3_HUMAN
1 JAK3_RAT
JH2
81
991
-JAK3_CHICKEN JAK3_COMMON CARP JAK3_PUFFERFISH
0.1
Figure 1. The NJ tree of Jaks, FGFRs and EGFRs based on the sequence alignment of kinase
domains. The statistical reliability of the inferred tree topology was assessed by the bootstrap
technique (Felsenstein 1985).
33
Table 1: 0Ab values from all the combinations of the pair-wise comparisons for Jaks, FGFRs,
and EGFRs.
JHI JH2 FGFR EGFR
JHI 0.412±0.049 0.22210.055 0.496±0.071
JH2 0.531 0.566±0.080 0.561±0.080
FGFR 0.251 0.835 0.432±0.096
EGFR 0.685 0.823 0.566 Note: JH1: the tandem kinase domain in Jak; JH2: the pseudokinase domain in Jak. The upper diagonal shows 0AB for all the pair-wise comparisons for Jaks, FGFRs, and EGFRs, where 0As is defined as the coefficient of the functional divergence between cluster A and cluster B. The lower diagonal shows dp(A,B)= = -In (1-0) which is defined as the functional distance between cluster A and cluster B.
To further test the hypothesis that JH2 is more functionally divergent than JH1, we
conducted the functional distance analysis (Wang and Gu 2001). The functional distances
between domains [ dp - -In (1-0) ] are shown in the lower diagonal of Table 1. Subsequently,
the functional branch length (bF) for each domain was estimated by the least-squares method,
which can be illustrated as a star-like tree (Figure 2). The null hypothesis of equal functional
branch lengths is rejected (p < 0.05). Interestingly, the functional branch length of the
pseudokinase domain (JH2) of Jak proteins is about four times greater than that of the kinase
domain (JHI), indicating that the distinct functional role of JH2 is likely generated by the
episodic evolution of kinase domains.
Important amino acid sites for altered functional constraints between JHI and JH2
A site-specific profile based on the posterior probability (Qk) is used to identify
critical amino acid sites that are responsible for functional divergence between tandem kinase
(JHI) and pseudokinase (JH2) domains. Among 212 amino acid sites, 154 (73%) sites have
Qk < 0.5 implying low probability of contribution to functional divergence. For the remaining
58 amino acid sites with Qk > 0.5, 21 of them show a very high probability of being
functional divergence-related [Qk > 0.9]. These 21 sites can be definitively grouped into two
34
— JHI (Kinase domain)
_ EGFR (Kinase domain)
— FGFR (Kinase domain)
— JH2 (pseudokinase domain) 0.479 (JH2)
Figure 2. The tree-like topology of kinase domains of Jaks, FGFRs, and EGFRs in terms of
the functional distance bp.
Table 2. Predicting critical amino acid sites responsible for functional divergence between
tandem kinase and pseudokinase domain in Jaks.
I H Position (A:) & Position^) Gt
26 0.997 38 0.998 203 0.992 47 0.967
30 0.975 41 0.960 20 0.968 190 0.959 46 0.967 105 0.956 23 0.966 103 0.954
162 0.963 44 0.931 70 0.959 121 0.918
137 0.957 87 0.906 21 0.922
159 0.913 196 0.906
0.118 (JHI)
0.421 (EGFR)
0.21 (PGFR)
Category I: conserved in the tandem kinase domain; variable in the pseudokinase domain; Category II: conserved in the pseudokinase domain; variable in the tandem kinase domain.
35
categories: (1) conserved in the tandem kinase (JHI) domain, but variable in the
pseudokinase (JH2) domain; (II) conserved in the pseudokinase domain, but variable in the
tandem kinase domain (Table 2).
Category I: Of the 12 sites belonging to this category, one site, site 137 (QJS? =
0.957), has been demonstrated to be a determining site for the function of the tandem kinase
domain (JHI), corresponding to the second tyrosine (highlighted Y) of a conserved (E/D)YY
motif in Jak2 protein. This motif, which is located in the activation loop of Jak2, regulates
the kinase activity by autophosphorylation (Feng et al. 1997). In Tyk2, these two consecutive
tyrosines (YY) have also been identified as phosphorylation sites (Gauzzi et al. 1996).
Interestingly, the multiple alignments clearly show that site 137 is invariant in tandem
kinase (JHI) domains. In contrast, the same position in pseudokinase domains (JH2) has a
variety of amino acids with very different chemical properties. For example, some JH2
domains have amino acids with non-polar side chains such as glycine, alanine and proline,
and some of them have uncharged polar amino acids like serine and threonine (Figure 3A).
This observation can be explained as a relaxed selective constraint that was caused by loss-
of-function in phosphorylation in JH2 domains.
Category 11: Nine predicted sites belong to this category (Figure 3B). Among them,
site 103 is predicted to be highly functional-divergence related [QJOS = 0.954], Experimental
data show that a glutamic acid (E)-to-lysine (K) replacement occurring at this site in the
pseudokinase (JH2) domain hyperactivated Jak-Stat pathway in Drosophila and mammalian
species (Luo et al. 1997). It seems likely that after the internal domain duplication, the
tandem kinase domain (JHI) largely maintained the original catalytic function, while the
pseudokinase domain (JH2) may have achieved some unidentified new functions, resulting in
a set of JH2-specific conserved sites.
Pattern of functional divergence between Jak member genes
Functional divergence of Jak gene family
Figure 4 shows the neighbor-joining (NJ) tree of four Jak gene members (JakI, Jak2,
Jak3 and Tyk2). Parsimony method (PAUP) and likelihood method (PHYLIP) give
essentially the same topology (data not shown). The phylogenetic analysis indicates that Jak
member genes were generated by two gene duplications in the early stage of vertebrates, i.e.,
36
(A) (B)
JHI
JH2
11112 1111 222234735690 344480029 013606079263 814773510
HUMAN DPDGALPYYSRC HUMAN KERYQGRST MOUSE DPDGALPYY SRC MOUSE KERYQGRST CHICKEN DPDGALPYYSRC
Jakl CHICKEN KERYQGRSV
CARP DPDGALPYYSRC CARP WHRYTARNV ZEBRAFISH DPDGALPYYSRC ZEBRAFISH WHRYTGRNV PUFFERFISH DPDGALPYYSRC PUFFERFISH SDKFTGKNL HUMAN DPDGALPYSSRC HUMAN EEKQKGKNL MOUSE DPDGALPYS SRC MOUSE EEKQKGKNL RAT DPDGALPYSSRC Jak2 RAT EEKQKGKNL PIG DPDGALPYSSRC
Jak2 PIG EEKQKGKNL
PUFFERFISH DPDGALPYSSRC PUFFERFISH EEKQKGKNL ZEBRAFISH DPDGALPYSSRC ZEBRAFISH EEKQKAKSL HUMAN DPHGALPYSSRC HUMAN QQKHRGRSL MOUSE DPDGALPYSSRC MOUSE QQKHRGRSL RAT DPDGALPYSSRC Jak3 RAT QQKHRGRSL CHICKEN DPDGALPYSSRC CHICKEN EQHQTGQSL CARP DPDGALPYSSRC CARP QQSHRQMSF PUFFERFISH DPDGALPYSSRC PUFFERFISH KKSHRQLSI HUMAN DPDGALPYYSRC HUMAN KDRYQHHNL MOUSE DPDGALPYYSRC Tyk2 MOUSE QERYQHQNL PUFFERFISH DPDGALPYSSRC PUFFERFISH INKDQHKRL
HUMAN MDEKIVESSARK HUMAN FAMSWEKRF PIG MDEKIVESSARK PIG FAMSWEKRF MOUSE LDEKI VETS ARK MOUSE FAMSWEKRF RAT LDEKIVETSARK
Jakl RAT FAMSWEKRF
CHICKEN LNNELVESSAMK Jakl CHICKEN FAMSWEKRF CARP KLYEIIQSSAQE CARP FAMSWEKRF ZEBRAFISH KPYEVIQSTAQD ZEBRAFISH FAMSWEKRF PUFFERFISH RVSEWQTSAQT PUFFERFISH FAMSWEKRF HUMAN REQELLKPNTQA HUMAN FAMSWENRF MOUSE REQKLLKPNTQT MOUSE FAMSWEKRF RAT REQELLKPTTQT RAT FAMSWEKRF PIG REQELLKPNTQT JaK2 PIG FAMSWEKRF PUFFERFISH KELQVLKPSAQI PUFFERFISH FAMSWEKRF ZEBRAFISH REEKVLRPSAQT ZEBRAFISH FAMSWEKRF HUMAN HEEKLVHS SAQT HUMAN FAMSWEKRF RAT REEDLVY SNAQT RAT FAMSWEKRF CHICKEN RDEQVLRAAAQS Jak3 CHICKEN FAMSWEKRF CARP TDVTLIKGECNT
Jak3 CARP FAMSWEKRF
PUFFERFISH SNGRFFEGTSQT PUFFERFISH FAMSWENRF HUMAN RVREWES SMRP HUMAN FAMSWEKRF MOUSE RVSQWESGTQP Tyk2 MOUSE FAMSWEKRF PUFFERFISH QVSDVLKTRPRK
Tyk2 PUFFERFISH FVMSWEKRF
Figure 3. Functional divergence related amino acid sites candidate. (A) category I:
conserved in tandem kinase domains (JHI ), variable in pseudokinase domains (JH2); (B)
category II: conserved in pseudokinase domains (JH2), variable in tandem kinase domains
(JHI).
37
">nd
\ 99
99
97
99
. MOUSE
RAT
HUMAN
PIG — PUFFERFISH
- ZEBRAFISH
HUMAN
MOUSE
RAT Cc' CHICKEN
CARP
— PUFFERFISH
HUMAN
MOUSE — PUFFERFISH
83
«Lrt
83r HUMAN
PIG
MOUSE
- CHICKEN
99 CARP
L_ 2 99
ZEBRAFISH
PUFFERFISH
Jak2
Jak3
]ïyk2
Jakl
DROSOPHILA
0.1
Figure 4. The phylogenetic tree of the Jak gene family. The neighbor-joining algorithm was
used to infer the topology based on the multiple sequence alignment with Poisson distance.
Bootstrap scores >50% are presented. 1st and 2nd represent the time points of two rounds of
gene duplications. Accession numbers: Jak2: L16956(mouse); U13396(rat);
AF005216(human); AB006011 (pig); AF090382(pufferfish); AJ005690(zebrafish). Jak3:
U09607(human); L40172(mouse); D28508(rat); AF034576(chicken); AF148993(carp);
AF091238(pufferfish). Tyk2:X54637(human); AF173032(mouse); AF090383 (pufferfish).
Jakl: M64174(human); AB036335(pig); M33425(mouse); AF096264 (chicken);
L24895(carp); U82980(zebrafish); U53213(pufferfish). Hopscotch: L26975{Drosophila)
38
before the emergence of teleosts. The first gene duplication resulted in the common ancestor
of Jak2/3 and Tyk2/Jakl, followed by the second one resulting in the current four member
genes. Note that Wang and Gu (2000) have shown that many tissue-specific gene families
show a similar pattern, raising a possibility of large-scale duplication(s) in early vertebrates.
As nucleotide change in the regulatory region after gene duplication may be the first
step for duplicate gene preservation (Force et al. 1999), amino acid replacements are
responsible for the functional divergence for specification among tissue-specific isoforms. To
explore whether (site-specific) altered selective constraints occurred during Jak family
evolution, we estimated 0 between Jak member genes, as well as between two groups (Jak2/3
and Tyk2/Jakl) that was generated by the first round gene duplication. The estimation is
based on the phylogenetic tree in figure 4. We have found that (1) (site-specific) altered
selective constraint after the first gene duplication is statistically significant, and (2) the 0
value between Jak2 and Jak3 is 0.019 ± 0.059, indicating overall no significant site-specific
shift of evolutionary rate between them (Table 3 A).
Furthermore, we have estimated the coefficients of Type I functional divergence
between Jak genes for three separate regions: (1) the tandem kinase domain (JHI), (2) the
pseudokinase domain (JH2), and (3) the surrounding region excluding the JHI and JH2
domains (Table 3B). Our results can be summarized as follows. First, after first-round gene
duplication, site-specific altered selective constraint in the JHI domain is statistically
significant (p < 0.01), but not significant in the JH2 domain; there is weak evidence in the
surrounding region (p ~ 0.05). Second, after the second gene duplication, no statistical
evidence is observed for altered selective constraint in any domain. Since the estimate of 0 in
the JHI domain tends to be higher than those of JH2 domain or the surrounding region, we
conclude that JHI domain is mainly responsible for functional divergence among Jak tissue-
specific genes.
Isoform-specific functional divergence of Jak gene family
The functional branch lengths (bp) of Jakl, Jak2, and Jak3, can be estimated from the
pair-wise estimates of 0 between Jak 1, Jak2, and Jak3, respectively. Table 4 shows the
results for the whole length of Jak proteins, and three domains (JHI, JH2, and the
surrounding region), respectively. Interestingly, the level of altered selective constraints of
39
Table 3. The pair-wise coefficient of functional divergence (844) for different member genes
of Jak gene family. (A) 0 values for the full length Jak proteins; (B) 0 values for the different
regions of Jak proteins.
(A) Cluster 1 Cluster 2 Gene 8 ± se ̂(ML) LRT^ P value
duplication ^ Jakl Jak2 I 0.208 ± 0.056 13.609 <0.0005 Jakl Jak3 I 0.158 ±0.051 9.502 0.002 Jak2 Jak3 II 0.019 ± 0.059 0.110 0.75
Jakl/Tyk2 Jak2/Jak3 I 0.072 ± 0.027 7.292 0.007 Jakl Jak2/Jak3 I 0.147 ±0.039 14.367 <0.0005
Note: As a single gene cluster, Tyk2 was excluded from the analysis due to the insufficient number of sequences.
(B) Cluster 1 Cluster 2 Gene JHI JH2 Surrounding Full length
duplication
Jakl Jak2 I 0.435+0.140 0.21410.125 0.294+0.100 0.208±0.056 Jakl Jak3 I 0.231+0.086 0.052±0.143 0.178+0.089 0.158+0.051
Jak2 Jak3 II 0.217±0.150 0.001±0.022 0.022+0.100 0.019±0.059
Jakl/Tyk2 Jak2/Jak3 I 0.122±0.050 0.094±0.068 0.101±0.053 0.072±0.027 Jakl Jak2/Jak3 I 0.174±0.067 0.086±0.097 0.225±0.074 0.147±0.039
(1) I represents the first gene duplication; II stands for second gene duplication. (2) se represent standard error.
(3) likelihood ratio statistic.
Table 4. Functional branch lengths (bp) of different regions in three Jak isoforms.
JHI JH2 Surrounding Full length dg/d, (Full length)
Jakl 0.295 0.147 0.261 0.193 0.074
Jak2 0.277 0.095 0.087 0.040 0.089
Jak3 -0.032 -0.094 -0.065 -0.021 0.274
40
member genes, measured by bF, follows bp (Jakl) > bp (Jakl) > bp (JakS), while the level of
altered selective constraints of domains follows JHI > surrounding region > JH2. In
particular, the bp for Jak3 is virtually zero for all three domains, whereas JH2 (the
pseudokinase domain) shows no significant functional branch length in Jak2 and Jak3.
Table 4 also includes the ratios of nonsynonymous to synonymous rates (dn/ds) of
Jak member genes, based on human-mouse orthologous members of a gene, which can be
used to measure the difference of selective constraints among member genes of a gene family
(Tsunoyama et al. 1998). Interestingly, Jak3, which has virtually bp = 0, shows the highest
dn/ds ratio. This observed negative association between dn/ds and bF can be interpreted as
follows. In the early stage after gene duplication, the functional divergence may occur in one
of two lineages (measured by large bF value). If this process leads to the acquisition of some
new functions, a stronger functional constraint (measured by low dn/ds value) is expected.
In summary, during Jak tissue-specific gene evolution, Type I functional divergence
(i.e, site-specific altered selective constraints) occurs mainly in the JHI domain of Jakl and
Jak2 genes. As Jakl and Jak2 genes may have evolved specialized functions in their kinase
domains, Jak3 is more likely to have inherited the ancestral function.
Predicting important amino acid sites for Type I functional divergence
Critical amino acid sites responsible for the Type I functional divergence among Jak
member genes can be predicted by Qk, the posterior probability of being functional
divergence-related at site k (Gu 1999). Figure 5 shows the site-specific profile between
Jakl/Tyk2 and Jak2/Jak3 clusters, measured by Qk. It clearly shows that, after the first gene
duplication, only a small portion of sites have undergone shifted rates, as indicated by high
posterior probabilities. These sites may be considered as prime candidates for further
functional assays.
Discussion
Domain and gene duplications are both important for Jak gene family proliferation
and evolution. After gene (or domain) duplication, one copy mainly retains the original
function, whereas the other copy is under relaxed evolutionary constraint that may result in
functional divergence and/or specification (Ohno 1970; Li 1983). In this paper, we have
41
Jak1/Tyk2 vs. Jak2/3 0.8
0.6 X
(f)
0.2
1 51 101 151 201 251 301 351 401 451 501 551 601 651 701 751 801 851 901 951
alignment site
Figure 5. Site specific profiles (Q&) for clusters Jakl/Tyk2 vs. Jak2/Jak3.
investigated the evolutionary pattern of tandem kinase (JHI) and pseudokinase domains
(JH2), a unique feature in the super-gene family of tyrosine protein kinases. Our results are
summarized as follows.
In the very early stage of animals, an internal duplication had occurred within the
ancestor of Jak gene, producing two kinase domains. As one domain (JHI) largely retains the
original kinase activity, the other one (JH2) is free to accumulate amino acid replacements
because of functional redundancy. When some key sites were replaced by other amino acids
without serious deleterious effects on the survival, the kinase function of the JH2 domain was
lost. Consequently, some well-conversed sites in functional kinase domains turn out to be
highly variable. Meanwhile, the JH2 domain appears to have acquired some new functions
that may explain its long-term existence during evolution. Despite the lack of substantial
evidence for a physiological role of JH2, we indeed observed a group of sites in the JH2
domain to be very conserved, whereas they were variable in the JHI domain.
It seems likely that the functional divergence between JHI and JH2 domains had
completed before the origin of vertebrates. In the early vertebrate lineage, two rounds of gene
duplications have generated four tissue-specific vertebrate i so forms. Type I functional
divergence (altered selective constraint) is significant after the first round of gene
duplication. This divergence occurred mainly in the JHI domain, but not the case in the JH2
42
0.8
0.6 g ë 0.4
0.2
relative alignment position
Figure 6. The correlation of 9 value with sequence alignment. Y-axis represents different 0
values corresponding to the alignment by sliding positions (-10 to +10) relative to the
manually adjusted alignment of kinase domains in Jaks.
domain, indicating that the JHI domain probably plays a major role of in functional
specification among tissue-specific isoforms.
Since our analysis is based on the amino acid alignment, reliability of the multiple-
alignment is crucial for our interpretation. It has been indicated that the coefficient of Type I
functional divergence (0) can be overestimated by the misaligned sequences (Gu 1999,
2001). Indeed, we found a larger estimate of 8 (8 = 0.615) between JHI and JH2 for the
unedited Pfam domain HMM alignment, which includes several apparent misaligned sites.
After making manual corrections at these sites according to the consensus sequence of
protein kinase domains that have been confirmed by the secondary structure, the estimate of
0 is reduced to 0.439. Furthermore, we tested the effect of the alignment by shifting the
alignment window between JHI and JH2 from -10 to +10 positions. Figure 6 shows that all
estimates of 0 that resulted from the shifted alignment are extremely high. Our study suggests
that a reasonably good alignment is a prerequisite for estimating the level of functional
divergence.
Gu's (1999) method applies only to the Type I functional divergence that resulted in
site-specific altered selective constraints among homologous genes within a gene family. For
43
example, the amino acid site 137 shows a typical Type I pattern, i.e. invariant tyrosine in all
tandem kinase domains, but highly variable in pseudokinase domains. It should be noted that
it is only one of many approaches to detect functional divergence from the evolutionary
perspective. As indicated by many authors (e.g., Casari et al. 1995; Livingstone et al. 1996;
Gaucher et al. 2001) functional divergence may result in dramatic change in amino acid
properties but not in selective constraints, which is also called Type II functional divergence
(Gu 1999). For instance, cluster-specific sites (or diagnosis site) usually refer to those sites
that, though highly conserved within gene clusters, consist of amino acids that are
dramatically different (e.g., positive-charged vs. negative-charged) between clusters. Another
well-cited example is that site-by-site dependence may be related to the site interaction in the
protein's 3D structure (Pollock et al. 1999). A sophisticated model that includes all these
evolutionary aspects of functional divergence of proteins is certainly desirable. However,
because of large volume of unknown parameters, its efficiency in practice often becomes the
major concern. Therefore, a specific model is useful to test the statistical significance of a
specific evolutionary perspective. Combined with previous studies (e.g., Gu 1999, Gaucher et
al. 2001 ; Wang and Gu 2001), our analysis has indicated that site-specific altered selective
constraint occurs after gene duplications as a result of functional divergence in protein
evolution.
Acknowledgement: We are grateful to Dr. William Nordstrom and Peter Vedell for
constructive comments. This study was supported by NIH grant ROI GM62118 (to X. G). X.
Gu is the DuPont Young Professor.
References
Altschul SF, Madden TL, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman DJ (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25: 3389-3402
Aringer M, Cheng A, Nelson JW, Chen M, Sudarshan C, Zhou YJ, O'Shea J J (1999) Janus
kinases and their role in growth and disease. Life Sciences 64: 2173-2186
Binari R, Pern mon N (1994) Stripe-specific regulation of pair-rule genes by hopscotch, a
putative Jak family tyrosine kinase in Drosophila. Genes Dev. 8: 300-312
44
Candotti F, Oakes SA, Johnston JA, Giliani S, Schumacher RF, Mella P, Fiorini M, Ugazio
AG, B ado lato R, Notarangelo LD, Bozzi F, Macchi P, Strina D, Vezzoni P, Blaese RM,
O'Shea JJ, Villa A (1997) Structural and functional basis for JAK3-deficient severe
combined immunodeficiency. Blood 90: 3996-4003
Casari G, Sander C, Valencia A (1995) A method to predict functional residues in proteins.
Mzf. Arwcf. BmZ. 2: 171-178
Conway G, Margoliath A, Wong-Madden S, Roberts RJ, Gilbert W (1997) Jakl kinase is
required for cell migrations and anterior specification in zebrafish embryos. Proc. Natl. Acad.
USA 94: 3082-3087
Darnell JE Jr, Kerr IM, Stark GR (1994) Jak-STAT pathways and transcriptional activation
in response to I FN s and other extracellular signaling proteins. Science 264: 1415-1421
Felsenstein J (1985) Confidence limits on phytogenies: an approach using the bootstrap.
Evolution 39: 783-791
Feng J, Witthuhn BA, Matsuda T, Kohlhuber F, Kerr IM, Ihle JN (1997) Activation of Jak2
catalytic activity requires phosphorylation of Y1007 in the kinase activation loop. Mol. Cell.
17: 2497-2501
Force A, Lynch M, Pickett FB, Amores A, Y an YL, Postlethwait J (1999) Preservation of
duplicate genes by complementary, degenerative mutations. Genetics 151: 1531-1545
Frank SJ, Yi W, Zhao Y, Goldsmith JF, Gilliland G, Jiang J, Sakai I, Kraft AS (1995)
Regions of the JAK2 tyrosine kinase required for coupling to the growth hormone receptor.
J. gfof. CAem. 270: 14776-14785
Gaucher EA, Miyamoto MM, Benner SA (2001 ) Function-structure analysis of proteins
using covarion-based evolutionary approaches: Elongation factors. Proc. Natl. Acad. Sci.
USA 98: 548-552
Gauzzi MC, Velazquez L, McKendry R, Mogensen KE, Fellous M, Pellegrini S (1996)
Interferon-a-dependent activation of Tyk2 requires phosphorylation of positive regulatory
tyrosines by another kianse. J. Biol. Chem. 271: 20494-20500
Gu X (1999) Statistical methods for testing functional divergence after gene duplication.
Mol. Biol. Evol. 16: 1664-1674
Guschin D, Rogers N, Briscoe J, Witthuhn B, Watling D, Horn F, Pellegrini S, Yasukawa K,
Heinrich P, Stark GR, et al. (1995) A major role for the protein tyrosine kinase JAK1 in the
45
JAK/STAT signal transduction pathway in response to interleukin-6. EMBO J. 14: 1421-
1429
Harpur AG, Andres AC, Ziemiecki A, Aston RR, Wilks AF (1992) Jak2, a third member of
the JAK family of protein tyrosine kinases. Oncogene 7: 1347-1353
Kumar S, Tamura K, Nei M (1994) MEGA: Molecular Evolutionary Genetics Analysis
software for microcomputers. Comput. Appl. Biosci. 10: 189-191
Landgraf R, Fischer D, Eisenberg D (1999) Analysis of heregulin symmetry by weighted
evolutionary tracing. Protein Eng. 12: 943-951
Leonard WJ and O'Shea J J (1998) Jaks and STATs: biological implications. Annu Rev
//MfMwnoZ. 16: 293-322
Li WH (1983) Evolution of duplicate genes and pseudogenes. In:Nei M and Keohn RK (eds)
Evolution of Genes and Proteins. Sinauer Associates, Sunderland, MA, pp. 14-37
Li WH (1993) Unbiased estimation of the rates of synonymous and nonsynonymous
substitution. J Mol. Evol. 36: 96-99
Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding
surfaces common to protein families../. Mol. Biol. 257: 342-358
Livingstone CD, Barton GJ ( 1996) Identification of functional residues and secondary
structure from protein multiple sequence alignment. Methods Enzymol. 266: 497-512
Luo H, Rose P, Barber D, Hanratty WP, Lee S, Roberts TM, D Andrea AD, Dearolf CR
(1997) Mutation in the Jak kinase JH2 domain hyperactivates Drosophila and mammalian
Jak-Stat pathways. Mol. Cell. Biol. 17: 1562-1571
Ohno S (1970) Evolution by gene duplication. Springer, New York.
Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum
likelihood identification and relationship to structure. J. Mol. Biol. 287:187-198.
Rane SG, Reddy EP (1994) Jak3: a novel JAK kinase associated with terminal differentiation
of hematopoietic cells. Oncogene 9: 2415-2423
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol. Biol. Evol. 4: 406-425
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X
windows interface: flexible strategies for multiple sequence alignment aided by quality
analysis tools. Nucleic Acids Res. 25: 4876-4882
46
Tsunoyama K, Gojobori T (1998) Evolution of nicotinic acetylcholine receptor subunits.
Mol. Bio. Evol. 15: 518-527
Velazquez L, Mogensen KE, Barbieri G, Fellous M, Uze G, Pellegrini S (1995) Distinct
domains of the protein tyrosine kinase tyk2 required for binding of interferon-alpha/beta and
for signal transduction. J. Biol. Chem. 270: 3327-3334
Wang Y, Gu X (2001 ) Functional divergence in Caspase gene family and altered functional
constraints: statistical analysis and prediction. Genetics. 158: 1311-1320
Wilks AF, Harpur AG, Kurban RR, Ralph SJ, Zurcher G, Ziemiecki A (1991) Two novel
protein-tyrosine kinases, each with a second phosphotransferase-related catalytic domain,
define a new class of protein kinase. Mol. Cell. Biol. 11: 2057-2065
Zhou YJ, Hanson EP, Chen YQ, Magnuson K, Chen M, Swann PG, Wange RL, Changelian
PS, G Shea JJ (1997) Distinct tyrosine phosphorylation sites in Jak3 kinase domain positively
and negatively regulate its enzymatic activity. Proc. Natl. Acad. Sci. USA 94: 13850-13855
47
CHAPTER IV: EVOLUTIONARY ANALYSIS OF FUNCTIONAL DIVERGENCE IN
TGF-B SIGNALING PATHWAY
A paper pulished in the Information Sciences1
Jianying Gu 2, Xun Gu2'3
Abstract
TGF-(3 cytokines and their interacting receptors form an intricate signaling network
and play important roles in cell differentiation and development. Members in gene families
involved in this complicated pathways are usually conserved in sequence but distinct in
temporal and tissue-specific expression. Starting from an evolutionary perspective, we have
statistically tested the functional divergence in major pathway components, targeted the
critical amino acid residues responsible for functional divergence. Our results show that
altered functional constraints is a common pattern in the evolution of TGF-(3 pathway.
Moreover, we have found the correlation between structural divergence and functional
divergence in the extracellular ligand-binding domain of Type II receptors. This study helps
understand the complex intrinsic mechanisms of signaling.
1. Introduction
The transforming growth factor (3 (TGF-(3) family plays indispensable roles in tumor
suppression, cell proliferation, cell differentiation, tissue morphogenesis, lineage
determination, cell motility and apoptosis [1], TGF-j3 and related factors regulate gene
expression by bringing together Type I and Type II receptors of serine/threonine protein
kinases. Interestingly, TGF-(3 and its downstream receptors are expressed in a well-
coordinated temporal and spatial pattern. This typical ligand-receptor signaling mode
provides a basis for understanding the complex regulation network in higher animal
kingdom. Noticeably, considerable sequence similarity has been found across species in gene
1 Reprinted with permission of Information Sciences, 2002,145(3-4), 195-204. 2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University. 3 Author for correspondence
48
families involved in TGF-(3 pathway, which implies that certain level of functional constraint
exerts during evolution [2], Thus, an interesting question is addressed: what lies beneath the
similar member genes to make the differential expression pattern of each component in this
array of signaling? In this study, we shall perform comprehensive statistical analysis to
investigate such functional divergence among conserved families of TGF-(3 pathway
components from an evolutionary perspective.
2. Methods
2.7. iSegwefices
Amino acid sequences of TGF-J3 family, Type I receptor family and Type II receptor
family were searched against the Non-redundant Protein databases at NCBI by the PSI-
BLAST. Partial (<80% of full length) and hypothetical sequences were removed from further
analysis. The accession numbers are available upon request.
2.2. Multiple Alignments and Phylogenetic Analysis
The multiple alignments were obtained by Clustal X. Phylogenetic trees were inferred
by the neighbor-joining (NJ) method using MEGA2.0 (http:// www. megasoftware.net/).
PAUP4.0 and PHYL1P were used to examine the robustness of tree topology.
2.3. Statistical Analysis on Functional divergence - DIVERGE software
It is well recognized that gene duplication is one of major resources for functional
innovation [3,4]. The functional divergence refers to a particular type of functional
innovation after gene duplication that resulting in altered functional constraints (i.e., different
evolutionary rates) between two duplicate genes [5, 6],
Under a two-state probabilistic model, for each duplicate cluster, the evolutionary rate
at some sites may differ from the ancestor, these sites are called F,-sites (functional
divergence-related); otherwise they are called F0-sites. Rate difference between duplicates is
observed only if an Fi site is present in at least one cluster: a so-called S, status. We define
the coefficient of type I functional divergence (0) as the probability of a site being involved in
functional divergence in at least one gene cluster, i.e., 0 = P(S,). In contrast, a site being F0 in
both clusters holds a So status (i.e., the evolutionary rate of each duplicate remains the same
49
as the ancestral state). Maximum likelihood estimate of 0 can be obtained when the
phylogeny of the gene family is given [6],
If 0 is statistically greater than 0, it indicates that functional divergence may have
occurred after the gene duplication. To target the critical amino acid residues responsible for
functional divergence, a posterior analysis is developed to score the probability for each site
being functional divergence related, i.e. P((St|X), where X is the observed amino acid
configuration.
The detailed statistical framework is discussed in Gu [5, 6], and Wang & Gu [7]. The
computational package DIVERGE can be downloaded at http://xgul.zool.iastate.edu/.
2.4. Dating gene duplication events
Linearized NJ trees were used to convert the (average) distance to the geological time
scale. To date the gene duplication events, several speciation time points were used for
calibration: primate-rodent (80 million years ago,my a), mammal-bird (310 mya), mammal-
amphibian (350 mya), tetrapod-teleost (430 mya), vertebrate-Drosophila (830 mya) [8].
3. Results and Discussion
3.1. A scenario of TGF-/3 pathway: Ligand-Receptor Signaling
The most striking feature of TGF-(3 pathway is its fine modulation on a series of
different factors through a tight ligand-receptor interaction. The ligands, TGF-(3 family
members bind to receptors (Type II and Type I, consecutively) that possess intrinsic
serine/threonine kinase activity. Upon ligand binding the receptors form an active
heterotetrameric complex which initiates down-stream intracellular signaling [1],
J. 2. Fw/icfwfwzZ Divergence m Z/gawk
A variety of TGF-(3 ligands with distinct signals have been identified [2]. They
belong to three major families: BMP/GDF (bone morphogenetic proteins/growth and
differentiation factors), TGF-j3, Activin/lnhibin, each of which has undergone substantial
gene duplications. Table 1 shows the ML estimates of 0 values (coefficients of functional
divergence) for pairwise comparison of major member genes. Among 10 estimates, all except
one are significantly greater than 0, suggesting that altered functional constraint after gene
50
Table 1. The pairwise coefficient of functional divergence (0) in TGF-j3 superfamily.
subfamily Cluster 1 Cluster 2 0 ± s.e.
TGF-P TGF-P1 TGF-P2 0.338 ± 0.123 TGF-Pl TGF-P3 0.274 ± 0.098 TGF-P2 TGF-P3 0.619 ± 0.126
Activin Activin PA Activin pB 0.164 ± 0.088 Activin PA Activin PC/E 0.287 ± 0.138 Activin (3B Activin PC/E 0.115 ± 0.105*
BMP BMP2 BMP4 0.250 ± 0.066 BMP2/4 BMP7 0.482 ± 0.074 BMP2/4 BMP9/10 0.330 ± 0.069 BMP7 BMP9/10 0.513 ± 0.111
* we fail to reject the null hypothesis: 0 = 0.
duplication is a general pattern to achieve novel functionality. In particularly, we have
surveyed the pattern of functional divergence within BMP2/4 and TGF-(3s.
j.2.7.
Both BMP2 and BMP4 are important factors for bone morphogenesis in vertebrate
[9]. Tissue-specific expression has been found: in the early limb bud, BMP-4 is found in the
mesenchyme and in the ectodermal ridge at the distal end, whereas BMP-2 is only found in
the ridge, but not in the mesenchyme [10]. Two closely related clades in the NJ tree suggests
that BMP2 and BMP4 arc likely to be generated by a gene duplication event in the very early
vertebrate stage, i.e., after the emergence of amphioxus and before the divergence of
tetrapods and bony fishes, approximately 588 mya (Fig. 1A). Extensive studies have
accumulated evidence that this short time span (430-700 mya) is important for the
emergence of major molecular pathways such as signal transduction, and tissue-specific
isoforms in vertebrates [11].
The overall functional divergence between BMP2 and BMP4 is indexed by the
qf/zmcfzoW dzverggfzcg 8 (0.250 ± 0.066). We statistically rejected the null
hypothesis H0: 0 = 0, and proved that altered functional constraints occurred in at least some
of amino acid residues. By using the posterior analysis, we predicted that among total of
aligned 338 sites, 26 sites are critical for the mutual functional divergence, with the cut-off
51
(A) (B) Category I Category II
®j— HUMAN 81 j- RABBIT
99jl DEER
[_j— MOUSE
oq '—- RAT
CHCKEN
CLAWED FROG
WESTERN FROG
ZEBRAFISH—»
BMP-2
' r iwuuac fl RAT - DEER
- RABBIT
- CHCKEN
i- CLAWED FROG i- WESTERN FROG
ZEBRAFISH
BMP-2/4 DPP
HUMAN RABBIT DEER MOUSE RAT CHICKEN FROG ZEBRAFISH
HUMAN MOUSE RAT DEER RABBIT CHICKEN FROG ZEBRAFISH
1111222 1357882379233 1649056127314
LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR LLAKTFAPVDRQR
ARPNSLPLIGRSK ARPSSLPLIGRPK ARPSSFPLIGRPK ARPSSFPLIGRPK ARPNSLPLVGRPK
TLSAPVLAIGAGN
VRPNGVIMIQSRK ALSETVPLVLAGR
112 44456789132
4501709097322
EVRRGRSLSIFVK DIRRGRSLTIFAR DVRLGRSLTIFVR DVRRGRAVSLFVK DVRRGRAISLFVK DLRLLPVLTVYVK RVHLAQSMSVKVK NFYMDHAFTIGLS
ELRLEYHLSIGIR ELRLEYHLSIGIR ELRLEYHLSIGIR ELRLEYHLSIGIR ELRLEYHLSIGIR
ELRLEYHLSIGIR
ELRLEYHLSIGIR ELRLEYHLSIGIR
AMPHIOXUS FLY
SLSEIWNTVDRGK KKLHHHGPVAVRK
-MLSEFESSIHIS STAELKSKKLTSR
Fig. 1 Phylogeny and functional divergence in BMP2/4. (A) Phylogenetic tree of BMP2/4
by Neighbor-joining method. Bootstrapping values more than 50% were presented. (B)
Sites predicted to be critical for functional divergence between BMP2 and BMP4.
value P,(S i|X) = 0.51 (as these sites are moved from the alignment, 0 drops to 0). They can
be grouped into two categories: Category I: high conservation observed in BMP2 sites, while
considerable variation in BMP4 sites; Category II, in contrast, almost invariant residues in
BMP4, but variable in BMP2 (Fig IB). Both categories show the typical pattern of functional
divergence: altered evolutionary rates and resulting functional divergence after gene
duplication, in which one duplicate copy retains conserved functions and accepts most of
selection pressure, while the other copy is free to accumulate amino acid changes for
functional novelty.
j.2.2. TGF-jg
TGF-P, whose major functional role is in cell differentiation and tissue repair, is
composed of three endogeneous growth factors: (31, (32, and (33. Despite the sequence and
structural similarities [12, 13], these three isoforms have been shown apparent differences in
activity such as ligand binding affinity [14],
52
(A)
t? qq«- F
•ÇË
i
9 p HUMAN
MOUSE
RAT
PIG
CHICKEN
99T '
97fiM
-T-'l—c,
(B)
99r HUMAN
P M ONKEY
L DOG
,_39 r BOVINE
I ' SHEEP
PIG
HORSE
GUINEAPIG
MOUSE
RAT
CHICKEN
FORG
. BASS
— CARP
TGFJ3-1
9, HUMAN
J MONKEY
1 PIG
MOUSE
RAT
ef- HAMSTER
— CHICKEN
— FROG
— CARP
TGFp-2
TGFP-3
TGFb 1 vs. 2
0 . 8
£ 0.6
0.4
CO f-. <J> CM CM CM CM
TGFb 1 vs. 3
0.8
So.6 | 0.4
0.2 Iééé̂ K 3 s a 8 $ R
T— CXJ <N CM CM CM
1
0.8
EO.6 -
S0.4-
0.2
TGFb 2 vs. 3
alignment position
Fig. 2 Phylogeny and functional divergence in TGF-J3 ligand family. (A) Neighbor-joining tree of ligands TGF-[is. Bootstrapping values more than 50% were presented. (B) site-specific profiles for pairwise comparison of three TGF-J3 isoforms. P(Si|X) is the posterior probability for a site being functional-divergence related.
Phylogenetic analysis shows that three TGF-jB isoforms are monophyletic (Fig. 2A).
Two rounds of gene duplication may have occurred to give rise to three paralogous clusters
after the divergence of tetrapods and fishes: the first round generates (31 and the common
ancestor of (32 and (33, then the second round produces (32 and (33. This provides evidence
that TGF-(3 ligand may belong to a large group of cytokines that emerged after large-scale or
genome-wide duplications [11].
The functional divergence analysis reveals that the coefficients of functional
divergence (0) for pair-wise comparison of TGF-(3s are significant, suggesting detectable
functional alteration occurred in the lineage leading to them (Table 1). Furthermore,
posterior statistics implies that strong variation in contribution of each site to the overall
53
functional divergence. Figure 2B presents respective site-specific profiles which have the
common pattern. The baseline of posterior probability for a site being functional divergence
related is about 0.2-0.3, implying that most of amino acid residues do not have significant
effects on long-term functional divergence; while several peaks show that only a small
portion of sites are mainly responsible for functional diversification in two paralogs. This
preliminary targeting strategy along with other available computational approaches provide a
start point for future experimental detection of 'hot-spots' in tissue-specific isoforms [15].
3.3. Functional Divergence in Receptors
Two types of receptors, Type II Receptor and Type I Receptor, have been identified
to bind specific TGF-(3 ligands. They belong to a large serine/threonine kinase superfamily
[1].
3.3.1. Type II receptor: structural divergence —> functional divergence ?
Type II receptor family is divided into five subfamilies: Activin Receptor 11 (ActR-
IIA), ActR-IIB, TGFB-I1R, BMPR-II, and AMHR. The existence of invertebrate homologs
(Drosophila, C.elegans, and sponge) suggests the ancient origin of this family: ActR-IIA and
IIB form a more closely related cluster, while others might have diverged earlier (Fig. 3).
It has been recognized that structural difference may result in the functional
significance [16]. Our recent study in apoptotic caspases has shown the structural basis of
functional divergence [7], Analogously, we found that the functional divergence between two
Type II receptor subfamilies, ActR-IIA and AMHR, may directly associate with their
structural alteration. ActR-IIA binds specifically to Activin, while AMHR only recognizes
MIS. The 3- D structure of ActR-IIA has been revealed and its geometrical difference with
AMHR has been discussed [17]. We obtained the coefficient of functional divergence
between them (0 = 0.709 ± 0.232). Moreover, 33 sites are predicted critical for their
functional divergence and then mapped onto the 3-D structure of ActR-IIA. Figure 4 shows
the alignment of extracellular ligand binding domain. Notice that 13 predicted residues lie in
important secondary structures (a helices, (3 sheets or strands). Similar to the pattern in
BMP2/4 subfamilies, these sites are only conserved in one cluster, either ActR-IIA or
AMHR, but variant in the other. Although the intrinsic mechanism is unclear, the sequence
54
69, HUMAN MOUSE RAT KMNE SHEB=
CHCKBM FROG
HUMAN RAT MOUSE BCMNE
CHICKEN FFIOG QOUTBH
2ŒRARSH FLY
FFOG
• CELBGANS
ActR-IIA
ActRUB
Put QAH
TGFpR-n
99
94 73
M F
—L
HUMAN RABBTT MOUSE RAT
] ]
65 SPONGE
SPCNGE
BMPR.n
AMHR wt Salk
Q15
Fig. 3 Phylogenetic tree of Type II receptors.
alteration may result in structural changes and further functional changes. Interestingly, the
sites which believed to play important role in ligand recognition: site 106, site 108 and site
109 (as shown in Fig. 4) show a noticeable pattern. In ActR-IIA those positions hold highly
conserved Asp. Asp and Lys, while in AmhR are Arg, Lys and Tyr. The amino acid
configurations are invariable in both clusters but whose biochemical properties are very
different, which suggested that these sites might determine the ligand specificity of different
receptors.
3.3.2. Type / recgpfor
Based on the identity of physiological ligands, Type I receptor is divided into 7 major
subfamilies: TfSR-I, ActR-I(3, ALK7, BMPR-IB, BMPR-IA, ALK1, and ActR-I. NJ tree
shows that they fall into separate clades with high bootstrap support (>95%) (Fig. 5). Based
on kinase domain similarity and signaling activities, they can be clustered into three groups.
55
pi Ql pz
1 1 11 1 1 777 88 9 0 0 00 2 3 789 45 6 0 6 89 3 2
HUMAN AILGRSETQECLFFNANWERDRT NQTGVEP-CYGDKDKRRHCFATWKNISGSIEIVKQGCW-LDD RAT AILGRSETQECLFFNANWERDRT NQTGVEP-CYGDKDKRRHCFATWKNISGSIEIVKQGCW-LDD MOUSE AILGRSETQECLFFNANWERDRT NQTGVEP - C Y GDKDKRRHCF ATWKNISG SIEIVKQGCW- LDD BOVINE AILGRSETQECIFYNANWERDRT :--NRTGVES-CYGDKDKRRHCFATWKNISGSIEIVKQGCW-LDD SHEEP AI LGRSETQECIYYNANWEKDKT IAVGIEP-CYGDKDKRRHCFATWKNISSSIDIVKQGCW-LDD CHICKEN AILGRSETQECI YYNANWEKDKT 1AVGIEP -CYGDKDKRRHCF ATWKNI SGS IE I VKQGCW- LDD FROG SILGRSETKECI YYNANWEKDKT NSNGTEI -CYGDNDKRKHCFATWKNI SGSIEIVKQGCW -LDD HUMAN EAPPNRR—TCVFFEAPGVRGSTKTLGELLDTGTELPRAIRCLYSRCCFGIWNLTQDRAQVEMQGCRDSDE MOUSE QVSPNRR--TCVFFEAPGSRGSTKTLGEMVDAGPGPPKGIRCLYSHCCFGIWNLTHGRAGVEMQGCRDSDE RAT QVSPNRR--TCVFFEAPGVRGSTKTLGEWDAGPGPPKGIRCLYSHCCFGIWNLTHGRAQVEMQGCLDSDE RABBIT QAPPNRR—TCVFFEAPGVRGSTKTLGELLDAGPGPPRVIRCLYSRCCFGIWNLTRDQAQVEMQGCRDSDE
P5 P7
1 6 4
HUMAN INC YD - RTDCVEKKD SPEVYFCCCEGN RAT INCYD-RTDCIEKKD SPEVYFCCCEGN MOUSE INCYD-RTDCIEKKD SPEVYFCCCEGN-BOVINE INCYD-RTDCIEKKD SPEVYFCCCEGN-SHEEP INCYD-RTDCIEKKD SPEVYFCCCEGN-CHICKEN INCYD-RNDCIEKKD SPEVFFCCCEGN-FROG INCYN-KSKCTEKKD SPDVFFCCCEGN-HUMAN PGCESLHCDPSPRAHPSPGSTLFTCSCGTD-MOUSE PGCESLHCDPVPRAHPNPSSTLFTCSCGTD-RAT PGCESLHCDPVPRAHPSPSSTLFTCSCGTD-RABBIT PGCESLSCDPSPRARASSGSTLFTCSCGAD-
1 11 7 78 7 90
-MCNEKFSYFPEMEVTQPTSNPVT-PKPP -MCNEKFSYFPEMEVTQPTSNPVT-PKPP -MCNEKFSYFPEMEVTQPTSNPVT-PKPP -MCNERFSYFPEMEVTQPTSNPVT-PKPP -MCNERFSYFPEMEVTQPTSNPVT-PKPP -MCNERFSYFPEMEVTQPTSNPVT-PKPP -YCNEKFYHSPEMEVTQPTSNPVT-TKPP -FCNANYSHLPPPGSPGTPGSQGPQAAPG -FCNANYSHLPPSGNQGAPGPQEPQATPG - F CNANY SHL PPSGNRGAPG PQEPQATPG -FCNANYSHLPPLGGPGTPGPQGPQAAPG
Fig.4 Sequence alignment of extracellular ligand-binding domain of ActR II-A and AmhR.
Secondary structure elements refer to the seqence of human ActR II-A. Predicted critical
residues for functional divergence are listed (site number corresponds to position in sequence
alignment of full length genes).
One group includes T(3R-I, ActR-I(3 and ALK7; another includes BMPR-IB and BMPR-IA;
and the third includes ALK1, and ActR-I.
The functional divergence analysis of pair-wise comparisons of Type I receptor
showed the statistically non-zero coefficient of functional divergence (0) (data not shown).
This analogous result to ligands and Type II receptor suggested that altered functional
constraint after gene duplication is a general pattern at different levels of signaling pathway.
56
ActR-ip
ALK7 AM
BMPR-IB
FUŒJ
99FSHEEP
QUAIL CHCKEN
ZEBRAFISH HUMAN
52= QUAIL CHCKEN
XLU16654.PE1 ZEBRAFISH
FLY Tkv
ALKl
ActR-1
BMPR-IA
ELECUWS YQD2 DAF1CAŒL
Fig.5 NJ tree of Type I TGF-P receptor.
4. Conclusions
In this paper, we have explored the functional divergence in TGF-P signaling. Our
findings are: (1) The altered functional constraint is a common pattern in both ligand and
receptor evolution; (2) Gene duplication might be an evolutionary force for functional
divergence, the early vertebrate stage might be an important time for the development of
tissue specificity and signaling pathways (3) There may be correlation between structural and
functional divergence.
Acknowledgement
This study was supported by NIH grant ROI GM62118 (to X.Gu).
References
1. J. Massague, TGF-beta signal transduction, Annu Rev Biochem. 67 (1998) 753-791.
57
2. S.J. Newfeld, R.G. Wisotzkey, S. Kumar, Molecular evolution of a developmental
pathway: phylogenetic analyses of transforming growth factor-beta family ligands, receptors
and Smad signal transducers. Genetics 152 (1999) 783—795.
3. W.-H. Li, Evolution of duplicate genes and pseudogenes, in: M. Nei and R.K. Keohn
(Ed.) Evolution of Genes and Proteins. Sinauer Associates, Sunderland, MA, 1983, pp. 14—
37.
4. K. Wolfe, Yesterday's polyploids and the mystery of diploidization, Nat Rev Genet. 2
(2001)333-341.
5. X. Gu, Statistical methods for testing functional divergence after gene duplication. Mol
Biol Evol. 16 (1999) 1664-1674.
6. X. Gu, Maximum likelihood approach for gene family evolution under functional
divergence. Mol Biol Evol. 18 (2001) 453—464.
7. Y. Wang, X. Gu, Functional divergence in the caspase gene family and altered functional
donstraints: statistical analysis and prediction,
Genetics 158 (2001) 1311-1320.
8. S. Kumar, S.B. Hedges, A molecular timescale for vertebrate evolution, Nature 392
(1998) 917-920.
9. R.M. Harland, The transforming growth factor beta family and induction of the vertebrate
mesoderm: bone morphogenetic proteins are ventral inducers, Proc Natl Acad Sci USA. 91
(1994) 10255-10259.
10. J.M. Wozney, et al. Novel regulators of bone formation: Molecular clones and activities,
Science 242 (1988) 1528 -1534.
11. N. Iwabc, K. Kuma, T. Miyata, Evolution of gene families and relationship with
organismal evolution: rapid divergence of tissue-specific genes in the early evolution of
chordates, Mol Biol Evol. 13 (1996) 483—493.
12. A.P. Hinck, et al. Transforming growth factor beta 1: three-dimensional structure in
solution and comparison with the X-ray structure of transforming growth factor beta 2,
Biochemistry 35 (1996) 8517—8534.
13. P R. Mittl, et al. The crystal structure of TGF-beta 3 and comparison to TGF-beta 2:
implications for receptor binding, Protein Sci. 5 (1996) 1261 — 1271.
58
14. J.C. Jennings, Comparison of the biological actions of TGF beta-1 and TGF beta-2:
differential activity in endothelial cells, J Cell Physiol. 137 (1988) 167—172.
15. E.A. Gaucher, M.M. Miyamoto, S.A. Benner, Function-structure analysis of proteins
using covarion-based evolutionary approaches: Elongation factors. Proc Natl Acad Sci. USA.
98 (2001) 548-552.
16. G.B. Golding, A.M. Dean, The structural basis of molecular adaptation. Mol Biol Evol.
15 (1998) 355 -369.
17. J. Greenwald, et al. Three-finger toxin fold for the extracellular ligand-binding domain
of the type II activin receptor serine kinase, Nat Struct Biol. 6 (1999) 18—22.
59
CHAPTER V: EVOLUTIONARY PATTERN OF PROTEIN KINASE EXPANSION
IN HUMAN GENOME
A paper to be submitted1
Jianying Gu XunGu^
Abstract
Kinases catalyze the reversible protein phosphorylation and play a central role in
fundamental biological functions and in advanced cell-cell co-regulation. A comprehensive
set of kinases (kinome), characterized by a conserved catalytic kinase domain, has recently
been identified in human genome through exhaustive bioinformatic search. We conducted
phylogenetic analysis on major kinase gene families. The age distribution of kinase gene
families showed that ancient continuous domain shufflings (or duplications) during the time
period from early stage of eukaryotes to metazoan evolution have built up the major kinase-
related animal specific signal transduction network. The following large scale gene
duplications in the early stage of vertebrate have contributed to the tissue-specific signal
transduction. Moreover, several kinase pseudogenes are likely to be generated through either
segmental duplication or retrotransposition very recently.
Introduction
Reversible protein phosphorylation plays a central role in regulating many functions
of eukaryotes such as cell cycle control, gene transcription and protein translation, apoptosis,
cell differentiation, cell-cell communication, and to mediate complex interactions with the
external environment. Moreover, aberrant protein phosphorylation is commonly related with
cancer and other human diseases. A comprehensive knowledge of the key actors that
regulate these functions can provide basis for better understanding of underlying mechanism
for cell life and novel drug discovery strategies.
1 manuscript to be submitted. 2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University. 3 Author for correspondence
60
The completion of human genome sequence provides us an opportunity to decipher
the molecular nature of signal transduction machinery in human. A total of 478 protein
kinases have been identified in human genome, constituting about 1.7% of all human genes
(Manning et al.. 2002). All of known bona fide protein kinases share a conserved catalytic
kinase domain of ~ 270 amino acids formed a superfamily - eukaryotic protein kinase (ePK).
Two distinct major families, protein tyrosine kinases (PTKs) and protein serine/threonine
kinases (PSKs), are defined based on comparisons of kinase domains. Another set of 40
atypical protein kinases (aPK) have also been identified in human genome, with
demonstrated kinase activity despite the lack of sequence similarity to ePK domain.
It has been widely accepted that the duplications of individual gene, chromosome
segment, or whole genome provide the primary material for origin of functional novelties
(Ohno, 1970). Recently, we have demonstrated that the age distribution of human gene
families can be described as an ancient component, a peak in the early stage of vertebrates
(wave II), and a recent peak during mammalian radiation (wave J) (Gu et al., 2002). We
have further speculated that the major signal transduction pathways emerged through those
duplication events. It is indeed intriguing to study the functional divergence and
evolutionary patterns of the entire set of kinase (kinome, defined by Manning et al. 2002).
Here we report the age distribution of human kinases and pseudogenes and show a complex
history of domain shufflings, gene duplications, and gene losses in kinome expansion.
Data and Methods
Dofo coZ/gcfion
In order to identify the homologues of 518 human protein kinases in vertebrates such
as other mammals, birds, frogs, and fishes, we searched HOVERGEN protein database
(Homologous Vertebrate Genes Database, http://pbil .uni v-1 von 1 .fr/databases /hoverpen html.
release 42 updated on 26 April, 2002) using BLASTP, followed by a pairwise BLAST to
removed the redundant sequences. Invertebrate homologous sequences were obtained from
http://www.kinase.com. and used as outgroup for each gene family. Domain organization of
each kinase gene family was revealed by searching against the Pfam domain database
(http://pfam.wustl.edu/).
61
The orthologous mouse DNA sequence for human kinase pseudogenes were obtained
by BLASTN searching against ENSEMBLE mouse genome (http://www.ensemble.org).
Three MARK pseudogenes (MARKpslO, 13, 20) were not analyzed due to short length (<
200 nucleotide acid). With a cutoff E value le-20, two pseudogene SAKps, SGK050ps has
NO mouse hits, which will be excluded from further analysis.
Multiple alignments and phylogenetic inference
Multiple sequence alignment of each gene family was obtained by Clustal X
(Thomson et al., 1997), followed by manual editing. Phylogenetic trees were inferred by the
Neighbor-joining method (Saitou and Nei, 1987) using MEGA2.0 (http://megasoftware.net).
Maximum Parsimony (as implemented in PAUP4.0), and Maximum Likelihood (as
implemented in PHYLIP) were used to examine whether the inferred phylogeny is sensitive
to any tree-making method. The bootstrap resampling with 1,000 pseudo-replicates was
carried out to assess support for each individual branch.
Zn/êrencg of dwpZzcafzo» even#
To identify the underlying evolutionary mechanism (i.e., domain or gene duplication)
of each duplication event, we have examined the domain structure of each human protein
kinase as follows. First, homologous kinase sequences in other vertebrate species (e.g., other
mammals, birds, frogs, and fish) were identified by blasting the 518 human kinase protein
sequences against the HOVERGEN protein database. A total of 221 HOVERGEN families
were retrieved and further analyzed. Second, since the HOVERGEN family is defined solely
based on the overall sequence similarity and may not reflect functionality, we adopted the
hierarchical kinome classification scheme proposed by Manning et al. (2002) which
incorporates sequence inference, structure comparison, and known biological functions.
Third, for multi-domain kinase genes, in addition to the kinase domain, additional domain
structure was revealed by using the HMM and profile-searching in the Pfam database
(http://pfam.wustl.edu/). Finally, combining the information from all the previous analysis,
we classified kinases with similar multi-domain structure into 256 gene families. Here, we
define the duplication event occurred within each gene family as gene duplication (Fig. I),
and that occurred among gene families as domain duplication (Fig. 2). The remaining single-
62
domain kinases (e.g., CDK family) that lack of hierarchical domain structure were analyzed
separately.
97
AKT1
JC
HUMAN
RAT
-MOUSE
L BOVIN
CHICKEN
FROG
r- HUMAN
MOUSE
RAT
CHICKEN
r MO
L—F
-ZEBRAFISH
HUMAN
MOUSE
— RAT
-FLY
HUMAN '
RAT
MOUSE
BOVIN
CHICKEN M--FROG
HUMAN
-MOUSE
RAT
CHICKEN
ZEBRAFISH .
HUMAN *
MOUSE
RAT
FLY
WORM-1
WORM-2
rt
0.05 I I I I I I I I I I I I II I I i I I i I I I
0.20 0.15 0.10 0.05 0.00
Fig. 1. Molecular Evolutionary analysis of the Rac heta serine/threonine protein kinase
(AKT) gene family. In vertebrate, there are three member genes: AKT 1, AKT2 and AKT3.
(a) The phylogenetic tree inferred by the neighbor-joining method with Poisson distance.
Bootstrapping values of more than 50% were presented. T, and T% are the time points of the
first and second gene duplications, respectively, (b) The linearized neighbor-joining tree
used in converting evolutionary distance to the (relative) molecular time scale. By the
nearest-neighbor clock, we estimated Tt = 478 Mya and T2 = 522 Mya.
63
100 IMERK
kinase FN Ill-like ig-like 81
-L
100 I UFO A X L
I TYR03 100
100 ROR1 kinase Kringle Ig-like
-L ROR
100 Cysteine rich
i ROR2 100
0.2
Fig. 2. Phylogenetic relationship and the domain structures of protein tyrosine kinases AXL
and ROR gene family.
A linearized neighbor-joining tree is efficient to convert the (average) Poisson
distance of protein sequences to the molecular time scale. Here primate-rodent (80 mya),
mammal-bird (310 mya), mammal-amphibian (350 mya), tetrapod-teleost (430 mya) and
vertebrate-Drosophila split (830 mya) were used as calibration in our study (Wang and Gu,
2001). Two steps were taken to adopt a nearest-neighbor clock (Gu et al., 2002). First,
under the tree, the phylogenetic interval of a duplicate event is determined by the nearest
speciation event(s). It can be two sides (before, after), e.g., (teleost-tetrapod, Drosophila-
64
vertebrate), or one side, e.g., (teleost-tetrapod, °°) or ( M , teleost-tetrapod) for before or after
the speciation, respectively, where M (the mammalian radiation) is for the lowest side. Note
that the phylogenetic interval is robust against the differential selective constrains in genes
and lineages. Second, the age of duplication event is estimated based on the nearest
speciation event(s). In practice, if the "before"-side is used, an average over two lineages is
taken when they have the same nearest speciation. If the phylogenetic interval is two-sided,
one may use two points for dating (Fig. 1).
To date the ancient duplication events occurred in early Metazoan evolution, a
hierarchical nearest neighbor clock is applied. The dates of the nearest duplication events
inferred by the previous nearest-neighbor clock or nearest speciation points (chosen based on
its applicability) were used to infer the ancient time point of duplication events.
Analysis of kinase pseudogenes
There are two major types of pseudogenes: processed pseudogene and duplicated
pseudogenes. Processed pseudogenes result from retrotransposition, that is, reverse-
transcription of mRNA transcript followed by integration into genomic DNA. Duplicated
pseudogenes arise from duplication of genomic DNA, after which a duplicate gene copy
became a pseudogene. Neighbor-joining tree was inferred from the multiple alignment of
DNA sequences of kinase pseudogene, its parent human kinase gene, homologous mouse
gene, and homo logs in other vertebrate species (Fig. 3). The branch length in NJ tree
represents the number of nucleotide substitution per site (under Kimura 2 parameter model)
along that lineage. Due to relaxation of selection constraint, fast evolution occurred along
the pseudogene lineage. Thus we estimate the time of the events to generate pseudogene
assuming a local molecular clock between the functional human/mouse kinase genes.
Results
Protein kinases: An evolutionary overview
We have reconstructed phylogenetic trees for 518 human protein kinases and 106
human pseudogenes, which include their homologous genes in vertebrate (mammals, birds,
frogs, and fishes), invertebrate (largely fruitfly and worm), and other eukaryotes. Our
phylogenetic analysis is generally consistent with the human kinome classification (Manning
65
T = 80 Mya
0.0613
\ 100
8 2
0 0 7
0 . 0 3 9 0
0.0280
• P A K 2
0 . 0 3 7 2
0 . 0 0 9 1
- R A B B I T
100 0.0310
• M O U S E 0 . 0 2 2 5
R A T 0 . 0 2 5 7
0.1354
- P A K 2 p s
• F R O G
0.02
Fig. 3. Molecular phylogeny of the P21-activated kinase (PAKA) gene and pseudogene.
The branch length represents the number of nucleotide substitution per site under a Kimura
two-parameter model. The human-mouse split T = 80 Mya is used to estimate T, to be 18
Mya, the time point to generate PAK pseudogene.
et al., 2002). We have confirmed that most protein kinase families had existed before the
origin of vertebrate and most subfamilies have emerged in the early stage of vertebrates
(Manning et al., 2002b; Miyata and Suga, 2001). Protein tyrosine kinase (PTKs), one major
animal-specific subclass that includes 29 families, is likely monophyletic, derived from a
precursor of protein serine-threonine kinase (PSK) during the course of animal evolution,
while the major types of PSKs had existed in the early stage of eukaryotes (Hanks et al.
1988; Hanks and Hunter, 1995; Hunter et al. 1995).
Comparison of the multiple domain structure of protein kinases showed that the
kinase domain is the only conserved region among most protein kinase families, indicating a
reminiscent important role of domain shuffling/duplication on the emergence of major signal
transduction pathways featured by different protein kinases. In contrast, the multiple domain
structure of most kinase families remains conserved among member genes, suggesting the
impact of gene or genome duplications on protein kinase expanding toward
development/tissue specificity. Moreover, only a very few (~2) human protein kinases are
66
recently duplicated (after human/mouse split), in spite of recently finding that a significant
portion of human genome has been duplicated within 40 million years (Lander et al., 2001).
To further characterize the relationship between underlying evolutionary
mechanisms, evolutionary time periods, and the evolution of signal-transduction pathway, we
have developed a robust procedure to infer the events of domain shuffling, gene duplication,
or pseudogene duplication (see Methods). The 524 duplication events identified all occurred
in the ancestral lineage of humans. Among them, 95 are pseudogene duplication/
retrotransposition events, 222 are gene duplications, and 187 are domain duplication events.
We are not able to classify the rest 20 duplication events that occurred among single-domain
kinases as domain or gene duplications.
Remarkably, the age distribution of protein kinases and pseudogenes has recaptured
the unique evolutionary features of human gene families as first described by Gu et al.
(2002). Clearly, the age distribution of human kinome shows a pattern characterized by three
modules (I, II, and 111) (Fig. 4). Module I coincides with the processed pseudogene
explosion (Zhang et al., 2002) or abundant recent tandem or segmental duplications (Eilcher,
2001). Module II indicates kinase gene family expansion in the early stage of vertebrate
evolution by genome (large-scale) duplication (Gu et al., 2002; McLysaght et al., 2002). The
ancient module (III) includes duplication events that took place during and before metazoan
evolution.
Ancient components: domain shuffling/duplication
Domain shuffling (or duplication), which can occur by transposition of gene
fragments or nonhomologous recombination, is thought to have been a major mechanism in
the evolution of new multiple-domain proteins (Doolittle, 1995). Like many other molecules
involved in signal-transduction, most protein kinases are composed of multiple domains with
discrete structural and specific functions. Separate phylogenetic analysis of protein kinase
domain and their associated non-catalytic domains (SH2, SH3, PHD, etc.) has confirmed that
the kinase superfamily was indeed generated by shuffling of domains from a few more
ancient kinases (Krupa and Srinivasan, 2002).
It is known that time estimation of ancient evolutionary events (either duplication or
speciation) may have broad sampling variance or bias (Gu, 1998). Nevertheless, the age
67
B. a
S ' 3
III
<1.0 1.0-1.5 1.5-2.0 2.0-2.6 2.5-3.0 3.0-3.5 3.5-4.0 4.0-4.6 4.5-5.0 >5.0
molecular time scale (Byr ago)
«p ^
molecular time scale (Myr ago)
Fig. 4. Age distribution of human kinases. The molecular time scale is measured as Myr ago
(Mya). Each bin for the histogram is 50 Myr, except for the most ancient one (>1,500 Myr
ago). 1 Byr, 1 billion years ( 1,000 Myr).
68
distribution of domain shuffling events in protein kinases shows that the majority of domain
duplication events happened during very ancient time period. As shown in Fig. 4, no domain
shuffling event is found after 550 mya, and the majority had occurred during the time period
from early stage of eukaryotes to metazoan evolution. Conserved motifs shared with
eukaryotes protein kinases were identified in bacteria, as well as in archaea, which inferred
the existence of an ancestral protein kinase prior to the divergence of eukaryotes, bacteria
and archaea (Leonard et al, 2002). However, the biological reason why domain shuffling
virtually ceased after the origin of vertebrates remains unclear.
On the basis of phylogenetic analysis of eight animal-specific gene families including
sponge protein tyrosine kinases, Suga et al. (2001) suggested that most domain shuffling
events were going back to dates before the parazoan-sumetazoan split (~ 940 mya), the
earliest divergence among extant animal phyla. Our comparative analysis, however, does not
support this notion. Rather, we provide evidence (Fig. 4) that during the time period from
early stage of eukaryotes to metazoan evolution, roughly corresponding to 2,000 mya (2.0
bya, billion yeas ago) to 700 mya, domain shuffling in protein kinases had occurred in a
fashion of continuous flux, with an approximate constant rate of 0.7x10 * event per year per
genome. Before this time period (very early eukaryote stage, or prior to eukaryote-archaea
split, > 2.0 bya), the rate of domain shuffling was down to -50%, while after this time period
(vertebrates), the rate is virtually zero.
Large scale duplication of human protein kinases
It has been proposed long time ago that genome duplications play important role in
building up the complexity of human genome (Ohno, 1970). Our recent analysis of age
distribution of human gene families has confirmed a rapid explosion of genes in the early
stage of vertebrate (Gu et al., 2002), resulting in tissue-specific (developmental stage-
specific) isoforms. Molecular phylogeny-based analysis of gene families involved in cell-
cell communication and developmental control (e.g., protein tyrosine kinases, phosphatases,
etc.) has shown that extensive gene duplications occurred in a period around or immediately
before the divergence of tetrapods and fishes (Miyata and Suga, 2001). The recent
completion of human kinome (Manning et al., 2002) provides us an opportunity to study the
69
evolution pattern of the complete set of human kinases, a better understanding for the whole
set of human genes.
Based on sequence similarity (kinase domain similarity), domain organization and
known biological functions, we identified 81 gene families with two human kinase member
genes, 34 gene families with three human kinase member genes, and 14 gene families with
four human kinase member genes. Our classification generated slightly different kinase gene
families compared to the kinome classification scheme. No evidence supporting the 1:4 rule
(i.e., there is one gene in Drosophila but four homologous genes in human) exists in the
human kinome evolution. Moreover, we also examined the phylogeny of gene families with
four member genes. Two types of topologies can be observed ((A, B) (C, D)) or (((A, B), C),
D). There is no evidence that the topology of ((A, B), (C, D)) as expected by 2R genome
duplication hypothesis occurred more frequently than the other topology.
Among 222 estimated kinase gene duplication events occurred within the gene
families. 135 (~ 61%) occurred in the time period before the emergence of teleosts but after
the vertebrate-amphioxus split (430-750 Mya) (module II in Figure 4). This rapid increase in
the number of paralogous human kinases observed in the early stage of vertebrate evolution
is as expected from the genome duplication hypothesis. Human tissue specific isoforms of
protein kinases are involved extensively in signal transduction in eukaryotic cells, as well as
in controlling many other cellular processes. Thus genome duplication may have major
contribution to the tissue- (or developmental stage-) specific kinase isoforms after the split
between vertebrate and anthropods.
Evolution of human kinase pseudogenes
For 106 human kinase pseudogenes identified by Manning et al. (2002), 7
pseudogenes are excluded from further analysis due to very short sequences. Phylogenetic
analysis shows that most human kinase pseudogenes (75/99) are generated after the
mammalian radiation. Because of the apparent fast evolution caused by the loss of functional
constraint in the pseudogene lineage, we use some special treatments to date the duplication
events which gave rise to the current pair of functional (parent) kinase gene and pseudogene
(see Methods).
70
Generally, pseudogenes are believed to be generated through two different
mechanisms: via retrotransposition resulting in an intronless processed pseudogene or via
duplication of genomic DNA resulting in duplicated pseudogene. In 106 human kinase
pseudogenes, 75 lack introns such that they were likely to be generated by retrotransposition
of a processed transcript. It has been shown that a little more than half or even a much higher
eighty percentage of pseudogenes are processed (Harrison et al., 2002; Migghell et al., 2000),
so it is not surprising that the majority of human kinase pseudogenes are processed. It has
been proposed that processed pseudogenes, as well as both the Alu repeats (a major class of
SINEs, short interspersed elements) and LINE repeats (long interspersed elements), arose by
the protein machinery encoded by LINE! elements (Jurka, 1997; Esnault et al., 2000). The
age distribution of human kinase intronless pseudogenes showed that more than half of
intronless pseudogenes were generated very recently (< 30 mya) in a fairly constant rate (Fig.
5), while a considerable proportion of MARK intronless pseudogenes were generated after
human-mouse split.
20
15
10
" 5
I
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 > 80
molecular time scale (Myr ago)
Fig. 5. Age distribution of human processed pseudogenes.
71
Twenty-nine pseudogenes were identified that had no introns. They are more likely
to be generated through gene/genome duplication events, and most of them occurred on the
anthropoid lineage. Recent analysis of human genome sequence has revealed the fact that
about 5% of the human genome consists of interspersed segmental duplications that have
arisen over the past 35 million years (Eichler 2001, etc.). The age distribution of kinase
pseudogenes obviously resembles the pattern of recent segmental/tandem duplication (data
not shown). At this rate, we estimated that 478 x 5% ~ 24 kinase pseudogenes were
generated through segmental/small-scale duplications. According to these observations, we
propose a scenario for recent human protein kinase evolution: within 35 million years,
segmental duplications (small-scale mode, Gu et al. 2002) provided a continuous flux for
generating duplicate kinase genes, which evolved into protein kinase pseudogenes. For
example, two p70 ribosomal protein S6 kinase (p70S6K) pseudogenes appears to be part of a
large duplicon containing multiple duplicated genes, suggesting an intrachromosomal
duplication of the p70S6K locus (Manning et al, 2002). Apparently, these duplicates finally
become nonfunctional (pseudogene) because there seems no need for new protein kinases. In
this regards, one may conclude that the protein kinase-related signal transduction pathway
evolves slowly after the mammalian radiation, while regulatory motif changes may become
the major force for functional/expressional divergence. This hypothesis should be tested by
further multi-genome analysis of protein kinases.
In 106 human kinase pseudogenes, 27 are found to have matching ESTs. The age
distribution of pseudogene duplications generated these "expressed" pseudogene appears
random and uniform, indicating no obvious correlation between the age of pseudogene
duplication and gene transcriptional regulation change. There are many possible
explanations: (/) the recent produced pseudogene is still under the dynamic processing, and
has not lost all its transcriptional capability; (ii) the mutations accumulated that resulted in
functional loss were likely to happened in the coding region, but not in transcription
regulatory region; (iii) misidentified ESTs for kinase pseudogene.
Fates of duplicate genes and the gene-loss hypothesis
Gene duplication has generally been viewed as a necessary source for the origin of
novel functionality. In addition to ignite gene function innovation, gene duplication may also
72
provide a redundant copy such that it will lose the original function. Because the majority of
mutations are deleterious and gene duplications are generally assumed to be functionally
redundant at the time of origin, the usual fate of a duplicate-gene pair is the
nonfunctionalization of one copy (Lynch and Conery, 2000). Recently, a less-is-morc
hypothesis has been proposed to emphasize the importance of loss-of-function mutations on
the recently evolved novel lineage such as the human (Olson and Varti, 2003). In several
cases, genetic loss in human lineage has been shown to cause some differences between
human and chimpanzee.
As reported in Manning et al. (2002), three pseudogenes have no obvious parent
kinase in human but have functional orthologs in rodents. Our re-examine showed that
parent kinase of polo-like kinase SGF384ps has been identified (EMBL Accession:
AK054808) in human. But for KSGC (kinase-like domain containing soluble guanylyl
cyclase) and CGD, no functional human orthologous kinase has been identified (Figure 6).
Phylogenetic analysis shows that they both belong to the retinal guanylyl cyclase family
(GTP pyrophosphate-lyase). These two pseudogenes both contain introns and conserved
canonical splice sites. Moreover, matching ESTs have been found. We inferred these two
genes lost their functions recently such that not too many mutations have been accumulated.
In rat, CGD is specifically expressed in a small, randomly dispersed subpopulation of rat
olfactory sensory neurons and plays a role in interacting with a restricted group of odors
since its expression pattern resembles the pattern of expression of the diverse seven trans-
membrane-domain odorant receptors (Fulle et al., 1995; Julifs et al., 1997). KSGC found in
rat kidney cells appear to contain the kinase-like, dimerization and catalytic domains but
containing no ligand-binding or transmembrane domains, indicating that KSGC is a
cytoplasm!cally localized GC (Kojima et al., 1995). The specified function and unique
domain structure of CGD and KSGC suggested that the decay of previous functional genes in
the human lineage after its split from rodents. This hypothesis could be further tested when
genome sequences from other species such as chimpanzee are available.
73
80
100 ,
- H U M A N
D O G
B O V I N E
| R A T
1 0 0 C O U S E
C H I C K E N
G U C Y 2 E
- C G D p s
100 10 0
100
M O U S E
- R A T
- H U M A N
Q U C Y 2 D ( C G D )
c • B O V I N E
R A T
G U C Y 2 F
7 3 9 7
• M E D A K A F I S H O L G C - 5
M E D A K A F I S H O L G C - 3
- M E D A K A F I S H O L G C - 5
5 7
9 5
100 7 9
— H U M A N
G U I N E A P I G
• B O V I N E
- P I G
— R A T
• F R O G
Q U C Y 2 C
9 9
100
• M E D A K A F I S H O L G C - 6
• E E L
60 9 9
H U M A N
| RAT I M O U S E
F R O G
B U L L F R O G
A N P A
9 9
10 0 7 1
100 100
- H U M A N
- B O V I N E
— R A T
- E E L
D O G F I S H
A N P W
• F R O G
• M E D A K A F I S H O L G C - 2
• E E L
• M E D A K A F I S H O L G C - 7
K S G C p s
100
M O U S E
- R A T
K S G C
0.1
Fig. 6. Molecular phylogenetic analysis of retinal guanylyl cyclase. Two human kinase
pseudogenes CGDps and KSGCps are indicated.
74
Discussion
The recently catalogued complete set of human protein kinases (kinome) allows us to
test an interesting hypothesis on the evolution of protein kinases at the genome level: (i) The
major kinase-related animal specific signal-transduction pathways have been generated
through domain shuffling during the time period from the early stage of eukaryotes to the
course of metazoan evolution; (ii) vertebrate tissue-specificity of signal-transduction is
f a c i l i t a t e d b y l a r g e - s c a l e d u p l i c a t i o n e v e n t ( s ) i n t h e e a r l y s t a g e o f v e r t e b r a t e s , a n d ( i i i )
continuous small-scale duplication events and retrotranspositions occurred recently resulted
in the nonfunctional kinase pseudogenes.
Unlike the continuous mode of retrotransposition of kinase processed pseudogenes,
the distribution of ribosomal protein (RP) pseudogenes shows a peak at an evolutionary age
corresponding to 8% - 10% sequence divergence, whereas Alus peaks at 7% and LINE1 peak
at both 4% and 21% (Zhang et al., 2002). Considering the fast evolution rate of pseudogene
(3.9 x 10 * substitution per year) and the relative slow evolution rate of functional gene (1.5 x
10"9 substitution per year) (Li, 1997), one percentage of divergence between pseudogene and
functional gene was converted into approximately 2.1 Myr. Thus the peak of generation of
processed RP pseudogenes appeared at 15-20 Mya and the rate of new processed RP
pseudogenes generated has slow down after that. The overall activity of transposons has
declined markedly over the past 35-50 Myr, with the possible exception of LINE1 (Nature
human genome, 2001). Interestingly, both LINEs and processed pseudogenes are more
prevalent in relatively GC-poor regions, while Alus are predominantly found in GC-rich
regions (Pavlicek et al., 2001 ; Zhang et al., 2002). Noticeable, processed pseudogenes arc
not expressed, however in very rare case, transcripts of some pseudogene have been reported
though the functional relevance of these pseudogene transcripts remains unclear. We
observed seventeen intronless pseudogenes having matching EST, one possible explanation
of this unusual high rate of transcriptional processed pseudogenes is the high sequence
similarity between processed pseudogene and ESTs (both contain no intron).
Acknowledgement: We are grateful to Yufeng Wang for constructive comments. This study
was supported by NIH grant ROI GM62118 (to X. G).
75
References
Bailey, J. A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M. D.,
Myers, E.W., Li, P.W. & Eichier, E. E. (2002) Recent segmental duplications in the human
genome. Science 297: 1003-1007.
Doolittle, R.F. (1995) The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287-
314.
Eichier, E. E. (2001 ) Recent duplication, domain accretion and the dynamic mutation of the
human genome. Trends in Genetics 17: 661-669.
Esnault, CD., Maestre, J. & Heidmann, T. (2000) Human LINE retrotransposons generate
processed pseudogenes. Nat. Genet. 24: 363-367.
Fulle, Vassar, R., Foster, D. C., Yang, R., Axel, R. & Garbers, D. L. (1995) A receptor
guanylyl cyclase expressed specifically in olfactory sensory neurons. Proc. Natl. Acad. Sci.
USA 92: 3571-3575.
Gu, X., Wang, Y. & Gu, J. (2002) Age distribution of human gene families shows significant
roles of both large- and small-scale duplications in vertebrate evolution. Nature Genetics 31:
205-209.
Hanks, S. K., Quinn, A. M. & Hunter, T. (1988) The protein kinase family: conserved
features and deduced phytogeny of the catalytic domains. Science 241: 42-52.
Hanks, S. K. & Hunter, T. (1995) The eukaryotic protein kinase superfamily: kinase
(catalytic) domain structure and classification. FASEB J. 9: 576-596.
Harrison, P. M., Hegyi, H, Balasubramanian, S., Luscombe, N. M., Bertone, P., Echols, N„
Johnson, T. & Gerstein, M. (2002) Molecular fossils in the human genome: identification and
analysis of the pseudogenes in chromosomes 21 and 22. Genome Research 12:272-280.
Hunter, T. (1995) Protein kinases and phosphatases: the yin and yang of protein
phosphorylation and signaling. Cell 80: 225-236.
Juiifs, D. M., Fulle, H.-J., Zhao, A., Houslay, M. D., Garbers, D. L. & Beavo, J. A. (1997) A
subset of olfactory neurons that selectively express cGMP-stimulated phosphodiesterase
(PDE2) and guanylyl cyclase-D define a unique olfactory signal transduction pathway. Proc.
Natl. Acad. Sci. USA 94: 3388-3395.
Jurka, J. (1997) Sequence patterns indicate an enzymatic involvement in integration of
mammalian retroposons. Proc. Natl. Acad. Sci. 94: 1872-1877.
76
Kojima, M., Hisaki, K., Matsuo, H. & Kangawa, K. (1995) A new type soluble guanylyl
cyclase, which contains a kinase-like domain: its structure and expression. Biochem.
Biophys. Res. Commun. 217: 993-1000.
Krupa, A. & Srinivasan, N. (2002) The repertoire of protein kinases encoded in the draft
version of the human genome: atypical variations and uncommon domain combinations.
Genome Biol. 3(12): Research0066.1-0066.14.
Krylov, D. M., Nasmyth, K. & Koonin, E. V. (2003) Evolution of Eukaryotic Cell Cycle
Regulation: Stepwise Addition of Regulatory Kinases and Late Advent of the CDKs. Current
Biology 13: 173-177.
Lander, E. S. et al. (2001) The international human genome sequencing consortium. Initial
sequencing and analysis of the human genome. Nature 409: 869-921.
Leonard, C. J., Aravind, L. & Koonin, E. V. (1998) Novel families of putative protein
kinases in bacteria and archaea: evolution of the "eukaryotic" protein kinase superfamily.
Genome Research 8: 1038-1047.
Li, W. H. (1997) Molecular Evolution, Sinauer Associates, Inc., Sunderland, MA.
Lynch, M. & Conery, J. S. (2000) The evolutionary fate and consequences of duplicate
genes. Science 290: 1151-1155.
Manning, G., Whyte, D. B., Martinez, R., Hunter, T. & Sudarsanam, S. (2002) The protein
kinase complement of the human genome. Science 298: 1912-1934.
Manning, G., Plowman, G. D., Hunter, T. & Sudarsanam, S. (2002) Evolution of protein
kinase signaling from yeast to man. Trends Biochem Sci. 27: 514-520.
McLysaght, A., Hokamp, K. & Wolfe, K. H. (2002) Extensive genomic duplication during
early chordate evolution. Nature Genetics 31: 200-204.
Miyata, T. & Suga, H. (2001) Divergence pattern of animal gene families and relationship
with the Cambrian explosion. BioEssays 23: 1018-1027.
Ohno, S. (1970) Evolution by gene duplication. New York: Springer Verlag.
Olson, M. V. & Varki, A. (2003) Sequencing the chimpanzee genome: insights into human
evolution and disease. Nat Rev Genet. 4: 20-28.
Pavlicek, A., Jabbari, K., Paces, J., Paces, V., Hejnar, J. & Bernard), G. (2001) Similar
integration but different stability of Alus and LINEs in the human genome. Gene 276: 39-45.
77
Samonte, R. V. & Eichier, E. E. (2002) Segmental duplications and the evolution of the
primate genome. Nat Rev Genet. 3: 65-72.
S mit, A. F. (1999) Interspersed repeats and other mementos of transposable elements in
mammalian genomes. Curr Opin Genet Dev. 9: 657-663.
Suga, H., Katoh, K. & Miyata, T. (2001) Sponge homologs of vertebrate protein tyrosine
kinases and frequent domain shufflings in the early evolution of animals before the parazoan-
eumetazoan split. Gene 280: 195-201.
Zhang, Z., Harrison, P. & Gerstein, M. (2002) Identification and analysis of over 2000
ribosomal protein pseudogenes in the human genome. Genome Research 12: 1466-1482.
78
CHAPTER VI: GENERAL CONCLUSION
This study is focused on the exploitation of the functional divergence and
evolutionary patterns in vertebrate kinomes and kinase-regulated signaling transduction,
using a combinatorial statistical and evolutionary approach. After a general introduction in
Chapter I, the major findings are reported in Chapter II to Chapter V.
Summary of Findings
In the first study (Chapter II), we investigated the natural history and evolutionary
patterns in the protein tyrosine kinase (PTK) supergene family. PTK is the key mediator in
cellular signaling in metazoans, and is directly associated with a variety of human diseases.
Our analysis showed that site-specific shifted evolutionary rate (altered functional
constraint) is a common pattern in PTK gene family evolution. Our results are summarized
as follows: (1) Both large-scale and small-scale gene duplications are the major evolutionary
force for generating the contemporary PTK superfamily. (2) Substantial functional
divergence occurred after gene duplication(s), characterized by the significant shift in
evolutionary rates. (3) Evolutionary functional divergence is correlated with the phenotypic
functional divergence in paralogous genes. These results not only shed light on the role of
gene duplication in the development of hierarchical PTK-mediated networks, but also
provide impetus for a new approach to predict functionary divergence from evolutionary
changes.
In the research presented in Chapter III, we conducted a statistical and phylogenetic
analysis of the Jak (Janus kinase) gene family. In addition to a typical kinase domain (JH1),
the Jak family possesses a unique noncatalytic pseudokinase domain (JH2). This dual kinase
domain structure provides an ideal model for testing the impact of the (internal) domain
duplication on functional divergence. The impact of gene duplication, which may have given
rise to four tissue-specific member genes, Jakl, Jak2, Jak3 and Tyk2 in the early stages of
vertebrates, is of particular interest in this study as well. Our result shows that both domain
and gene duplications are important for Jak gene family proliferation and evolution: (1)
Shifted selective constraints (or shifted evolutionary rates) are statistically significant
between JH1 and JH2. (2) Predicted amino acid sites responsible for difference between JH1
79
and JH2 by posterior analysis can be classified into two groups: very conserved in JH1 but
highly variable in JH2, and vice versa. JH2 domain appears to have acquired some new
functions that may account for its long-term existence during evolution, since a group of sites
conserved in the JH2 domain are variable in the JH1 domain. (3) In the early vertebrate
lineage, two rounds of gene duplications have generated four tissue-specific vertebrate
isoforms: after the (first) gene duplication, site-specific rate shifts between Jak2/Jak3and
Jakl/Tyk are significant. This divergence took place mainly in the JH1, but not in the JH2
domain, indicating that the JH1 domain probably plays a relatively important role in
functional specification among tissue-specific isoforms.
In Chapter IV, we have explored the functional divergence in major pathway
components of kinase-mediated TGF-|3 signaling pathways from an evolutionary perspective.
Members in gene families involved in this complicated pathway are usually conserved in
sequence but distinct in temporal and tissue-specific expression. Our major findings include:
(1) The altered functional constraint is a common pattern in both ligand and receptor
evolution in TGF-(3 signaling; (2) Gene duplication might be an evolutionary force for
functional divergence; (3) The early vertebrate stage might be an important time span for the
development of tissue specificity and signaling pathways; and (4) The correlation between
structural and functional divergence seems to be evident.
After the analysis of an individual kinase gene family (Jak), protein tyrosine kinase
superfamily, and a kinase mediated signaling transduction pathway (TGF-(3), we have
attempted to globally explore the functional di vergence in the whole set of vertebrate kinases
(kinome) (Chapter V). The age distribution of kinase gene families showed that (I) The
major kinase-related animal specific si gna I -transduction pathways have been generated
through an ancient continuous domain shufflings (or duplications) during the time period
from early stage of eukaryotes to metazoan evolution; (2) Vertebrate tissue-specificity of
signal-transduction is facilitated by large-scale duplication event(s) in the early stage of
vertebrates; and (3) The kinase pseudogenes are generated through either segmental
duplication or retrotransposition very recently. These findings are essentially congruent with
our earlier report that both large scale and small scale gene duplications have significant
contribution during vertebrate evolution.
80
Ongoing and Future work
My long-term research goal is to integrate multiplayer genome information to study
evolutionary/comparative genomics and computational biology. The accomplished
researches reported here represent my effort to systematically study the evolutionary
mechanisms in complex biosystems. using kinase (kinome) as a paradigm. In the short term,
I hope to develop research programs to:
• Exploit the temporal-spatial expression profile of kinome, with a focus on the
statistical tests on publicly available microarray data (Gu and Gu 2003).
• Further investigate the kinome proliferation on functional innovations from tissue-
specificity to molecular pathways, and to reveal the joint distribution of age and
chromosome location of duplication genes.
• Extend the study of functional divergence to other important families in kinase
signaling network. The target families include transcription factors and G-protein
coupled receptors.
81
ACKNOWLEDGEMENTS
A journey is easier when you travel together. Interdependence is certainly more
valuable than independence. This thesis is the result of fives years of work whereby 1 have
been accompanied and supported by many people. I am glad to have the opportunity to
express my gratitude for all of them.
I am deeply indebted to my major advisor. Dr. Xun Gu, whose insight, stimulating
suggestions and encouragement helped me in all the time of research of this thesis. His
enthusiasm and integral view on science has made a deep impression on me. I would also
like to thank other committee members Dr. Dan Nettleton, Dr. Linda Ambrosio, Dr. Gavin
Nay I or and Dr. Daniel Voytas for their support during my Ph.D., especially to Dr. Dan
Nettleton for his guidance and advice during my study for M.S in statistics.
My research work is also strongly supported by the members of Computational
Genomics and Evolution Lab. I want to thank them for all their help, support, interest and
valuable hints. Especially I am obliged to Dr.Yufeng Wang. Our discussions of coursework,
research and life-in-general were always a welcome addition.
I would like to give my special thanks to my family. They have contributed so much
to my life that I could never acknowledge it all in this limited space.
82
VITA
NAME OF AUTHOR: Jianying Gu
DEGREES AWARDED: B.S. in Microbiology, Fudan University, 1995 M.S. in Molecular Genetics, Shanghai Institute of Plant Physiology, 1998 M.S. in Statistics, Iowa State University, 2002
HORNORS AND AWARDS: James Cornette Research Excellence Award in Bioinformatics, Iowa State University, 2002
PROFESSIONAL PUBLICATIONS: Gu J, Gu X (2003) Natural history and functional divergence in protein tyrosine kinases.
Gene. In press. Gu J, Gu X (2003) Induced gene expression in human brain after the split from
chimpanzee. Trends in Generics. 19(2): 63-65. Gu J, Wang Y, Gu X (2002) Evolutionary perspectives for functional divergence of Jak
protein kinase domains and tissue-specific genes. J. Mol. Evol. 54: 725-733. Gu J, Gu X (2002) Evolutionary analysis of functional divergence in TGF-J3 signaling
pathway. Information Sciences. 145 (3-4): 195-204. Gu J, Gu X (2002) Functional divergence in TGF-(3 signaling pathway. In: Proceedings
of the 6th Joint Conference on Information Science. Pp. 1268-1273. Gu X, Wang Y, Gu J (2002) Age-distribution of human gene families showing significant
roles of both large and small-scale duplications in vertebrate evolution. Nature genetics. 31: 205-209.
Gu X, Wang Y, Gu J, Vander Velden K (2002) Evolutionary perspective for functional divergence of gene family and applications in functional genomics. Current Genomics. 3:201-211.
Gu J, Wang Y, Gu X (2001) Statistical testing for protein functional divergence after major evolutionary events. In: Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology. Pp. 176-180
Wang Y, Gu J, Gu X (2001) Patterns of functional divergence after gene duplications in vertebrate gene families: altered functional constraints. In: Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology. Pp. 181-185
Gu J, Yu G, Zhu J, Shen S (2000) The N-terminal domain of NifA determines the temperature sensitivity of NifA in Klebsiella pneumoniae and Enterobacter cloacae. Science m CMna (Series C) 43(1):8-15.