Functional divergence and genome evolution of vertebrate protein kinases (kinome)

Iowa State University
Digital Repository @ Iowa State University

Retrospective Theses and Dissertations

2003

Functional divergence and genome evolution of vertebrate protein kinases (kinome)

Jianying Gu
Iowa State University

Follow this and additional works at: http://lib.dr.iastate.edu/rtd

Part of the Computational Biology Commons, Genetics Commons, and the Molecular Biology Commons

This Dissertation is brought to you for free and open access by Digital Repository @ Iowa State University. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Digital Repository @ Iowa State University. For more information, please contact [email protected].

Recommended Citation
Gu, Jianying, "Functional divergence and genome evolution of vertebrate protein kinases (kinome)" (2003). Retrospective Theses and Dissertations. Paper 1434.

Functional divergence and genome evolution of vertebrate

protein kinases (kinome)

by

Jianying Gu

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirement for the degree of

DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee:
Xun Gu, Co-major Professor
Dan Nettleton, Co-major Professor
Linda Ambrosio
Gavin Naylor
Daniel Voytas

Iowa State University

Ames, Iowa

2003

Copyright © Jianying Gu, 2003. All rights reserved.

UMI Number: 3105077

UMI Microform 3105077

Copyright 2003 by ProQuest Information and Learning Company.

All rights reserved. This microform edition is protected against

unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company 300 North Zeeb Road

P.O. Box 1346 Ann Arbor, MI 48106-1346

Graduate College

Iowa State University

This is to certify that the doctoral dissertation of

Jianying Gu

has met the dissertation requirements of Iowa State University

Signature was redacted for privacy.
Co-major Professor

Signature was redacted for privacy.
Co-major Professor

Signature was redacted for privacy.
For the Major Program

TABLE OF CONTENTS

CHAPTER I: GENERAL INTRODUCTION
    Introduction
    Dissertation Organization
    References

CHAPTER II: NATURAL HISTORY AND FUNCTIONAL DIVERGENCE IN PROTEIN TYROSINE KINASES
    Abstract
    Introduction
    Methods
    Results and Discussion
    Conclusions
    Acknowledgement
    References

CHAPTER III: EVOLUTIONARY PERSPECTIVES FOR FUNCTIONAL DIVERGENCE OF JAK PROTEIN KINASE DOMAINS AND TISSUE-SPECIFIC GENES
    Abstract
    Introduction
    Methods
    Results
    Discussion
    Acknowledgements
    References

CHAPTER IV: EVOLUTIONARY ANALYSIS OF FUNCTIONAL DIVERGENCE IN TGF-β SIGNALING PATHWAY
    Abstract
    Introduction
    Methods
    Results and Discussion
    Conclusions
    Acknowledgement
    References

CHAPTER V: EVOLUTIONARY PATTERN OF PROTEIN KINASE EXPANSION IN HUMAN GENOME
    Abstract
    Introduction
    Data and Methods
    Results
    Discussion
    Acknowledgements
    References

CHAPTER VI: GENERAL CONCLUSIONS
    Summary of Findings
    Ongoing and Future Work

ACKNOWLEDGEMENTS

VITA

CHAPTER I: GENERAL INTRODUCTION

Introduction

Over the past two decades, advances in biology have accelerated rapidly, yielding an enormous wealth of genomic (the quantitative study of genes, regulatory and noncoding sequences), transcriptomic (RNA and gene expression), and proteomic (protein expression) information. Novel ways of using this information are emerging, with the hope that many unsolved problems in biology and medicine can be explored more fully. An integrative approach that combines computational, quantitative, and experimental analysis brings new knowledge, novel methods, and innovative technologies to bear on complex biological systems and processes. Many of these post-genomic efforts center on elucidating the function of gene products and functional interaction networks by unveiling the evolutionary forces that have shaped network organization.

Gene/domain duplication, functional divergence, site-specific rate shift, and age distribution of vertebrate gene families

Gene families are a central component of vertebrate cellular and developmental networks: groups of genes with common ancestry that display sequence similarity but exert divergent functions (Lundin 1993; Hughes 1994; Spring 1997; Lynch and Conery 2000). It has long been thought that duplication of individual genes, chromosomal segments, or entire genomes serves as a primary source of functional novelty in a vast number of gene families such as HOX (Ohno 1970; Sidow 1996; Li et al. 2001). Domain duplication/shuffling has also been postulated to play an additional role in increasing functional complexity in multi-domain gene-family networks (Doolittle 1995; Henikoff et al. 1997). Although still under debate, the functional divergence observed in gene families depends largely on the diverging fates of duplicated genes: (1) non-functionalization, in which one copy retains the original function while the other accumulates deleterious mutations and becomes a functionless pseudogene; (2) neo-functionalization, in which one duplicate retains its original function while the other is under relaxed selective constraint and may acquire a novel function (Li 1983); or (3) sub-functionalization, in which each duplicate retains a complementary part of the ancestral functions (Force et al. 1999; Lynch and Force 2000).

Given the importance of gene/domain duplication in gene family innovation, it is intriguing to trace quantitatively how evolutionary constraints change after a duplication event. It is generally believed that only a few amino acid residues, not all, contribute to functional divergence after gene/domain duplication (Golding and Dean 1998). Identifying these residues against the background of evolutionary noise is a challenge. Initially, sequence similarity scores were used as an index of the level of functional divergence, with the caveat that they cannot detect true functional divergence, because most amino acid changes are neutral or nearly neutral and are not directly related to function (Eisen 1998). The pattern of amino acid replacement, including changes in the rate of replacement, appears to be a better indicator of changes in functional/selective constraint, which is likely to be inversely related to the evolutionary rate of amino acid replacement (Li and Graur 1991). Algorithms based on empirical rules may not be sufficient to capture the complex stochastic nature of evolution (Fetrow and Skolnick 1998; Montelione and Anderson 1999; Irving et al. 2001). Hence, a statistical framework that maps the association between inferred site-specific rate shifts and changes in function after gene/domain duplication is important (Gu 1999; Gaucher et al. 2001; Gu 2001).

In addition to statistical modeling of functional divergence after gene/domain duplication, inferring the age distribution of gene families that arose from duplication events provides deeper insight into the evolution of the fine-tuned hierarchical cell network. The observation of up to four paralogs of many Drosophila genes in the human genome provides indirect evidence for the classical (two-round) hypothesis of vertebrate genome duplication (Ohno 1970). However, the relative roles of large-scale genome duplication and continuous small-scale gene duplication have been hotly debated for decades (Martin 2001; Friedman and Hughes 2001; Spring 1997; Wolfe 2001; McLysaght et al. 2002; Gu et al. 2002). Our recent study of vertebrate gene families (Gu et al. 2002) revealed that both large- and small-scale gene duplications contributed significantly to building the current hierarchy of the human proteome. Recent segmental duplications have also been shown to contribute to the extreme dynamism and fluidity of the genome over very short periods of evolutionary time (Eichler 2001). Together, these studies have shed significant light on the complicated mechanisms underlying the current diversity of vertebrate genomes.

Kinome: key machinery in cell network

Among thousands of gene families, the kinase superfamilies that catalyze reversible protein phosphorylation play a central role in regulating basic functions, including DNA replication, cell cycle control, gene transcription, protein translation, and metabolism. Protein phosphorylation is also required for more advanced functions in higher eukaryotes, including cell, organ, and limb differentiation, cell survival, synaptic transmission, cell-substratum and cell-cell communication, and the mediation of complex interactions with the external environment. Moreover, aberrant protein phosphorylation is commonly associated with cancer and other human diseases. The completion of the human genome sequence provides an opportunity to decipher the molecular nature of the kinase-mediated signal transduction machinery (Venter et al. 2001). A complement of 518 kinases, defined as the kinome (Manning et al. 2002), was identified in the human genome, constituting about 1.7% of all human genes. All known bona fide protein kinases share a conserved catalytic kinase domain of ~270 amino acids. Based on this characteristic kinase domain, eukaryotic protein kinases (ePKs) can be divided into two distinct families, protein tyrosine kinases (PTKs) and protein serine/threonine kinases (PSKs). The tyrosine kinases, a major animal-specific subclass, are likely monophyletic, derived from a protein serine/threonine kinase precursor during the course of animal evolution, whereas the major types of PSKs already existed at an early stage of eukaryotic evolution (Hanks et al. 1988; Hanks and Hunter 1995; Hunter 1995). In addition to ePKs, a group of atypical protein kinases (aPKs) possess biochemical kinase activity despite lacking sequence similarity to the ePK domain. Interestingly, the structural similarity of aPKs to authentic ePK domains may account for their enzymatic activity (Yamaguchi et al. 2001).

Dissertation Organization

The objective of this study is to explore functional divergence and evolutionary patterns in the vertebrate kinome, using a combined bioinformatic and evolutionary approach. This dissertation is composed of three sections: a general introduction, four chapters of research reports, each in journal manuscript format, and a general conclusion. The four interrelated reports are summarized as follows.

Chapter II summarizes a comprehensive phylogenetic analysis of the major PTK gene families. We found that the expansion of each PTK family was likely caused by gene or genome duplication event(s) that occurred before the emergence of bony fishes but after the vertebrate-amphioxus split. Further investigation of the evolutionary pattern revealed that site-specific shifts in evolutionary rate (altered functional constraints) are a common pattern in PTK gene family evolution (Gu 1999).

Chapter III is a case study of the Jak kinase family. The study focused on the statistical modeling and detection of functional divergence among duplicate genes. Our results showed that shifted selective constraints (or shifted evolutionary rates) between the homologous JH1 and JH2 domains are statistically significant. Moreover, we found that after the (first) gene duplication, site-specific rate shifts between the member genes Jak2/Jak3 and Jak1/Tyk2 are significant, indicating functional divergence among these member genes.

Chapter IV extends the study to a kinase-mediated network: the TGF-β signal transduction pathway, which is triggered by receptor serine/threonine kinases at the cell surface. We statistically tested for functional divergence in the major pathway components and identified the critical amino acid residues responsible for it. Interestingly, altered functional constraints appear to be a common pattern in the evolution of the TGF-β pathway. Moreover, we found a correlation between structural divergence and functional divergence in the extracellular ligand-binding domain of Type II receptor kinases.

Chapter V extends the study to the evolutionary forces that drive the organization of the vertebrate kinome. We identified the important evolutionary events that may have contributed to the existing kinome, including gene duplication, domain duplication/shuffling, gene loss, and gene nonfunctionalization. The age distribution of the human kinome shows that the major kinase-mediated signaling pathways were generated by domain shuffling/duplication early in eukaryotic evolution, and that subsequent large-scale gen(om)e duplication was involved in the emergence of tissue-specific or developmental stage-specific kinase isoforms early in vertebrate evolution. Our study of kinase pseudogenes further illustrates the fates of duplicate genes: gaining new functionality, or losing the original function and becoming pseudogenes.

References:

Doolittle RF. (1995) The origins and evolution of eukaryotic proteins. Philos Trans R Soc Lond B Biol Sci. 349:235-240.

Eichler EE. (2001) Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17:661-669.

Eisen JA. (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163-167.

Fetrow JS, Skolnick J. (1998) Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol Biol. 281:949-968.

Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 151:1531-1545.

Friedman R, Hughes AL. (2001) Gene duplication and the structure of eukaryotic genomes. Genome Res. 11:373-381.

Golding GB, Dean AM. (1998) The structural basis of molecular adaptation. Mol Biol Evol. 15:355-369.

Gaucher EA, Miyamoto MM, Benner SA. (2001) Function-structure analysis of proteins using covarion-based evolutionary approaches: Elongation factors. Proc Natl Acad Sci USA. 98:548-552.

Gu X. (1999) Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol. 16:1664-1674.

Gu X. (2001) Maximum-likelihood approach for gene family evolution under functional divergence. Mol Biol Evol. 18:453-464.

Gu X, Wang Y, Gu J. (2002) Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 31:205-209.

Hanks SK, Quinn AM, Hunter T. (1988) The protein kinase family: conserved features and deduced phylogeny of the catalytic domains. Science. 241:42-52.

Hanks SK, Hunter T. (1995) Protein kinases 6. The eukaryotic protein kinase superfamily: kinase (catalytic) domain structure and classification. FASEB J. 9:576-596.

Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L. (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science. 278:609-614.

Hughes AL. (1994) The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond B Biol Sci. 256:119-124.

Hunter T. (1995) Protein kinases and phosphatases: the yin and yang of protein phosphorylation and signaling. Cell. 80:225-236.

Irving JA, Whisstock JC, Lesk AM. (2001) Protein structural alignments and functional genomics. Proteins. 42:378-382.

Li WH. (1983) Evolution of duplicate genes and pseudogenes. In: Nei M and Koehn RK (eds) Evolution of Genes and Proteins. Sinauer Associates, Sunderland, MA, pp. 14-37.

Li WH, Graur D. (1991) Fundamentals of Molecular Evolution. Sinauer, Sunderland, MA, USA.

Li WH, Gu Z, Wang H, Nekrutenko A. (2001) Evolutionary analyses of the human genome. Nature. 409:847-849.

Lundin LG. (1993) Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics. 16:1-19.

Lynch M, Conery JS. (2000) The evolutionary fate and consequences of duplicate genes. Science. 290:1151-1155.

Lynch M, Force A. (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics. 154:459-473.

Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S. (2002) The protein kinase complement of the human genome. Science. 298:1912-1934.

Martin A. (2001) Is tetralogy true? Lack of support for the "one-to-four rule". Mol Biol Evol. 18:89-93.

McLysaght A, Hokamp K, Wolfe KH. (2002) Extensive genomic duplication during early chordate evolution. Nat Genet. 31:200-204.

Montelione GT, Anderson S. (1999) Structural genomics: keystone for a Human Proteome Project. Nat Struct Biol. 6:11-12.

Ohno S. (1970) Evolution by gene duplication. Springer, New York.

Sidow A. (1996) Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev. 6:715-722.

Spring J. (1997) Vertebrate evolution by interspecific hybridisation - are we polyploid? FEBS Lett. 400:2-8.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. (2001) The sequence of the human genome. Science. 291:1304-1351.

Yamaguchi H, Matsushita M, Nairn AC, Kuriyan J. (2001) Crystal structure of the atypical protein kinase domain of a TRP channel with phosphotransferase activity. Mol Cell. 7:1047-1057.

Wolfe KH. (2001) Yesterday's polyploids and the mystery of diploidization. Nat Rev Genet. 2:333-341.

CHAPTER II: NATURAL HISTORY AND FUNCTIONAL DIVERGENCE IN PROTEIN TYROSINE KINASES

A paper accepted by Gene¹

Jianying Gu², Xun Gu²,³

Abstract

Cellular signaling is at the core of many biological processes, including growth, differentiation, adhesion, motility, and apoptosis. The protein tyrosine kinase (PTK) supergene family is the key mediator of cellular signaling in metazoans and is directly associated with a variety of human diseases. All PTKs contain a highly conserved catalytic kinase domain, in spite of variable multi-domain structures. Within each PTK gene family, members exhibit functional divergence in substrate specificity or temporal/tissue-specific expression, although their primary function is conserved. After conducting phylogenetic analysis of the major PTK gene families, we found that the expansion of each PTK family was likely caused by gene or genome duplication event(s) that occurred before the emergence of teleosts but after the vertebrate-amphioxus split. We further investigated the evolutionary pattern of functional divergence after gene duplication in those gene families. Our results show that site-specific shifts in evolutionary rate (altered functional constraints) are a common pattern in PTK gene family evolution.

¹ Accepted by Gene.
² Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University.
³ Author for correspondence.

1. Introduction

Protein kinases catalyze the transfer of a phosphate group from a donor to an acceptor. They can be divided into two major groups: protein tyrosine kinases (PTKs) and serine/threonine kinases (PSKs) (Hubbard and Till, 2000). Because PTKs play important roles in regulating intracellular signaling pathways and contribute to the development of many cancers, they have served as drug targets in many disease therapies (Blume-Jensen and Hunter, 2001; Sridhar et al., 2000). There is therefore general interest in exploring how much function-related information can be obtained from molecular evolutionary analysis.

PTKs can be further divided into receptor tyrosine kinases (RTKs) and non-receptor (cytosolic) tyrosine kinases. Typically, a cascade of signals is initiated from an RTK after ligand binding, which leads to receptor dimerization, kinase activation, and autophosphorylation of tyrosine residues. These phosphorylated tyrosines then serve as docking sites for recruiting downstream signaling molecules, including non-receptor tyrosine kinases that can trigger a variety of cell responses. The extensive cross talk between PTK-triggered pathways increases the complexity of the signaling process (see Schlessinger, 2000 for a review).

The human genome project has opened new opportunities for the study of PTK-mediated signaling. The first draft of the complete human sequence predicted a total of 106 PTKs, of which 58 are receptor kinases and 48 are non-receptor kinases (Venter et al., 2001). Although different PTK subfamilies have different domain structures and functions, all PTKs share a conserved kinase domain that is responsible for their catalytic function. To unveil the intrinsic functional diversity among PTKs, the natural history of gene family expansion, and the underlying evolutionary mechanisms, we performed extensive phylogeny-based analysis of 27 PTK families. Our study may provide new insights into the complexity of signal transduction networks in the animal kingdom.

2. Materials and Methods

2.1 Data collection

The full-length sequences of the vertebrate PTK gene families used in this study were obtained from the Hovergen database (http://pbil.univ-lyon1.fr/), with reference to the literature (Robinson et al., 2000; Blume-Jensen and Hunter, 2001). An exhaustive PSI-BLAST search against the non-redundant protein database at NCBI was performed to find homologous sequences in invertebrates such as D. melanogaster and C. elegans, which were used as outgroups. The complete alignment of the PTK kinase domain was obtained from Pfam (http://pfam.wustl.edu/). The chromosomal synteny of individual PTKs in the human genome was obtained with the Map Viewer at NCBI (http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map_search/). The final data set included the following PTK families:

(I) Receptor tyrosine kinases: Axl, discoidin domain receptor (DDR), epidermal growth factor receptor (EGFR), ephrin receptor (EphR), fibroblast growth factor receptor (FGFR), hepatocyte growth factor receptor (HGFR), insulin receptor (InsR), leukocyte tyrosine kinase (Ltk/Alk), muscle-specific kinase (Musk), platelet-derived growth factor receptor (PDGFR), Ret, receptor orphan (Ror), Ros, Ryk, Tie, tropomyosin-related kinase (Trk), and vascular endothelial growth factor receptor (VEGFR);

(II) Non-receptor tyrosine kinases: Abelson tyrosine kinase (Abl), activated Cdc42-associated kinase (Ack), C-terminal Src kinase (Csk), focal adhesion kinase (Fak), fps/fes-related kinase (Fer/Fes), fyn-related kinase (Frk), Janus kinase (Jak), Src, spleen tyrosine kinase (Syk), and Tec.

2.2 Multiple alignment and phylogenetic analysis

The multiple alignment of each gene family was obtained with the ClustalX program (http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html), followed by manual adjustment. Local alignments were reconciled with the kinase domain defined in the Pfam database by Hidden Markov Models (HMMs). Phylogenetic trees were inferred by the neighbor-joining (NJ) method using MEGA2.0 (http://megasoftware.net/) with the Poisson distance. The robustness of the tree topology to the tree-building method was examined with PAUP4.0 (http://paup.csit.fsu.edu/) and PHYLIP (http://evolution.genetics.washington.edu/phylip.html). Bootstrap analysis was carried out to assess support for individual nodes.
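
The Poisson distance used for tree building can be illustrated with a minimal sketch. It assumes the standard Poisson correction d = -ln(1 - p), where p is the proportion of differing residues over ungapped alignment columns; the function names and the gap handling are illustrative assumptions, not MEGA2.0's implementation.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence (an assumption)
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p) between two aligned protein sequences."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance is undefined when every compared site differs")
    return -math.log(1.0 - p)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise Poisson distances for a list of aligned sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = poisson_distance(sequences[i], sequences[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

A matrix of this kind is the usual input to a neighbor-joining routine; the tree inference itself is left to dedicated software such as the packages named above.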

2.3 Time of gene duplication

A linearized neighbor-joining tree (Takezaki et al., 1995) was used to convert the (average) Poisson distance of protein sequences to a molecular time scale, using the software package MEGA2.0. In this study, the divergence times of the primate-rodent (80 million years ago, mya), mammal-bird (310 mya), mammal-amphibian (350 mya), tetrapod-teleost (430 mya), and vertebrate-Drosophila (830 mya) splits were used as calibrations (Kumar and Hedges, 1998; Gu, 1998; Wang and Gu, 2000).
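
As a rough sketch of how calibration points can turn distances into dates under a molecular clock, the code below fits a single rate through the origin by least squares and divides a node's distance by that rate. The fitting choice and the function names are assumptions made for illustration and are not necessarily the procedure MEGA2.0 follows; only the calibration times come from the text, and any distances supplied by a caller are the caller's own data.

```python
# Calibration times (mya) quoted in the text above.
CALIBRATION_MYA = {
    "primate-rodent": 80.0,
    "mammal-bird": 310.0,
    "mammal-amphibian": 350.0,
    "tetrapod-teleost": 430.0,
    "vertebrate-Drosophila": 830.0,
}


def fit_clock_rate(points):
    """Least-squares slope of distance = rate * time, forced through the origin.

    points: iterable of (time_mya, distance) pairs, one per calibration split.
    """
    points = list(points)
    numerator = sum(t * d for t, d in points)
    denominator = sum(t * t for t, _ in points)
    if denominator == 0:
        raise ValueError("need at least one calibration point with non-zero time")
    return numerator / denominator


def estimate_time(distance, rate):
    """Convert a clock-like distance into a divergence time (mya) given a fitted rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / rate
```

In use, each calibration split would be paired with the average distance observed for that split in the linearized tree, and the fitted rate would then date the duplication nodes.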

2.4 Type I functional divergence analysis

Type I functional divergence refers to the evolutionary process that results in altered functional constraints (or site-specific evolutionary rate shifts) between two duplicate genes, regardless of the underlying evolutionary mechanism (Gu, 1999). A statistical framework modeling functional divergence was used to estimate the coefficient of functional divergence ($\theta$), an indicator of the level of Type I functional divergence between two homologous gene clusters (Gu, 1999; Wang and Gu, 2001; Gu et al., 2002a). Rejection of the null hypothesis $H_0\colon \theta = 0$ provides statistical evidence for shifts in the inferred evolutionary rate (or altered functional constraints) between duplicates.

Moreover, the sites that contribute critically to the overall functional divergence can be predicted by posterior analysis. Let $Q(k) = P(S_1 \mid X)$ be the posterior probability of site $k$ being in state $S_1$ (functional divergence-related) when the amino acid configuration $X$ is observed. Since the alternative state $S_0$ (functional divergence-unrelated), with posterior probability $P(S_0 \mid X) = 1 - P(S_1 \mid X)$, means no altered functional constraint, the predicted residues are only meaningful when $Q(k) > 0.5$, in which case the posterior odds ratio $R(S_1/S_0) = P(S_1 \mid X) / P(S_0 \mid X) > 1$. A more stringent cut-off may be $Q(k) > 0.67$, or $R(S_1/S_0) > 2$. The computational software DIVERGE for Type I functional divergence analysis and prediction is available at http://xgu1.zool.iastate.edu/.
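
The cut-off rule above can be sketched in a few lines. The code assumes the per-site posterior probabilities Q(k) have already been computed elsewhere (for example, exported from an analysis program); the function names and the input mapping are hypothetical and do not reflect the DIVERGE interface.

```python
import math


def posterior_odds(q):
    """Posterior odds ratio R = Q / (1 - Q) for a posterior probability Q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("posterior probability must lie in [0, 1]")
    if q == 1.0:
        return math.inf
    return q / (1.0 - q)


def predict_sites(posteriors, cutoff=0.5):
    """Return (site, Q, R) for sites with Q above the cutoff, highest Q first.

    posteriors: mapping from alignment position k to its posterior probability Q(k).
    cutoff: 0.5 corresponds to R > 1; the stricter 0.67 corresponds to roughly R > 2.
    """
    selected = [
        (site, q, posterior_odds(q))
        for site, q in posteriors.items()
        if q > cutoff
    ]
    return sorted(selected, key=lambda item: item[1], reverse=True)
```

Raising the cutoff from 0.5 to 0.67 trades sensitivity for specificity: fewer sites are reported, but each is at least about twice as likely to be divergence-related as not.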

3. Results and Discussion

3.1 The natural history of protein tyrosine kinase gene families

PTKs form a large and diverse superfamily whose characteristic catalytic domain is conserved across gene families. The classification of these gene families is based on ligand specificity, biological function, and primary structure. Within each gene family, the member genes are conserved in sequence and domain structure but differ in tissue, developmental-stage, or ligand specificity.

Gene (domain) duplication plays an important role in eukaryote evolution by providing the primary source of evolutionary novelty (Ohno, 1970). Recently we (Gu et al., 2002b) demonstrated that both large-scale (genome-wide) and small-scale duplications contributed significantly to the origin of numerous multigene families. The phylogenetic tree based on the alignment of kinase domain sequences suggests that the whole set of PTKs was generated by a series of domain duplications in the early stage of animal evolution, corresponding to the "ancient component" in Gu et al. (2002b). These gene (domain) duplications have given rise to the current complex signal transduction network in animals (Hanks et al., 1995; Suga et al., 1997).

Furthermore, the phylogenetic tree of each gene family in the PTK superfamily reveals the impact of genome-wide duplication(s) in the early stage of vertebrate evolution (Gu et al., 2002b). Moreover, we found no evidence that recent gene duplication (around or after the mammalian radiation) contributed to the origin of tissue-specific PTKs. Because of space limitations, we discuss only two typical cases below.

EphR gene family

To date, 14 highly related ephrin receptors have been identified in vertebrates. They can be divided into two classes according to ligand-binding preference: EphA receptors (EphA1-A8) recognize glycosyl-phosphatidyl-inositol (GPI)-anchored ephrin-A ligands (A1-A5), whereas EphB receptors (EphB1-B6) preferentially bind ephrin-B ligands (B1-B3), which span the membrane via a transmembrane domain (Gale et al., 1996). One exception is EphA4, a receptor that can bind and respond to B-class as well as A-class ephrins.

The phylogenetic tree of the ephrin receptors was inferred from the full-length amino acid sequences, and its root was tentatively determined from the phylogeny of the PTK superfamily. It shows that the A1/A2 ephrin receptor group branched off first, probably after the split between vertebrates and Drosophila (Fig. 1). The subsequent gene duplication gave rise to the two distinct classes of ephrin receptors; the rest of Class A and all of the Class B ephrin receptors fall into separate clades with a 93% bootstrap value. Within each class, more tissue-specific isoforms were generated by substantial gene duplication. Overall, the evolution and diversity of ephrin receptors were driven by both small-scale and large-scale gene duplications in the early stage of vertebrate evolution. More vertebrate homologs (e.g., from fishes) are needed to determine the phylogenetic interval of some duplication events. Nevertheless, molecular-clock analysis (Wang and Gu, 2000; Gu et al., 2002b) generally supports this view (Fig. 1).

[Figure 1 image: phylogenetic tree of EphA and EphB receptor sequences with bootstrap values; C. elegans and D. melanogaster serve as outgroups.]

Fig. 1. Phylogenetic tree of the Eph receptor gene family inferred by the neighbor-joining method with Poisson correction. The Eph receptor family contains 14 highly related member genes, which can be divided into two classes, EphA receptors and EphB receptors, with a bootstrap value of 93%. Several duplication time points are indicated: (A) 403.1 mya; (B) 481.7 mya; (C) 526.0 mya; (D) 435.3 mya; (E) 477.7 mya; (F) 322.5 mya; (G) 596.6 mya; (H) 747.7 mya; (I) 784.0 mya.
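The Poisson correction named in the caption converts an observed proportion of differing residues p into an estimated number of substitutions per site, d = -ln(1 - p). The following is a minimal sketch of that calculation for one pair of aligned sequences; the function names and the toy sequences are illustrative assumptions, not the software or data used in this study.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing residues over the ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected amino acid distance, d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


# Toy aligned fragments (hypothetical): one difference in six columns.
example_distance = poisson_distance("LPCAMG", "LPCAVG")
```

A neighbor-joining tree would be built from the matrix of such distances over all sequence pairs; that step is not sketched here.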


If the rooting of the EphR tree is largely correct, the evolution of ligand-binding preferences can be inferred from the phylogenetic analysis. That is, the ancestral EphR may have bound only ephrin-A ligands, while the ability to bind ephrin-B ligands was derived relatively recently. This pattern is consistent with the classic theory of functional innovation after gene duplication: one copy kept the original function (binding ephrin-A ligands) and the other acquired a novel function (binding ephrin-B ligands) through the accumulation of amino acid changes (Li, 1983). Interestingly, a similar pattern has also been observed within Class B ephrin receptors: EphB2 and EphB3 have been demonstrated to possess partial functional redundancy in midline guidance in the nervous system, implying that both preserve ancestral functions of Class B receptors (Cowan et al., 2000). Notably, EphB6 shows a remarkably long branch. One possibility is that it has undergone loss-of-function divergence owing to the loss of sites crucial for tyrosine kinase activity and the resulting relaxation of functional constraints (Gurniak et al., 1996).

Src gene family

The Src cytoplasmic tyrosine kinase family plays crucial roles in a variety of cellular processes, such as cell cycle control, cell adhesion, cell motility, cell proliferation and cell differentiation (Thomas et al., 1997). Extensive studies indicate that the functional complexity of Src kinases comes mainly from their ability to communicate with a large number of upstream receptors and downstream effectors. The NJ tree shows that member genes of the Src family can be divided into two major groups: (A) Src, Yes, Fyn, Fgr and Yrk; and (B) Lyn, Hck, Lck and Blk (Fig. 2). The three nearest neighbors of the Src family, the tyrosine kinases Frk, Brk and Srm, were used to root the NJ tree.

There are distinct differences between groups A and B at the gene expression level. Three group A member genes (Src, Fyn and Yes) are ubiquitously expressed, whereas all the group B member genes (Lck, Hck, Lyn and Blk) and one group A member gene (Fgr) have more tissue-restricted expression, mainly in hematopoietic cells. Research has also shown significant structural differences between group A and group B Src genes. The resolved crystal structures of human Src (group A) and human Hck (group B) reveal a significant geometrical difference between these two groups of proteins (Williams et al., 1998). Furthermore, the observation that swapping the SH2 and SH3 domains of Lck (group B) with the corresponding

[Figure 2 (tree graphic not recoverable from the extraction; scale bar: 0.10). Member genes and human chromosomal locations as labeled: Src (20q12-q13); Yes (18p11.3); Fyn (6q21); Yrk (?); Fgr (1p36.2-p36.1); Blk (8p23-p22); Lck/Tkl (1p35-p34.3); Hck (20q11-q12); Lyn (8q13-qter); Frk (?); Brk (?); Srm (?). Fly sequences shown: src42, src64B and shark.]

Fig. 2. Phylogenetic structure of the Src gene family. The Src family can be divided into two major distinct groups: group A including Src, Yes, Fyn, Fgr and Yrk; group B including Lyn, Hck, Lck and Blk. Kinases Frk, Brk and Srm are used as outgroups to root the NJ tree. Identified chromosomal locations of member genes are listed according to the human genome map.


domains of Src (group A) abolished the regulation of Src activity (Gonfloni et al., 1997) confirms that these domains are not functionally interchangeable. It is noteworthy that the N-terminal SH4 domain is very short and conserved across all Src member genes, indicating a common functional role maintained over long-term evolution.

3.2 Evidence of gene duplication in PTK

Mounting evidence suggests that, in addition to a continuous flux of small-scale gene duplications, at least one genome duplication event occurred during early chordate evolution (Gu et al., 2002b; McLysaght et al., 2002). It is of particular interest to examine whether PTK gene families were generated through such gen(om)e duplication(s).

Age distribution of gene duplication in PTK families

Using the same approach as described by Gu et al. (2002b), we dated the gene duplication events that gave rise to the PTK families. Among a total of 56 investigated duplication events, 38 occurred during the time window 430-750 Mya, which falls between the vertebrate-amphioxus split and the emergence of teleosts (Fig. 3).

[Figure 3 (histogram not recoverable from the extraction). Panel title: The estimated time of gene duplications in PTK families. X-axis: time (mya), in 50-Myr bins from 301-350 to 901-950.]

Fig. 3. Age distribution of PTK gene duplication events. The molecular time scale is

measured as Myr ago. Each bin for the histogram is 50 Myr.
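The binning behind such an age distribution can be sketched as follows. Because the 56 PTK duplication dates are not listed in the text, this example uses only the nine Eph receptor dates given in the Fig. 1 caption, so it illustrates the procedure rather than reproducing Fig. 3; the function names are assumptions introduced for the example.

```python
import math
from collections import Counter

# Duplication dates (mya) copied from the Fig. 1 caption, events (A) to (I).
EPH_DATES = [403.1, 481.7, 526.0, 435.3, 477.7, 322.5, 596.6, 747.7, 784.0]


def bin_label(age_mya: float, width: int = 50) -> str:
    """Label of the bin containing an age, e.g. 403.1 -> '401-450'."""
    upper = math.ceil(age_mya / width) * width
    return f"{upper - width + 1}-{upper}"


def age_histogram(ages, width: int = 50) -> Counter:
    """Count duplication events per bin of the given width (Myr)."""
    return Counter(bin_label(age, width) for age in ages)


def count_in_window(ages, start: float = 430.0, end: float = 750.0) -> int:
    """Number of events whose estimated age falls inside [start, end]."""
    return sum(1 for age in ages if start <= age <= end)


histogram = age_histogram(EPH_DATES)
in_window = count_in_window(EPH_DATES)
```

For these nine Eph dates, six fall inside the 430-750 mya window discussed in the text.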


Gu et al. (2002b) showed that both large- and small-scale duplications contributed to the current hierarchy of the vertebrate genome: while a continuous mode of gene duplication has operated since the vertebrate-fruitfly split, a rapid increase in the number of paralogous genes was observed during the period 430-750 Myr ago. The age distribution of PTK genes suggests that the big-bang explosion in the early vertebrate stage was the dominant force shaping the diversity of functional specificity, whereas the continuous mode (a constant rate of gene duplication) played a secondary role (chi-squared test, p < 0.0005). Using a different approach, Miyata and coworkers reached a similar conclusion (Iwabe et al., 1996; see Miyata and Suga, 2001 for review).

Chromosome distribution of paralogous genes

Chromosome mapping of paralogous genes may provide useful information about the occurrence of genome duplication. We are aware that, owing to frequent reshuffling and natural selection after chromosome duplications, it is difficult to reconstruct the real history of chromosome duplication events. Nevertheless, present-day chromosome synteny still preserves some trace of the physical reorganization after gene duplication. For instance, a number of paralogous chromosomal regions (paralogons) have been identified recently (Popovici et al., 2001; McLysaght et al., 2002). Such a paralogon has been reported for PTKs as well: all the member genes of the fibroblast growth factor receptor (FGFR) family and seven other families are located in the same set of chromosomal regions, 4p16, 5q33-35, 8p12-21 and 10q21-26 (Pebusque et al., 1998), indicating the possible occurrence of block duplication.

PDGFR/VEGFR gene family

The PDGFR/VEGFR family comprises eight member genes: Pdgfr-α and Pdgfr-β, colony-stimulating factor 1 receptor (CSF1-R), stem cell factor receptor (SCFR, commonly known as c-Kit), Flt-3 (growth factor receptor for early hematopoietic progenitors), Vegfr-1, Vegfr-2, and Vegfr-3. This gene family seems to be vertebrate-specific, since no homologous sequence has been found in invertebrates such as Drosophila and C. elegans. The analysis of the phylogenetic hierarchy together with chromosome synteny provides striking evidence for the duplication of chromosome blocks during PDGFR family evolution. These eight member genes appear to have arisen from a common ancestor,


based on their sequence similarity and their common structural features (e.g., a highly hydrophilic insertion in the catalytic kinase domain). Their diversification may have been generated by a couple of gene duplication events. As shown in Figure 4, the first major duplication event took place in the early stage of vertebrates, at least prior to the divergence between tetrapods and teleosts. It involved the tetraploidization of three chromosome regions: 13q12, 4q12, and 5q33-35. Consequently, one of the copies evolved into the present VEGFR isoforms 1, 2, and 3. Analogously, Flt3 may represent the other descendant in the region of 13q12. In contrast, complex changes have occurred in the close proximity of 4q12 (Kit and Pdgfr-α and counterparts) and 5q33-35 (CSF1R and Pdgfr-β) owing to another round of gene duplication (Rousset et al., 1995). It is striking that the tandemly linked chromosome synteny has been largely preserved over more than 400 million years. One plausible explanation is that the crucial functional role of this group of receptor kinases decreases the likelihood of radical chromosomal rearrangement within the neighborhood.

Src family

In the Src gene family, the chromosome synteny also shows an interesting pattern. As shown in Figure 2, Src and Hck may be descendants of a common ancestor located near human chromosome 20q, and Fgr and Lck may have arisen from a common ancestor in the proximity of chromosome 1p35-36. Though the entire puzzle of Src evolution remains unresolved, these observations lead us to hypothesize that gene duplication (chromosome tetraploidization) is an important evolutionary mechanism underlying the organization of present-day Src genes.

3.3 Site-specific evolutionary rate shift in PTK gene families

We have conducted pairwise functional divergence analysis between paralogous genes for each gene family. Gene clusters with fewer than four sequences were excluded from the analysis because of insufficient information. Table 1 shows the coefficient of functional divergence (θ) for pairwise comparisons in the PTK superfamily. Forty-two (42) out of 43 comparisons showed θ > 0 with p < 0.05, suggesting that site-specific rate shift after gene duplication is a common phenomenon in PTK evolution.
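One rough way to read the estimates in Table 1 is to ask how many standard errors each θ lies from zero. The sketch below computes that ratio for three comparisons whose values are copied from Table 1. It is only a Wald-type approximation introduced here for illustration; it is not the significance test used in this study (see Gu, 1999), and it need not agree with that test for comparisons with large standard errors.

```python
def wald_ratio(theta: float, se: float) -> float:
    """Estimate divided by its standard error (distance from zero in SE units)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return theta / se


# (theta, se) pairs copied from Table 1.
TABLE1_EXAMPLES = {
    "Met vs Ron": (0.222, 0.037),
    "TrkB vs TrkC": (0.527, 0.091),
    "Src vs Fyn": (0.362, 0.073),
}

ratios = {
    name: wald_ratio(theta, se) for name, (theta, se) in TABLE1_EXAMPLES.items()
}
```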

We have noticed an association between predicted functional divergence (via site-specific rate shift) and observed (experimental or phenotypic) traits. For example, in the Src

[Figure 4 (tree graphic not recoverable from the extraction). Clades and human chromosomal locations listed under the panel heading "Chromosome synteny": KIT (4q12); CSF1R (5q33.2-33.3); PDGFRα (4q12); PDGFRβ (5q31-32); FLT3 (13q12); VEGFR1 (13q12); VEGFR2 (4q12); VEGFR3 (5q35.3).]

Fig. 4. Phylogenetic tree of the PDGFR gene family. Phylogenetic hierarchy along with chromosome synteny provides evidence for the occurrence of the duplication of chromosome blocks during the PDGFR family evolution. The chromosome regions 13q12 (Flt3 and VEGFR1), 4q12 (Kit, PDGFR-α and VEGFR2), and 5q33-35 (CSF1-R, PDGFR-β and VEGFR3) might have been generated from a common ancestral chromosomal block following a series of chromosome duplications.


Table 1: The coefficient of functional divergence (θ) of pairwise comparisons of tissue-specific genes of protein tyrosine kinase gene families.

Columns: pairwise comparison (a); θ ± se

Receptor kinases

  Epidermal growth factor receptor (EGFR)
    Egfr (5)       ErbB2 (4)          0.306 ± 0.084
    Egfr (5)       ErbB3/4 (6)        0.184 ± 0.046
    ErbB2 (4)      ErbB3/4 (6)        0.132 ± 0.103

  Insulin receptor (InsR/Igf-1R/Irr)
    InsR (7)       Igf-1R (8)         0.137 ± 0.041

  Platelet-derived growth factor receptor (PDGFR)
    Kit (14)       Fms (5)            0.215 ± 0.032
    Kit (14)       Pdgfr (9)          0.161 ± 0.026
    Kit (14)       Vegfr (11)         0.287 ± 0.031
    Fms (5)        Pdgfr (9)          0.098 ± 0.035
    Fms (5)        Vegfr (11)         0.342 ± 0.043
    Pdgfr (9)      Vegfr (11)         0.233 ± 0.032

  Fibroblast growth factor receptors (FGFR)
    Fgfr1 (7)      Fgfr2 (9)          0.275 ± 0.075
    Fgfr1 (7)      Fgfr3 (9)          0.407 ± 0.066
    Fgfr1 (7)      Fgfr4 (8)          0.322 ± 0.061
    Fgfr2 (9)      Fgfr3 (7)          0.210 ± 0.074
    Fgfr2 (9)      Fgfr4 (8)          0.342 ± 0.075
    Fgfr3 (7)      Fgfr4 (8)          0.338 ± 0.060

  Trk
    TrkB (5)       TrkC (5)           0.527 ± 0.091
    TrkB (5)       Trk-related (4)    0.754 ± 0.063
    TrkC (5)       Trk-related (4)    0.782 ± 0.077

  Hepatocyte growth factor receptor
    Met (6)        Ron (6)            0.222 ± 0.037

  Ephrin receptor
    EphA (23)      EphB (23)          0.159 ± 0.031 (b)

  Tie/Ret
    Tie (8)        Ret (6)            0.444 ± 0.056

  Axl
    Mer (4)        Tyro3 (5)          0.298 ± 0.058
    Mer (4)        Ror1/2 (4)         0.763 ± 0.092
    Tyro3 (5)      Ror1/2 (4)         0.796 ± 0.094

Non-receptor kinases

  Src
    Src (6)        Yes (6)            0.283 ± 0.095
    Src (6)        Fyn (8)            0.362 ± 0.073
    Src (6)        Lck (4)            0.728 ± 0.094
    Src (6)        Lyn (7)            0.728 ± 0.098
    Src (6)        Brk/Srm (4)        0.466 ± 0.163
    Yes (6)        Fyn (8)            0.424 ± 0.098
    Yes (6)        Lck (4)            0.871 ± 0.112
    Yes (6)        Lyn (7)            0.661 ± 0.111
    Yes (6)        Brk/Srm (4)        0.431 ± 0.222
    Fyn (8)        Lck (4)            0.642 ± 0.081
    Fyn (8)        Lyn (7)            0.768 ± 0.094
    Fyn (8)        Brk/Srm (4)        0.566 ± 0.088
    Lck (4)        Lyn (7)            0.582 ± 0.102
    Lck (4)        Brk/Srm (4)        0.315 ± 0.127
    Lyn (7)        Brk/Srm (4)        0.246 ± 0.179

  Janus kinase
    Jak1 (7)       Jak2 (6)           0.351 ± 0.041
    Jak1 (7)       Jak3 (6)           0.280 ± 0.040
    Jak2 (6)       Jak3 (6)           0.361 ± 0.034

a) The number inside the parentheses represents the number of sequences for that member gene.
b) EphA including EphA3, EphA4, EphA5, EphA6, EphA7 and EphA8; EphB including EphB1, EphB2, EphB3, EphB4, EphB5 and EphB6.


family, θ between the two major groups A and B is estimated to be 0.290 ± 0.045, suggesting substantial alterations in their site-specific selective constraints, which are reflected in the differences in expression, structure and functional role between the two groups. In particular, within group A, θ between Fyn and Src is estimated to be 0.362 ± 0.073. This functional divergence is well demonstrated by knock-out assays: fyn knock-out mice have alterations in a specific region of the brain, the hippocampus, where some layers have more neurons than in wild-type mice, causing the respective layers to undulate; in contrast, src knock-out mice resembled wild-type controls (Grant et al., 1992).

Amino acid residues responsible for such functional divergence after gene duplication can be predicted from a site-specific profile, which gives the posterior probability that a site is related to functional divergence given its amino acid configuration. By choosing a suitable cut-off value, we obtain a list of candidate sites, which can be mapped onto the secondary or domain structures. For example, in the Trk family, the posterior analysis predicts 27 amino acid sites critical for functional divergence between TrkB and TrkC, given the cut-off value Q(k) > 0.7 (note: θ between TrkB and TrkC is estimated to be 0.527 ± 0.091). Three of the predicted sites are located in the leucine-rich domain, two in the first immunoglobulin (Ig)-like domain, and two in the kinase domain. Interestingly, another ten sites reside inside the second Ig-like domain, which crystallographic studies have shown to be the dominant element for neurotrophin-binding specificity (Urfer et al., 1998; Wiesmann et al., 1999). The spatial distribution of these predicted sites indicates that they are not evenly dispersed through the whole coding region. Instead, the majority (17/27) are located inside the functional domains, indicating the importance of domain structure in functional innovation after gene duplication.
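The two steps described above, applying a cut-off to the posterior profile and then mapping the selected sites onto domains, can be sketched as below. The profile values, domain names and domain boundaries are hypothetical toy data chosen for the example; they are not the Trk results.

```python
from collections import Counter


def candidate_sites(profile: dict, cutoff: float) -> list:
    """Alignment positions whose posterior probability Q(k) exceeds the cutoff."""
    return [pos for pos, q in sorted(profile.items()) if q > cutoff]


def tally_by_domain(sites: list, domains: dict) -> Counter:
    """Count sites per domain; `domains` maps name -> (start, end), inclusive."""
    counts = Counter()
    for pos in sites:
        name = next(
            (d for d, (start, end) in domains.items() if start <= pos <= end),
            "outside domains",
        )
        counts[name] += 1
    return counts


# Hypothetical posterior profile: alignment position -> Q(k).
toy_profile = {12: 0.91, 40: 0.25, 95: 0.72, 130: 0.70, 210: 0.88, 305: 0.15}
# Hypothetical domain boundaries in alignment coordinates.
toy_domains = {"domain_1": (1, 100), "domain_2": (200, 300)}

sites = candidate_sites(toy_profile, cutoff=0.7)
counts = tally_by_domain(sites, toy_domains)
```

The comparison is strict, matching the Q(k) > 0.7 criterion in the text, so position 130 with a value of exactly 0.70 is not selected.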

The predicted sites also provide a hint about the association between site-specific rate shift and molecular phenotypes, as illustrated by the hepatocyte growth factor receptor (HGFR) family. The HGFR family induces mitogenic, motogenic and morphogenic cellular responses, as well as tumorigenesis, by triggering a multi-step genetic program called 'invasive growth', which includes cell dissociation, invasion of extracellular matrices and growth. Two member genes (Met and Ron/Sea) have been identified in species ranging from human to Fugu, a teleost fish (Fig. 5a). The functional divergence analysis shows significant changes in the functional

[Figure 5 (graphics not recoverable from the extraction). Panel (a): tree with Ron and Met clades (human, mouse, rat, chicken, frog and Fugu sequences; scale bar: 0.1). Panel (b): "Site-specific profile for functional divergence in HGFR family"; x-axis: alignment position (Met), ticks from 101 to 1201. Panel (c): amino acid configuration at the predicted sites, listed below; the site positions k are not recoverable.]

Sequence        Category I    Category II
RON_HUMAN       LPCAMG        TNLRV
RON_MOUSE       LPCAMG        TTIQA
RON_CHICKEN     LPCAMG        DHEEG
RON_FROG        LPCAMG        LLRDQ
PRGFR3_FUGU     LPCAVG        GQSST
PRGFR2_FUGU     LPCAMG        STAVS
MET_MOUSE       MFSAIE        DBASE
MET_RAT         MFSAIE        DBASE
MET_HUMAN       IFRPIG        DBASE
MET_CHICKEN     MLDAVG        DDASE
MET_FROG        LIISVQ        DDASE
PRGFR1_FUGU     FYESTK        DDASE

Fig. 5. Functional divergence analysis of HGFR gene family. The overall coefficient of functional divergence between Met and Ron is θ ± se = 0.222 ± 0.037. (a) Phylogenetic tree of HGFR gene family. (b) The site-specific profile of posterior probability of functional divergence between Met and Ron. (c) Amino acid configuration of sites with Q(k) > 0.6. Predicted sites can be divided into two categories: category I contains sites conserved in Ron but variable in Met; and category II includes sites conserved in Met but variable in Ron.


constraints between Met and Ron (θ = 0.222 ± 0.037). Furthermore, posterior analysis implies strong variation among sites in evolutionary rate shift. As shown in Figure 5b, the baseline of the posterior probability profile is between 0.2 and 0.3, suggesting that most amino acid residues do not have significant effects on the overall functional divergence. Only a small portion of residues have a high posterior probability of being responsible for the functional divergence between Met and Ron. With the cut-off value Q(k) > 0.6, 11 sites were predicted, which can be divided into two categories: category I contains sites conserved in Ron but variable in Met; conversely, category II includes sites conserved in Met but variable in Ron (Fig. 5c). Given the observation that knock-out mice lacking met have a lethal placental defect, whereas ron-deficient embryos are viable through the blastocyst stage of development (Uehara et al., 1995; Muraoka et al., 1999), we speculate that Met has indispensable functions and is likely under more stringent functional constraint than Ron. Indeed, the high conservation of the category II sites in Met may be very important for maintaining the specific Met function.
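The category I / category II distinction can be illustrated with a simple count of distinct residues per site in each cluster. The sketch below uses the residue strings as they appear in the extracted Fig. 5c (the "B" in three Met rows may be an extraction artifact), and the classification rule, fewer distinct residues meaning more conserved, is an assumption made for this example rather than the criterion applied in the study.

```python
# Residues at the 11 predicted sites, copied from the extracted Fig. 5c.
# First six columns: category I sites; last five columns: category II sites.
RON = [
    "LPCAMGTNLRV",  # RON_HUMAN
    "LPCAMGTTIQA",  # RON_MOUSE
    "LPCAMGDHEEG",  # RON_CHICKEN
    "LPCAMGLLRDQ",  # RON_FROG
    "LPCAVGGQSST",  # PRGFR3_FUGU
    "LPCAMGSTAVS",  # PRGFR2_FUGU
]
MET = [
    "MFSAIEDBASE",  # MET_MOUSE
    "MFSAIEDBASE",  # MET_RAT
    "IFRPIGDBASE",  # MET_HUMAN
    "MLDAVGDDASE",  # MET_CHICKEN
    "LIISVQDDASE",  # MET_FROG
    "FYESTKDDASE",  # PRGFR1_FUGU
]


def distinct_per_column(seqs: list) -> list:
    """Number of distinct residues observed at each aligned site."""
    return [len(set(column)) for column in zip(*seqs)]


def classify_sites(ron: list, met: list) -> list:
    """'I' if Ron is less variable than Met at a site, 'II' if the reverse."""
    labels = []
    for n_ron, n_met in zip(distinct_per_column(ron), distinct_per_column(met)):
        if n_ron < n_met:
            labels.append("I")
        elif n_met < n_ron:
            labels.append("II")
        else:
            labels.append("tie")
    return labels


site_categories = classify_sites(RON, MET)
```

Under this rule the first six sites come out as category I and the last five as category II, in line with the grouping shown in Fig. 5c.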

4. Conclusions

In conclusion, our comprehensive evolutionary analysis of PTKs reveals that (1) both large-scale and small-scale gene duplications are major evolutionary forces generating the contemporary PTK superfamily; (2) substantial functional divergence occurred after gene duplication(s), characterized by significant shifts in evolutionary rates; and (3) evolutionary functional divergence is correlated with phenotypic functional divergence in paralogous genes. These results not only shed light on the role of gene duplication in the development of the hierarchical PTK-mediated network, but also provide impetus for a new approach to predicting functional divergence from evolutionary changes.

Acknowledgement

We thank Yufeng Wang for helpful discussion. This study is supported by NIH grant R01 GM62118 to X. G.

References


Blume-Jensen, P., Hunter, T., 2001. Oncogenic kinase signalling. Nature 411(6835),

355-365.

Cowan, C.A., Yokoyama, N., Bianchi, L.M., Henkemeyer, M., Fritzsch, B., 2000. EphB2 guides axons at the midline and is necessary for normal vestibular function. Neuron 26(2), 417-430.

Gale, N.W., Holland, S.J., Valenzuela, D.M., Flenniken, A., Pan, L., Ryan, T.E., Henkemeyer, M., Strebhardt, K., Hirai, H., Wilkinson, D.G., Pawson, T., Davis, S., Yancopoulos, G.D., 1996. Eph receptors and ligands comprise two major specificity subclasses and are reciprocally compartmentalized during embryogenesis. Neuron 17, 9-19.

Gonfloni, S., Williams, J.C., Hattula, K., Weijland, A., Wierenga, R.K., Superti-Furga, G.,

1997. The role of the linker between the SH2 domain and catalytic domain in the regulation

and function of Src. EMBO J. 16, 7261-7271.

Grant, S.G., O'Dell, T.J., Karl, K.A., Stein, P.L., Soriano, P., Kandel, E.R., 1992. Impaired long-term potentiation, spatial learning, and hippocampal development in fyn mutant mice. Science 258, 1903-1910.

Gu, J., Wang, Y., Gu, X., 2002a. Evolutionary Analysis for functional divergence of Jak

protein kinase domains and tissue-specific genes. J. Mol. Evol. 54, 725-733.

Gu, X., 1998. Early Metazoan divergence was about 830 million years ago. J. Mol. Evol. 47,

369-371.

Gu, X., 1999. Statistical methods for testing functional divergence after gene duplication.

Mol. Biol. Evol. 16, 1664-1674.

Gu, X., Wang Y., Gu, J., 2002b. Age distribution of human gene families shows significant

roles of both large- and small-scale duplications in vertebrate evolution. Nature Genetics 31,

205-209.

Gurniak, C.B., Berg, L.J., 1996. A new member of the Eph family of receptors that lacks

protein tyrosine kinase activity. Oncogene 13(4), 777-786.

Hanks, S.K., Hunter, T., 1995. The eukaryotic protein kinase superfamily: kinase (catalytic)

domain structure and classification. FASEB J. 9, 576-596.

Hubbard, S.R., Till, J.H., 2000. Protein tyrosine kinase structure and function. Annu. Rev. Biochem. 69, 373-398.


Iwabe, N., Kuma, K., Miyata, T., 1996. Evolution of gene families and relationship with

organismal evolution: rapid divergence of tissue-specific genes in the early evolution of

chordates. Mol. Biol. Evol. 13, 483-493.

Kumar, S., Hedges, S.B., 1998. A molecular timescale for vertebrate evolution. Nature 392,

917-920.

Li, W.H., 1983. Evolution of duplicate genes and pseudogenes. In: Nei, M., Koehn, R.K. (Eds.), Evolution of genes and proteins. Sinauer Associates, Sunderland, MA, pp. 14-37.

McLysaght, A., Hokamp, K., Wolfe, K.H., 2002. Extensive genomic duplication during early

chordate evolution. Nature Genetics 31, 200-204.

Miyata, T., Suga, H., 2001. Divergence pattern of animal gene families and relationship with

the Cambrian explosion. BioEssays 23, 1018-1027.

Muraoka, R.S., Sun, W.Y., Colbert, M.C., Waltz, S.E., Witte, D.P., Degen, J.L., Friezner Degen, S.J., 1999. The Ron/STK receptor tyrosine kinase is essential for peri-implantation development in the mouse. J. Clin. Invest. 103(9), 1277-1285.

Ohno, S., 1970. Evolution by gene duplication. Springer-Verlag, Berlin.

Pebusque, M.J., Coulier, P., Birnbaum, D., Pontarotti, P., 1998. Ancient large-scale genome

duplications: phylogenetic and linkage analyses shed light on chordate genome evolution.

Mol. Biol. Evol. 15(9), 1145-1159.

Popovici, C., Leveugle, M., Birnbaum, D., Coulier, P., 2001. Homeobox gene clusters and the human paralogy map. FEBS Lett. 491(3), 237-242.

Robinson, D., Wu, Y., Lin, S., 2000. The protein tyrosine kinase family of the human genome. Oncogene 19, 5548-5557.

Rousset, D., Agnes, R., Lanchaume, P., Andre, C., Galibert, P., 1995. Molecular evolution of the genes encoding receptor tyrosine kinase with immunoglobulin like domains. J. Mol. Evol. 41, 421-429.

Schlessinger, J., 2000. Cell signaling by receptor tyrosine kinases. Cell 103(2), 211-225.

Sridhar, R., Hanson-Painton, O., Cooper, D.R., 2000. Protein kinases as therapeutic targets. Pharm. Res. 17(11), 1345-1353.

Suga, H., Kuma, K., Iwabe, N., Nikoh, N., Ono, K., Koyanagi, M., Hoshiyama, D., Miyata, T., 1997. Intermittent divergence of the protein tyrosine kinase family during animal evolution. FEBS Lett. 412(3), 540-546.


Takezaki, N., Rzhetsky, A., Nei, M., 1995. Phylogenetic test of the molecular clock and

linearized trees. Mol. Biol. Evol. 12(5), 823-833.

Thomas, S.M., Brugge, J.S., 1997. Cellular functions regulated by Src family kinases. Annu. Rev. Cell Dev. Biol. 13, 513-609.

Uehara, Y., Minowa, O., Mori, C., Shiota, K., Kuno, J., Noda, T., Kitamura, N., 1995. Placental defect and embryonic lethality in mice lacking hepatocyte growth factor/scatter factor. Nature 373(6516), 702-705.

Urfer, R., Tsoulfas, P., O'Connell, L., Hongo, J.A., Zhao, W., Presta, L.G., 1998. High resolution mapping of the binding site of TrkA for nerve growth factor and TrkC for neurotrophin-3 on the second immunoglobulin-like domain of the Trk receptors. J. Biol. Chem. 273(10), 5829-5840.

Venter, J.C., Adams, M.D., Myers, E.W. et al., 2001. The sequence of the human genome. Science 291, 1304-1351.

Wang, Y., Gu, X., 2000. Evolutionary patterns of gene families generated in the early stage

of vertebrates. J. Mol. Evol. 51, 88-96.

Wang, Y., Gu, X., 2001. Functional divergence in the caspase gene family and altered

functional constraints: statistical analysis and prediction. Genetics 158, 1311-1320.

Wiesmann, C., Ultsch, M.H., Bass, S.H., de Vos, A.M., 1999. Crystal structure of nerve growth factor in complex with the ligand-binding domain of the TrkA receptor. Nature 401(6749), 184-188.

Williams, J.C., Wierenga, R.K., Saraste, M., 1998. Insights into Src kinase functions: structural comparisons. Trends Biochem. Sci. 23, 179-184.


CHAPTER III: EVOLUTIONARY PERSPECTIVES FOR FUNCTIONAL DIVERGENCE OF JAK PROTEIN KINASE DOMAINS AND TISSUE-SPECIFIC GENES

A paper published in the Journal of Molecular Evolution¹

Jianying Gu²,³, Yufeng Wang², Xun Gu²,⁴

Abstract

Jak (Janus kinase) is a non-receptor tyrosine kinase that plays important roles in signal transduction pathways. The unique feature of Jak is that, in addition to a fully functional tyrosine kinase domain (JH1), Jak possesses a pseudokinase domain (JH2). Although JH2 lost its catalytic function, experimental evidence has shown that this domain may have acquired some new but unknown functions. This apparent functional divergence after the (internal) domain duplication may result in dramatic changes of selective constraints at some sites. We have conducted a data analysis to test this hypothesis. Our result shows that shifted selective constraints (or shifted evolutionary rates) between JH1 and JH2 domains are statistically significant. Amino acid sites predicted by posterior analysis can be classified into two groups: very conserved in JH1 but highly variable in JH2, and vice versa. Moreover, we have studied the evolutionary pattern of four tissue-specific genes, Jak1, Jak2, Jak3 and Tyk2, which were generated in the early stages of vertebrates. We found that after the (first) gene duplication, site-specific rate shifts between Jak2/Jak3 and Jak1/Tyk2 are significant, presumably as a consequence of functional divergence among these genes. The implication of our study for functional genomics is discussed.

Introduction

1 Reprinted with permission of Journal of Molecular Evolution, 2002, 54, 725-733.
2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Program of Bioinformatics and Computational Biology (BCB), Iowa State University.
3 Primary researcher and author
4 Author for correspondence


Jak proteins, a group of non-receptor tyrosine kinases, play a crucial role in cytokine signaling (Darnell et al. 1994; Leonard and O'Shea 1998). The Jak family consists of four mammalian members: Jak1, Jak2, Jak3, and Tyk2 (Wilks et al. 1991; Harpur et al. 1992; Rane and Reddy 1994). Homologous Jaks from birds and teleosts, as well as a Drosophila Jak homologue, hopscotch, have also been sequenced (Binari and Perrimon 1994; Conway et al. 1997). The distinguishing feature of the Jak protein family is the existence of tandem kinase (JH1) and pseudokinase (JH2) domains. The tandem kinase domain has all the functional features of a typical tyrosine kinase domain, while the pseudokinase domain has lost its catalytic activity (Guschin et al. 1995; Feng et al. 1997; Zhou et al. 1997).

Among metazoan protein tyrosine kinases (PTKs), only Jak proteins maintain a pseudokinase domain (JH2). The reason for this remains unclear. JH2 has all the subdomains shared by typical tyrosine kinase domains, indicating a common origin with other kinase domains, e.g., via internal domain duplication. Thus, these two domains in Jak proteins provide an interesting case study of functional divergence in proteins from an evolutionary perspective. It is believed that the lack of some typical motifs shared by functional kinase domains rendered the pseudokinase domain catalytically inactive (Wilks et al. 1991; Frank et al. 1995). On the other hand, knock-out evidence has shown either abrogation or stimulation of kinase activity when the JH2 domain is deleted in Jak2, Jak3 or Tyk2 (Velazquez et al. 1995; Candotti et al. 1997; Luo et al. 1997). These facts imply that after the internal domain duplication, the pseudokinase domain has undergone loss-of-function and gain-of-function simultaneously. An interesting question in molecular evolution is whether functional alterations in the JH2 domain result in considerable changes in selective constraints (i.e., different evolutionary rates) at the sites involved. As a consequence, some well-conserved sites in the JH1 domain may become highly variable in the JH2 domain, and vice versa. Such a pattern has been described as "covarion behavior" (Gaucher et al. 2001), or Type I functional divergence (Gu 1999).

In this paper, we use Gu's (1999) method to test the hypothesis that altered selective constraints occurred at some positions between the JH1 and JH2 domains. We then study the evolutionary pattern among vertebrate Jak tissue-specific genes and compare it to the JH1/JH2 domain evolution. The implications of our results for functional divergence and specification in the Jak/Stat pathway are discussed.

Methods

We searched the Pfam database (http://pfam.wustl.edu/) for sequences that have the tandem kinase or pseudokinase domains of Jak genes. For comparison, the kinase domain sequences of the FGFR (fibroblast growth factor receptor) and EGFR (epidermal growth factor receptor) gene families were also downloaded.

To find all available sequences that belong to the Jak gene family, Gapped BLAST and PSI-BLAST searches were performed against several protein databases using the human Jak1 gene as the query sequence (Altschul et al. 1997). After the exhaustive search, partial and redundant sequences were removed from further analysis. The final data set includes 22 complete vertebrate Jak sequences and one Drosophila homologue, hopscotch (see the Figure 4 legend for accession numbers and species).

Multiple alignment and phylogenetic analysis

The hidden Markov model (HMM) multiple alignment of the tandem kinase and pseudokinase domains was obtained from Pfam and then edited manually. The multiple alignment of complete Jak sequences was obtained using the program Clustal X (Thompson et al. 1997). These multiple alignments are available upon request. A phylogenetic tree was inferred by the neighbor-joining method (Saitou and Nei 1987) using the software MEGA2.0 (http://www.megasoftware.net/) (Kumar et al. 1994). Parsimony (PAUP) and likelihood (PHYLIP) methods were also used for phylogenetic reconstruction to examine the sensitivity of the results to the tree-making method.
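The legend of Figure 4 states that the neighbor-joining tree was built from Poisson distances. As a minimal sketch of that distance (not the MEGA implementation), the proportion p of differing sites between two aligned sequences is corrected as d = -ln(1 - p). The function names, the gap handling (columns with a gap in either sequence are skipped), and the toy sequences are assumptions made for illustration.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing residues over columns without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored pairwise
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differing / compared


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected amino acid distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


# Hypothetical toy fragments, not taken from the Jak alignment.
print(poisson_distance("ACDEFGHIKL", "ACDEFGHIKV"))
```

A matrix of such pairwise distances is the input that a neighbor-joining program expects; the correction matters more as sequences diverge, because multiple substitutions at the same site become more likely.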

The ratio of the nonsynonymous rate to the synonymous rate (dn/ds) was calculated using a modified version of the Nei-Gojobori method (as implemented in MEGA2.0). Assuming that synonymous substitutions are virtually neutral, dn/ds > 1 indicates positive selection, dn/ds < 1 indicates negative selection, and dn/ds = 1 indicates neutral evolution.
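The interpretation rule above can be written as a small helper. This is only a sketch: the function name and the tolerance used to treat a ratio as equal to 1 are assumptions, and a real analysis would test whether the ratio differs significantly from 1 rather than compare point estimates.

```python
def classify_selection(ratio: float, tol: float = 1e-9) -> str:
    """Label a dn/ds ratio using the rule stated in the text."""
    if ratio < 0:
        raise ValueError("dn/ds cannot be negative")
    if abs(ratio - 1.0) <= tol:  # assumption: tolerance for 'equal to 1'
        return "neutral evolution"
    if ratio > 1.0:
        return "positive selection"
    return "negative selection"


# Full-length human-mouse dn/ds values reported in Table 4.
for gene, ratio in [("Jak1", 0.074), ("Jak2", 0.089), ("Jak3", 0.274)]:
    print(gene, ratio, classify_selection(ratio))
```

All three reported ratios are well below 1, so each gene is labeled as being under negative (purifying) selection; what differs among them is the strength of the constraint.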

Testing Type I functional divergence (altered selective constraint)

Type I functional divergence refers to the evolutionary process that results in altered selective constraints (different evolutionary rates) between two duplicate genes, regardless of the underlying evolutionary mechanisms (Gu 1999, 2001). Its functional-structural basis was well illustrated in a case study of the caspase family (Wang and Gu 2001). Consider two gene clusters generated by duplication. Briefly, in each cluster an amino acid site is called an $F_1$-site (functional divergence-related) if its evolutionary rate differs from that in the ancestral gene; otherwise it is called an $F_0$-site (functional divergence-unrelated). Consequently, in the case of two gene clusters, a site can be in either of two states: (1) $S_0$ or (2) $S_1$, where $S_0$ is defined as a site being $F_0$ in both clusters and $S_1$ is defined as a site being $F_1$ in at least one cluster. The coefficient of Type I functional divergence between two gene clusters A and B, $\theta_{AB}$, is defined as the probability of a site being in the $S_1$ state. Thus, the null hypothesis $\theta = 0$ means that the evolutionary rate is virtually the same between duplicate genes at each site. Note that many models for rate variation among sites are special cases of $\theta = 0$. A probabilistic model has been developed (Gu 1999) to estimate $\theta$ by establishing the relationship between this type of functional divergence and the observed amino acid configurations.

Rejection of the null hypothesis (i.e., $\theta > 0$) provides statistical evidence for altered selective constraints at amino acid sites after gene duplication. In that case, a site-specific profile can be used to identify the amino acid sites responsible. Specifically, we define $Q_k$ as the posterior probability that site k is in state $S_1$ ($0 < Q_k < 1$). A large $Q_k$ indicates a high probability that the functional constraint (or the evolutionary rate) of a site differs between the two clusters.
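Once a profile of posterior probabilities is available, screening it is a matter of applying cutoffs. The sketch below assumes the profile is a plain list of $Q_k$ values indexed from site 1 and uses the cutoffs 0.5 and 0.9 that appear in the Results; the function name and the toy profile are hypothetical, and computing $Q_k$ itself requires the probabilistic model of Gu (1999), which is not reproduced here.

```python
def summarize_profile(q_values, moderate=0.5, high=0.9):
    """Group 1-based site indices by their posterior probability Q_k."""
    for q in q_values:
        if not 0.0 <= q <= 1.0:
            raise ValueError("each Q_k must be a probability")
    sites = list(enumerate(q_values, start=1))
    return {
        "low": [k for k, q in sites if q < moderate],
        "candidate": [k for k, q in sites if q >= moderate],
        "high": [k for k, q in sites if q > high],
    }


# Hypothetical toy profile of five sites.
profile = [0.10, 0.55, 0.95, 0.30, 0.92]
print(summarize_profile(profile))
```

Sites in the "high" group are a subset of the "candidate" group, which mirrors how the Results report the sites above 0.9 as part of the sites above 0.5.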

The software DIVERGE (http://xgul.zool.iastate.edu) was used for the functional

divergence analysis.

Functional distance analysis

One limitation of the two-cluster analysis described above is that it cannot test whether one duplicate gene has more shifted evolutionary rates than the other. This issue is addressed by a simple method, described below, that can be applied when more than two homologous gene clusters are available (Wang and Gu 2001). First, the Type I functional distance between any two clusters is defined as $d_F = -\ln(1-\theta)$. Under the assumption of independence, Wang and Gu (2001) have shown that $d_F$ is additive, i.e., for two clusters A and B, $d_F(A,B) = b_F(A) + b_F(B)$, where $b_F(x)$ is the functional branch length of a given gene cluster x. A large $b_F$ value for a gene cluster indicates that the evolutionary conservation may have shifted at many sites.
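The transformation from the coefficient of functional divergence to the functional distance can be checked directly against Table 1. The sketch below is a plain restatement of the formula; the function name and the dictionary layout are assumptions, and the input values are the estimates listed above the diagonal of Table 1.

```python
import math


def functional_distance(theta: float) -> float:
    """Type I functional distance d_F = -ln(1 - theta)."""
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    return -math.log(1.0 - theta)


# Coefficients of functional divergence from Table 1.
theta_table1 = {
    ("JH1", "JH2"): 0.412,
    ("JH1", "FGFR"): 0.222,
    ("JH1", "EGFR"): 0.496,
    ("JH2", "FGFR"): 0.566,
    ("JH2", "EGFR"): 0.561,
    ("FGFR", "EGFR"): 0.432,
}

for pair, theta in theta_table1.items():
    print(pair, round(functional_distance(theta), 3))
```

The rounded results agree with the distances listed below the diagonal of Table 1, which is a quick consistency check on the table.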

The estimated coefficients of Type I functional divergence ($\theta$) for all pairs of clusters can be used to create a matrix of $d_F$ values. Given this matrix, a standard least-squares method based on the formula $d_F(A,B) = b_F(A) + b_F(B)$ can be used to estimate $b_F$ for each gene cluster. If $b_F \approx 0$, the evolutionary rate of each site in this duplicate gene has remained nearly the same since the gene duplication event, suggesting that the derived state is similar to the ancestral state for this particular cluster.
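The text specifies only "a standard least squares method", so the following is one possible sketch rather than the authors' implementation. For n clusters on a star tree, minimizing the squared differences between $d_F(A,B)$ and $b_F(A) + b_F(B)$ gives the closed form $b_F(A) = (R_A - T/(n-1))/(n-2)$, where $R_A$ is the sum of the distances involving A and T is the sum of all pairwise distances. The function name and data layout are assumptions; the distances are those below the diagonal of Table 1.

```python
def functional_branch_lengths(distances):
    """Ordinary least-squares fit of d_F(A,B) = b_F(A) + b_F(B)."""
    pairs = {frozenset(pair): d for pair, d in distances.items()}
    if any(len(pair) != 2 for pair in pairs):
        raise ValueError("each key must name two different clusters")
    clusters = sorted({name for pair in pairs for name in pair})
    n = len(clusters)
    if n < 3:
        raise ValueError("need at least three clusters")
    if len(pairs) != n * (n - 1) // 2:
        raise ValueError("a distance is required for every pair of clusters")
    row_sums = dict.fromkeys(clusters, 0.0)
    total = 0.0
    for pair, d in pairs.items():
        total += d
        for name in pair:
            row_sums[name] += d
    sum_of_branches = total / (n - 1)  # follows from the normal equations
    return {
        name: (row_sums[name] - sum_of_branches) / (n - 2)
        for name in clusters
    }


# Functional distances from Table 1.
d_table1 = {
    ("JH1", "JH2"): 0.531,
    ("JH1", "FGFR"): 0.251,
    ("JH1", "EGFR"): 0.685,
    ("JH2", "FGFR"): 0.835,
    ("JH2", "EGFR"): 0.823,
    ("FGFR", "EGFR"): 0.566,
}

for name, b in functional_branch_lengths(d_table1).items():
    print(name, round(b, 3))
```

With the rounded Table 1 distances as input, the estimates agree with the branch lengths shown in Figure 2 to within about 0.001. Note that nothing in this unconstrained fit prevents a negative estimate when a cluster is close to every other cluster.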

Results

Pattern of functional divergence between tandem kinase and pseudokinase domains

The kinase and pseudokinase domains in Jaks

In addition to the regular kinase domain (JH1), Jak proteins contain a pseudokinase domain (JH2), whose functional role has not been clearly determined (Aringer et al. 1999). To explore the evolutionary pattern of the JH1 and JH2 domains, we reconstructed a neighbor-joining (NJ) tree that includes Jaks and two closely related protein tyrosine kinases, FGFR and EGFR. The inferred phylogeny shows that the tandem kinase (JH1) and pseudokinase (JH2) domains are evolutionarily distinct (Figure 1). Indeed, the tandem kinase domains (JH1) of Jaks appear to be more closely related to the functional kinase domains of FGFRs and EGFRs, while the pseudokinase domains (JH2) of Jaks form a unique clade. In fact, we have found that the JH2 domain was generated before the emergence of most member genes of the protein tyrosine kinase supergene family (data not shown).

It has been suggested that the pseudokinase domains (JH2) no longer exhibit catalytic activity, but may have acquired some new functions (Aringer et al. 1999). Thus, it is interesting to test whether this functional divergence resulted in shifted selective constraints (different evolutionary rates) at some sites between the tandem kinase (JH1) and pseudokinase (JH2) domains. To this end, we estimated the coefficient of functional divergence ($\theta$) for each pair among the JH1, JH2, EGFR and FGFR domains, as shown above the diagonal in Table 1. All of the estimates are significantly greater than 0 (p < 0.01). In particular, $\theta_{JH1,JH2} \pm \mathrm{se} = 0.412 \pm 0.049$ provides statistical evidence supporting the hypothesis of altered selective constraints between the tandem kinase (JH1) and pseudokinase (JH2) domains in Jak proteins.

[Figure 1: neighbor-joining tree of kinase domain sequences. The JH1 domains of Jak1, Tyk2, Jak2 and Jak3 (human, mouse, rat, pig, chicken, common carp, zebrafish, pufferfish where available, plus Drosophila Jak) group with EGFR and FGFR, and the JH2 domains form a separate clade. Bootstrap values are shown at the nodes; scale bar 0.1.]

Figure 1. The NJ tree of Jaks, FGFRs and EGFRs based on the sequence alignment of kinase

domains. The statistical reliability of the inferred tree topology was assessed by the bootstrap

technique (Felsenstein 1985).

Table 1: $\theta_{AB}$ values from all pairwise comparisons among Jaks, FGFRs, and EGFRs.

        JH1      JH2            FGFR           EGFR
JH1     -        0.412±0.049    0.222±0.055    0.496±0.071
JH2     0.531    -              0.566±0.080    0.561±0.080
FGFR    0.251    0.835          -              0.432±0.096
EGFR    0.685    0.823          0.566          -

Note: JH1: the tandem kinase domain in Jak; JH2: the pseudokinase domain in Jak. Values above the diagonal are $\theta_{AB}$ for all pairwise comparisons among Jaks, FGFRs, and EGFRs, where $\theta_{AB}$ is defined as the coefficient of functional divergence between cluster A and cluster B. Values below the diagonal are $d_F(A,B) = -\ln(1-\theta_{AB})$, which is defined as the functional distance between cluster A and cluster B.

To further test the hypothesis that JH2 is more functionally divergent than JH1, we conducted the functional distance analysis (Wang and Gu 2001). The functional distances between domains [$d_F = -\ln(1-\theta)$] are shown below the diagonal in Table 1. Subsequently, the functional branch length ($b_F$) for each domain was estimated by the least-squares method; the result can be illustrated as a star-like tree (Figure 2). The null hypothesis of equal functional branch lengths is rejected (p < 0.05). Interestingly, the functional branch length of the pseudokinase domain (JH2) of Jak proteins is about four times greater than that of the kinase domain (JH1), indicating that the distinct functional role of JH2 was likely generated by the episodic evolution of kinase domains.

Important amino acid sites for altered functional constraints between JH1 and JH2

A site-specific profile based on the posterior probability ($Q_k$) is used to identify critical amino acid sites responsible for functional divergence between the tandem kinase (JH1) and pseudokinase (JH2) domains. Among 212 amino acid sites, 154 (73%) have $Q_k < 0.5$, implying a low probability of contributing to functional divergence. Of the remaining 58 sites with $Q_k > 0.5$, 21 show a very high probability of being functional divergence-related ($Q_k > 0.9$). These 21 sites fall into two categories: (I) conserved in the tandem kinase (JH1) domain but variable in the pseudokinase (JH2) domain; (II) conserved in the pseudokinase domain but variable in the tandem kinase domain (Table 2).

[Figure 2: star-like tree with four branches labeled with their functional branch lengths: JH1 (kinase domain) 0.118; JH2 (pseudokinase domain) 0.479; EGFR (kinase domain) 0.421; FGFR (kinase domain) 0.21.]

Figure 2. The tree-like topology of kinase domains of Jaks, FGFRs, and EGFRs in terms of the functional branch length $b_F$.

Table 2. Predicted critical amino acid sites responsible for functional divergence between the tandem kinase and pseudokinase domains in Jaks.

Category I                 Category II
Position (k)   Q_k         Position (k)   Q_k
26             0.997       38             0.998
203            0.992       47             0.967
30             0.975       41             0.960
20             0.968       190            0.959
46             0.967       105            0.956
23             0.966       103            0.954
162            0.963       44             0.931
70             0.959       121            0.918
137            0.957       87             0.906
21             0.922
159            0.913
196            0.906

Category I: conserved in the tandem kinase domain, variable in the pseudokinase domain. Category II: conserved in the pseudokinase domain, variable in the tandem kinase domain.

Category I: Of the 12 sites belonging to this category, one, site 137 ($Q_{137} = 0.957$), has been demonstrated to be a determining site for the function of the tandem kinase domain (JH1). It corresponds to the second tyrosine (Y) of a conserved (E/D)YY motif in the Jak2 protein. This motif, which is located in the activation loop of Jak2, regulates the kinase activity by autophosphorylation (Feng et al. 1997). In Tyk2, these two consecutive tyrosines (YY) have also been identified as phosphorylation sites (Gauzzi et al. 1996).

Interestingly, the multiple alignment clearly shows that site 137 is invariant in the tandem kinase (JH1) domains. In contrast, the same position in the pseudokinase domains (JH2) has a variety of amino acids with very different chemical properties. For example, some JH2 domains have amino acids with non-polar side chains such as glycine, alanine and proline, and others have uncharged polar amino acids such as serine and threonine (Figure 3A). This observation can be explained by a relaxed selective constraint caused by the loss of phosphorylation function in the JH2 domains.

Category II: Nine predicted sites belong to this category (Figure 3B). Among them, site 103 is predicted to be highly functional divergence-related ($Q_{103} = 0.954$). Experimental data show that a glutamic acid (E)-to-lysine (K) replacement at this site in the pseudokinase (JH2) domain hyperactivates the Jak-Stat pathway in Drosophila and mammalian species (Luo et al. 1997). It seems likely that after the internal domain duplication, the tandem kinase domain (JH1) largely maintained the original catalytic function, while the pseudokinase domain (JH2) may have acquired some as yet unidentified new functions, resulting in a set of JH2-specific conserved sites.

Pattern of functional divergence between Jak member genes

Functional divergence of Jak gene family

Figure 4 shows the neighbor-joining (NJ) tree of the four Jak gene members (Jak1, Jak2, Jak3 and Tyk2). The parsimony method (PAUP) and the likelihood method (PHYLIP) give essentially the same topology (data not shown). The phylogenetic analysis indicates that the Jak member genes were generated by two gene duplications in the early stage of vertebrates, i.e., before the emergence of teleosts. The first gene duplication resulted in the common ancestors of Jak2/3 and Tyk2/Jak1, followed by the second one resulting in the current four member genes. Note that Wang and Gu (2000) have shown that many tissue-specific gene families show a similar pattern, raising the possibility of large-scale duplication(s) in early vertebrates.

[Figure 3: two alignment panels listing, for each Jak1, Jak2, Jak3 and Tyk2 sequence (human, mouse, rat, pig, chicken, carp, zebrafish, pufferfish where available), the residues of the JH1 and JH2 domains at (A) sites 20, 21, 23, 26, 30, 46, 70, 137, 159, 162, 196 and 203 and (B) sites 38, 41, 44, 47, 87, 103, 105, 121 and 190.]

Figure 3. Candidate amino acid sites related to functional divergence. (A) Category I: conserved in tandem kinase domains (JH1), variable in pseudokinase domains (JH2); (B) category II: conserved in pseudokinase domains (JH2), variable in tandem kinase domains (JH1).

[Figure 4: neighbor-joining tree with four vertebrate clades (Jak2, Jak3, Tyk2, Jak1) and Drosophila as the outgroup; the two duplication nodes are marked 1st and 2nd, bootstrap values are shown at the nodes, and the scale bar is 0.1.]

Figure 4. The phylogenetic tree of the Jak gene family. The neighbor-joining algorithm was used to infer the topology based on the multiple sequence alignment with Poisson distance. Bootstrap scores >50% are presented. 1st and 2nd represent the time points of the two rounds of gene duplication. Accession numbers: Jak2: L16956 (mouse); U13396 (rat); AF005216 (human); AB006011 (pig); AF090382 (pufferfish); AJ005690 (zebrafish). Jak3: U09607 (human); L40172 (mouse); D28508 (rat); AF034576 (chicken); AF148993 (carp); AF091238 (pufferfish). Tyk2: X54637 (human); AF173032 (mouse); AF090383 (pufferfish). Jak1: M64174 (human); AB036335 (pig); M33425 (mouse); AF096264 (chicken); L24895 (carp); U82980 (zebrafish); U53213 (pufferfish). Hopscotch: L26975 (Drosophila).

While nucleotide changes in the regulatory region after gene duplication may be the first step in duplicate gene preservation (Force et al. 1999), amino acid replacements are responsible for the functional divergence and specification among tissue-specific isoforms. To explore whether (site-specific) altered selective constraints occurred during Jak family evolution, we estimated $\theta$ between Jak member genes, as well as between the two groups (Jak2/3 and Tyk2/Jak1) that were generated by the first round of gene duplication. The estimation is based on the phylogenetic tree in Figure 4. We have found that (1) (site-specific) altered selective constraint after the first gene duplication is statistically significant, and (2) the $\theta$ value between Jak2 and Jak3 is 0.019 ± 0.059, indicating no significant overall site-specific shift of evolutionary rate between them (Table 3A).
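Table 3 reports a likelihood ratio statistic (LRT) and a P value for each comparison, but the text does not state the reference distribution. The sketch below assumes a chi-square distribution with one degree of freedom, whose upper tail can be written with the complementary error function; that choice and the function name are assumptions, and the inputs are the LRT values listed in Table 3A.

```python
import math


def lrt_p_value(lrt: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df."""
    if lrt < 0:
        raise ValueError("the likelihood ratio statistic cannot be negative")
    # P(X > x) for X ~ chi-square(1) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))


# LRT values from Table 3A.
lrt_table3a = {
    ("Jak1", "Jak2"): 13.609,
    ("Jak1", "Jak3"): 9.502,
    ("Jak2", "Jak3"): 0.110,
    ("Jak1/Tyk2", "Jak2/Jak3"): 7.292,
    ("Jak1", "Jak2/Jak3"): 14.367,
}

for pair, lrt in lrt_table3a.items():
    print(pair, lrt, lrt_p_value(lrt))
```

Under this assumption the computed tail probabilities are close to the P values listed in Table 3A, with the Jak2 versus Jak3 comparison clearly non-significant.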

Furthermore, we have estimated the coefficients of Type I functional divergence between Jak genes for three separate regions: (1) the tandem kinase domain (JH1), (2) the pseudokinase domain (JH2), and (3) the surrounding region excluding the JH1 and JH2 domains (Table 3B). Our results can be summarized as follows. First, after the first-round gene duplication, site-specific altered selective constraint is statistically significant in the JH1 domain (p < 0.01) but not in the JH2 domain; there is weak evidence in the surrounding region (p ≈ 0.05). Second, after the second gene duplication, no statistical evidence is observed for altered selective constraint in any region. Since the estimate of $\theta$ in the JH1 domain tends to be higher than those in the JH2 domain or the surrounding region, we conclude that the JH1 domain is mainly responsible for functional divergence among Jak tissue-specific genes.

Isoform-specific functional divergence of Jak gene family

The functional branch lengths ($b_F$) of Jak1, Jak2, and Jak3 can be estimated from the pairwise estimates of $\theta$ among Jak1, Jak2, and Jak3. Table 4 shows the results for the full length of the Jak proteins and for three regions (JH1, JH2, and the surrounding region). Interestingly, the level of altered selective constraints of member genes, measured by $b_F$, follows $b_F$(Jak1) > $b_F$(Jak2) > $b_F$(Jak3), while the level of altered selective constraints of domains follows JH1 > surrounding region > JH2. In particular, $b_F$ for Jak3 is virtually zero for all three regions, whereas JH2 (the pseudokinase domain) shows no significant functional branch length in Jak2 and Jak3.

Table 3. The pairwise coefficients of functional divergence ($\theta_{AB}$) for different member genes of the Jak gene family. (A) $\theta$ values for the full-length Jak proteins; (B) $\theta$ values for the different regions of Jak proteins.

(A)
Cluster 1    Cluster 2    Gene duplication (1)   theta ± se (2) (ML)   LRT (3)   P value
Jak1         Jak2         I                      0.208 ± 0.056         13.609    <0.0005
Jak1         Jak3         I                      0.158 ± 0.051         9.502     0.002
Jak2         Jak3         II                     0.019 ± 0.059         0.110     0.75
Jak1/Tyk2    Jak2/Jak3    I                      0.072 ± 0.027         7.292     0.007
Jak1         Jak2/Jak3    I                      0.147 ± 0.039         14.367    <0.0005

Note: As a single gene cluster, Tyk2 was excluded from the analysis due to the insufficient number of sequences.

(B)
Cluster 1    Cluster 2    Gene duplication (1)   JH1            JH2            Surrounding    Full length
Jak1         Jak2         I                      0.435±0.140    0.214±0.125    0.294±0.100    0.208±0.056
Jak1         Jak3         I                      0.231±0.086    0.052±0.143    0.178±0.089    0.158±0.051
Jak2         Jak3         II                     0.217±0.150    0.001±0.022    0.022±0.100    0.019±0.059
Jak1/Tyk2    Jak2/Jak3    I                      0.122±0.050    0.094±0.068    0.101±0.053    0.072±0.027
Jak1         Jak2/Jak3    I                      0.174±0.067    0.086±0.097    0.225±0.074    0.147±0.039

(1) I represents the first gene duplication; II represents the second gene duplication. (2) se: standard error. (3) LRT: likelihood ratio statistic.

Table 4. Functional branch lengths ($b_F$) of different regions in three Jak isoforms.

        JH1       JH2       Surrounding   Full length   dn/ds (Full length)
Jak1    0.295     0.147     0.261         0.193         0.074
Jak2    0.277     0.095     0.087         0.040         0.089
Jak3    -0.032    -0.094    -0.065        -0.021        0.274
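With exactly three clusters, the additive relation $d_F(A,B) = b_F(A) + b_F(B)$ gives three equations in three unknowns, so the branch lengths follow without any fitting. As a worked check, and assuming this is how the three-isoform values were obtained, the full-length entries of Table 4 can be recovered from the full-length $\theta$ estimates in Table 3 (indices 1, 2 and 3 denote Jak1, Jak2 and Jak3):

```latex
\begin{align*}
d_F(1,2) &= -\ln(1-0.208) \approx 0.2332, \\
d_F(1,3) &= -\ln(1-0.158) \approx 0.1720, \\
d_F(2,3) &= -\ln(1-0.019) \approx 0.0192, \\[4pt]
b_F(\mathrm{Jak1}) &= \tfrac{1}{2}\,[\,d_F(1,2) + d_F(1,3) - d_F(2,3)\,] \approx 0.193, \\
b_F(\mathrm{Jak2}) &= \tfrac{1}{2}\,[\,d_F(1,2) + d_F(2,3) - d_F(1,3)\,] \approx 0.040, \\
b_F(\mathrm{Jak3}) &= \tfrac{1}{2}\,[\,d_F(1,3) + d_F(2,3) - d_F(1,2)\,] \approx -0.021.
\end{align*}
```

The small negative value for Jak3 arises because the Jak1 to Jak2 distance exceeds the sum of the other two distances; it is read here as a branch length of effectively zero.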

Table 4 also includes the ratios of nonsynonymous to synonymous rates (dn/ds) of Jak member genes, based on the human-mouse orthologous pair of each gene, which can be used to measure the difference in selective constraints among member genes of a gene family (Tsunoyama et al. 1998). Interestingly, Jak3, for which $b_F$ is virtually 0, shows the highest dn/ds ratio. This observed negative association between dn/ds and $b_F$ can be interpreted as follows. In the early stage after gene duplication, functional divergence may occur in one of the two lineages (indicated by a large $b_F$ value). If this process leads to the acquisition of some new functions, a stronger functional constraint (indicated by a low dn/ds value) is expected.

In summary, during the evolution of Jak tissue-specific genes, Type I functional divergence (i.e., site-specific altered selective constraints) occurs mainly in the JH1 domain of the Jak1 and Jak2 genes. While the Jak1 and Jak2 genes may have evolved specialized functions in their kinase domains, Jak3 is more likely to have inherited the ancestral function.

Predicting important amino acid sites for Type I functional divergence

Critical amino acid sites responsible for the Type I functional divergence among Jak member genes can be predicted by $Q_k$, the posterior probability of site k being functional divergence-related (Gu 1999). Figure 5 shows the site-specific profile between the Jak1/Tyk2 and Jak2/Jak3 clusters, measured by $Q_k$. It clearly shows that, after the first gene duplication, only a small portion of sites have undergone shifted rates, as indicated by high posterior probabilities. These sites may be considered prime candidates for further functional assays.

Discussion

Domain and gene duplications are both important for Jak gene family proliferation and evolution. After gene (or domain) duplication, one copy mainly retains the original function, whereas the other copy is under relaxed evolutionary constraint that may result in functional divergence and/or specification (Ohno 1970; Li 1983). In this paper, we have investigated the evolutionary pattern of the tandem kinase (JH1) and pseudokinase (JH2) domains, a unique feature in the supergene family of protein tyrosine kinases. Our results are summarized as follows.

[Figure 5: plot titled "Jak1/Tyk2 vs. Jak2/3" showing $Q_k$ (y-axis, ticks from 0.2 to 0.8) against alignment site (x-axis, 1 to 951).]

Figure 5. Site-specific profile ($Q_k$) for clusters Jak1/Tyk2 vs. Jak2/Jak3.

In the very early stage of animal evolution, an internal duplication occurred within the ancestral Jak gene, producing two kinase domains. While one domain (JH1) largely retained the original kinase activity, the other (JH2) was free to accumulate amino acid replacements because of functional redundancy. When some key sites were replaced by other amino acids without seriously deleterious effects on survival, the kinase function of the JH2 domain was lost. Consequently, some sites that are well conserved in functional kinase domains turn out to be highly variable in JH2. Meanwhile, the JH2 domain appears to have acquired some new functions that may explain its long-term persistence during evolution. Despite the lack of substantial evidence for a physiological role of JH2, we did observe a group of sites that are highly conserved in the JH2 domain but variable in the JH1 domain.

It seems likely that the functional divergence between the JH1 and JH2 domains was completed before the origin of vertebrates. In the early vertebrate lineage, two rounds of gene duplication generated four tissue-specific vertebrate isoforms. Type I functional divergence (altered selective constraint) is significant after the first round of gene duplication. This divergence occurred mainly in the JH1 domain and not in the JH2 domain, indicating that the JH1 domain probably plays a major role in functional specification among tissue-specific isoforms.

[Figure 6: plot of $\theta$ (y-axis, ticks from 0.2 to 0.8) against relative alignment position (x-axis).]

Figure 6. The relationship between the $\theta$ value and the sequence alignment. The y-axis shows the $\theta$ values obtained when the alignment is slid by -10 to +10 positions relative to the manually adjusted alignment of kinase domains in Jaks.

Since our analysis is based on the amino acid alignment, the reliability of the multiple alignment is crucial for our interpretation. It has been reported that the coefficient of Type I functional divergence ($\theta$) can be overestimated when sequences are misaligned (Gu 1999, 2001). Indeed, we found a larger estimate ($\theta = 0.615$) between JH1 and JH2 for the unedited Pfam domain HMM alignment, which includes several apparently misaligned sites. After these sites were manually corrected according to the consensus sequence of protein kinase domains, which has been confirmed by the secondary structure, the estimate of $\theta$ was reduced to 0.439. Furthermore, we tested the effect of the alignment by shifting the alignment window between JH1 and JH2 from -10 to +10 positions. Figure 6 shows that all estimates of $\theta$ that resulted from the shifted alignments are extremely high. Our study suggests that a reasonably good alignment is a prerequisite for estimating the level of functional divergence.
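The text does not describe how the shifted alignments were produced, so the following is only one way such a sensitivity test could be set up. Each sequence of one cluster is slid by a fixed number of columns, padded with gap characters, and truncated to keep the alignment length unchanged. The function name, the padding and truncation choices, and the toy sequences are all assumptions; re-estimating $\theta$ for each shifted alignment (for example with DIVERGE) is not shown.

```python
def shift_cluster(sequences, offset, gap="-"):
    """Slide every aligned sequence by `offset` columns, keeping its length."""
    shifted = []
    for seq in sequences:
        length = len(seq)
        if abs(offset) >= length:
            new_seq = gap * length  # shifted entirely out of the window
        elif offset > 0:
            new_seq = gap * offset + seq[:length - offset]
        elif offset < 0:
            new_seq = seq[-offset:] + gap * (-offset)
        else:
            new_seq = seq
        shifted.append(new_seq)
    return shifted


# Hypothetical toy cluster; real input would be the aligned JH2 sequences.
jh2_toy = ["ABCDEFGHIJKL", "ABCDEYGHIJKL"]
for offset in range(-10, 11):
    shifted = shift_cluster(jh2_toy, offset)
    # Each shifted cluster would be paired with the unshifted JH1 cluster
    # and passed to the theta estimation step.
    print(offset, shifted[0])
```

An offset of 0 reproduces the curated alignment, so it serves as the reference point against which the misaligned variants are compared.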

Gu's (1999) method applies only to Type I functional divergence, which results in site-specific altered selective constraints among homologous genes within a gene family. For example, amino acid site 137 shows a typical Type I pattern, i.e., an invariant tyrosine in all tandem kinase domains but high variability in pseudokinase domains. It should be noted that this is only one of many approaches to detecting functional divergence from the evolutionary perspective. As indicated by many authors (e.g., Casari et al. 1995; Livingstone and Barton 1996; Gaucher et al. 2001), functional divergence may result in a dramatic change in amino acid properties but not in selective constraints, which is also called Type II functional divergence (Gu 1999). For instance, cluster-specific sites (or diagnostic sites) usually refer to those sites that, though highly conserved within gene clusters, consist of amino acids that are dramatically different (e.g., positively charged vs. negatively charged) between clusters. Another well-cited example is that site-by-site dependence may be related to site interactions in the protein's 3D structure (Pollock et al. 1999). A sophisticated model that includes all these evolutionary aspects of functional divergence of proteins is certainly desirable. However, because of the large number of unknown parameters, its efficiency in practice often becomes the major concern. Therefore, a specific model is useful for testing the statistical significance of a specific evolutionary perspective. Combined with previous studies (e.g., Gu 1999; Gaucher et al. 2001; Wang and Gu 2001), our analysis indicates that site-specific altered selective constraint occurs after gene duplications as a result of functional divergence in protein evolution.
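The contrast between the two patterns can be made concrete with a crude column-level heuristic. This is not Gu's (1999) probabilistic method: it ignores the phylogeny, uses strict invariance as the definition of "conserved", and labels any pair of different fixed residues as Type II-like without checking how different their chemical properties are. The function name, the labels, and the example columns are assumptions for illustration.

```python
def classify_site(column_a: str, column_b: str) -> str:
    """Label one alignment site from its residues in two clusters."""
    if not column_a or not column_b:
        raise ValueError("each cluster needs at least one residue")
    residues_a = set(column_a)
    residues_b = set(column_b)
    conserved_a = len(residues_a) == 1
    conserved_b = len(residues_b) == 1
    if conserved_a and conserved_b:
        if residues_a == residues_b:
            return "conserved in both clusters"
        return "Type II-like (cluster-specific residues)"
    if conserved_a or conserved_b:
        return "Type I-like (conserved in one cluster only)"
    return "variable in both clusters"


# Site 137 as described in the text: tyrosine in every JH1 domain, several
# different residues in JH2 (the JH2 column here is a hypothetical sample).
print(classify_site("YYYYYY", "GAPSTS"))
# Hypothetical cluster-specific site: lysine in one cluster, glutamate in the other.
print(classify_site("KKKK", "EEEE"))
```

A heuristic like this is only useful for eyeballing columns; the posterior probability $Q_k$ remains the quantity the analysis actually relies on.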

Acknowledgement: We are grateful to Dr. William Nordstrom and Peter Vedell for constructive comments. This study was supported by NIH grant R01 GM62118 (to X. G.). X. Gu is the DuPont Young Professor.

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402

Aringer M, Cheng A, Nelson JW, Chen M, Sudarshan C, Zhou YJ, O'Shea JJ (1999) Janus kinases and their role in growth and disease. Life Sciences 64: 2173-2186

Binari R, Perrimon N (1994) Stripe-specific regulation of pair-rule genes by hopscotch, a putative Jak family tyrosine kinase in Drosophila. Genes Dev. 8: 300-312

Candotti F, Oakes SA, Johnston JA, Giliani S, Schumacher RF, Mella P, Fiorini M, Ugazio

AG, B ado lato R, Notarangelo LD, Bozzi F, Macchi P, Strina D, Vezzoni P, Blaese RM,

O'Shea JJ, Villa A (1997) Structural and functional basis for JAK3-deficient severe

combined immunodeficiency. Blood 90: 3996-4003

Casari G, Sander C, Valencia A (1995) A method to predict functional residues in proteins.

Mzf. Arwcf. BmZ. 2: 171-178

Conway G, Margoliath A, Wong-Madden S, Roberts RJ, Gilbert W (1997) Jakl kinase is

required for cell migrations and anterior specification in zebrafish embryos. Proc. Natl. Acad.

USA 94: 3082-3087

Darnell JE Jr, Kerr IM, Stark GR (1994) Jak-STAT pathways and transcriptional activation

in response to I FN s and other extracellular signaling proteins. Science 264: 1415-1421

Felsenstein J (1985) Confidence limits on phytogenies: an approach using the bootstrap.

Evolution 39: 783-791

Feng J, Witthuhn BA, Matsuda T, Kohlhuber F, Kerr IM, Ihle JN (1997) Activation of Jak2

catalytic activity requires phosphorylation of Y1007 in the kinase activation loop. Mol. Cell.

17: 2497-2501

Force A, Lynch M, Pickett FB, Amores A, Y an YL, Postlethwait J (1999) Preservation of

duplicate genes by complementary, degenerative mutations. Genetics 151: 1531-1545

Frank SJ, Yi W, Zhao Y, Goldsmith JF, Gilliland G, Jiang J, Sakai I, Kraft AS (1995)

Regions of the JAK2 tyrosine kinase required for coupling to the growth hormone receptor.

J. Biol. Chem. 270: 14776-14785

Gaucher EA, Miyamoto MM, Benner SA (2001) Function-structure analysis of proteins

using covarion-based evolutionary approaches: Elongation factors. Proc. Natl. Acad. Sci.

USA 98: 548-552

Gauzzi MC, Velazquez L, McKendry R, Mogensen KE, Fellous M, Pellegrini S (1996)

Interferon-a-dependent activation of Tyk2 requires phosphorylation of positive regulatory

tyrosines by another kinase. J. Biol. Chem. 271: 20494-20500

Gu X (1999) Statistical methods for testing functional divergence after gene duplication.

Mol. Biol. Evol. 16: 1664-1674

Guschin D, Rogers N, Briscoe J, Witthuhn B, Watling D, Horn F, Pellegrini S, Yasukawa K,

Heinrich P, Stark GR, et al. (1995) A major role for the protein tyrosine kinase JAK1 in the

JAK/STAT signal transduction pathway in response to interleukin-6. EMBO J. 14: 1421-1429

Harpur AG, Andres AC, Ziemiecki A, Aston RR, Wilks AF (1992) Jak2, a third member of

the JAK family of protein tyrosine kinases. Oncogene 7: 1347-1353

Kumar S, Tamura K, Nei M (1994) MEGA: Molecular Evolutionary Genetics Analysis

software for microcomputers. Comput. Appl. Biosci. 10: 189-191

Landgraf R, Fischer D, Eisenberg D (1999) Analysis of heregulin symmetry by weighted

evolutionary tracing. Protein Eng. 12: 943-951

Leonard WJ, O'Shea JJ (1998) Jaks and STATs: biological implications. Annu. Rev.

Immunol. 16: 293-322

Li WH (1983) Evolution of duplicate genes and pseudogenes. In: Nei M and Koehn RK (eds)

Evolution of Genes and Proteins. Sinauer Associates, Sunderland, MA, pp. 14-37

Li WH (1993) Unbiased estimation of the rates of synonymous and nonsynonymous

substitution. J. Mol. Evol. 36: 96-99

Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding

surfaces common to protein families. J. Mol. Biol. 257: 342-358

Livingstone CD, Barton GJ (1996) Identification of functional residues and secondary

structure from protein multiple sequence alignment. Methods Enzymol. 266: 497-512

Luo H, Rose P, Barber D, Hanratty WP, Lee S, Roberts TM, D'Andrea AD, Dearolf CR

(1997) Mutation in the Jak kinase JH2 domain hyperactivates Drosophila and mammalian

Jak-Stat pathways. Mol. Cell. Biol. 17: 1562-1571

Ohno S (1970) Evolution by gene duplication. Springer, New York.

Pollock DD, Taylor WR, Goldman N (1999) Coevolving protein residues: maximum

likelihood identification and relationship to structure. J. Mol. Biol. 287: 187-198

Rane SG, Reddy EP (1994) Jak3: a novel JAK kinase associated with terminal differentiation

of hematopoietic cells. Oncogene 9: 2415-2423

Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing

phylogenetic trees. Mol. Biol. Evol. 4: 406-425

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X

windows interface: flexible strategies for multiple sequence alignment aided by quality

analysis tools. Nucleic Acids Res. 25: 4876-4882

Tsunoyama K, Gojobori T (1998) Evolution of nicotinic acetylcholine receptor subunits.

Mol. Biol. Evol. 15: 518-527

Velazquez L, Mogensen KE, Barbieri G, Fellous M, Uze G, Pellegrini S (1995) Distinct

domains of the protein tyrosine kinase tyk2 required for binding of interferon-alpha/beta and

for signal transduction. J. Biol. Chem. 270: 3327-3334

Wang Y, Gu X (2001) Functional divergence in the caspase gene family and altered functional

constraints: statistical analysis and prediction. Genetics 158: 1311-1320

Wilks AF, Harpur AG, Kurban RR, Ralph SJ, Zurcher G, Ziemiecki A (1991) Two novel

protein-tyrosine kinases, each with a second phosphotransferase-related catalytic domain,

define a new class of protein kinase. Mol. Cell. Biol. 11: 2057-2065

Zhou YJ, Hanson EP, Chen YQ, Magnuson K, Chen M, Swann PG, Wange RL, Changelian

PS, O'Shea JJ (1997) Distinct tyrosine phosphorylation sites in Jak3 kinase domain positively

and negatively regulate its enzymatic activity. Proc. Natl. Acad. Sci. USA 94: 13850-13855

CHAPTER IV: EVOLUTIONARY ANALYSIS OF FUNCTIONAL DIVERGENCE IN

TGF-β SIGNALING PATHWAY

A paper published in Information Sciences¹

Jianying Gu², Xun Gu²,³

Abstract

TGF-β cytokines and their interacting receptors form an intricate signaling network and play important roles in cell differentiation and development. Members of the gene families involved in this complicated pathway are usually conserved in sequence but distinct in temporal and tissue-specific expression. Starting from an evolutionary perspective, we have statistically tested for functional divergence in the major pathway components and identified the critical amino acid residues responsible for functional divergence. Our results show that altered functional constraint is a common pattern in the evolution of the TGF-β pathway. Moreover, we have found a correlation between structural divergence and functional divergence in the extracellular ligand-binding domain of Type II receptors. This study helps to clarify the complex intrinsic mechanisms of signaling.

1. Introduction

The transforming growth factor β (TGF-β) family plays indispensable roles in tumor suppression, cell proliferation, cell differentiation, tissue morphogenesis, lineage determination, cell motility and apoptosis [1]. TGF-β and related factors regulate gene expression by bringing together Type I and Type II serine/threonine protein kinase receptors. Interestingly, TGF-β and its downstream receptors are expressed in a well-coordinated temporal and spatial pattern. This typical ligand-receptor signaling mode provides a basis for understanding the complex regulatory networks of higher animals. Notably, considerable sequence similarity has been found across species in the gene

1 Reprinted with permission of Information Sciences, 2002, 145(3-4), 195-204.
2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University.
3 Author for correspondence.

families involved in the TGF-β pathway, which implies that a certain level of functional constraint has been maintained during evolution [2]. Thus, an interesting question arises: what differences underlie these similar member genes and produce the differential expression pattern of each component in this signaling array? In this study, we perform a comprehensive statistical analysis to investigate such functional divergence among conserved families of TGF-β pathway components from an evolutionary perspective.

2. Methods

2.1. Sequences

Amino acid sequences of the TGF-β family, the Type I receptor family and the Type II receptor family were used as queries to search the non-redundant protein database at NCBI with PSI-BLAST. Partial (<80% of full length) and hypothetical sequences were removed from further analysis. The accession numbers are available upon request.
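The filtering rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the script used in the study: the `ProteinHit` record, the `reference_length` argument (the length taken as "full length"), and the keyword match on "hypothetical" in the description line are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    """One database hit; the field names are hypothetical, chosen for illustration."""
    accession: str
    description: str
    sequence: str


def keep_hit(hit: ProteinHit, reference_length: int, min_fraction: float = 0.8) -> bool:
    """Return True if the hit is neither partial nor annotated as hypothetical."""
    # Partial: shorter than 80% of the assumed full-length reference.
    is_partial = len(hit.sequence) < min_fraction * reference_length
    # Hypothetical: assumed to be flagged by a keyword in the description line.
    is_hypothetical = "hypothetical" in hit.description.lower()
    return not (is_partial or is_hypothetical)


def filter_hits(hits, reference_length):
    """Keep only the hits that pass both criteria, preserving their order."""
    return [h for h in hits if keep_hit(h, reference_length)]
```

In practice the reference length would be chosen per family (for example, the length of a well-characterized full-length member); that choice is not specified in the text.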

2.2. Multiple Alignments and Phylogenetic Analysis

The multiple alignments were obtained with Clustal X. Phylogenetic trees were inferred by the neighbor-joining (NJ) method using MEGA2.0 (http://www.megasoftware.net/). PAUP4.0 and PHYLIP were used to examine the robustness of the tree topology.

2.3. Statistical Analysis of Functional Divergence - DIVERGE software

It is well recognized that gene duplication is one of the major sources of functional innovation [3, 4]. Functional divergence here refers to a particular type of functional innovation after gene duplication that results in altered functional constraints (i.e., different evolutionary rates) between two duplicate genes [5, 6].

Under a two-state probabilistic model, for each duplicate cluster, the evolutionary rate at some sites may differ from that of the ancestor; these sites are called F1-sites (functional divergence-related), and the others are called F0-sites. A rate difference between duplicates is observed only if an F1 site is present in at least one cluster, a so-called S1 status. We define the coefficient of type I functional divergence (θ) as the probability of a site being involved in functional divergence in at least one gene cluster, i.e., θ = P(S1). In contrast, a site being F0 in both clusters holds an S0 status (i.e., the evolutionary rate of each duplicate remains the same

as in the ancestral state). A maximum likelihood estimate of θ can be obtained when the phylogeny of the gene family is given [6].

If θ is significantly greater than 0, functional divergence may have occurred after the gene duplication. To identify the critical amino acid residues responsible for functional divergence, a posterior analysis is used to score the probability that each site is related to functional divergence, i.e., P(S1|X), where X is the observed amino acid configuration.
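The posterior score follows from Bayes' rule once the two site likelihoods are available. The sketch below only illustrates that final step; it assumes the likelihoods P(X|S0) and P(X|S1) have already been computed from the phylogeny-based model of refs. [5, 6], which is not reproduced here. The function and argument names are hypothetical.

```python
def posterior_s1(theta: float, lik_s0: float, lik_s1: float) -> float:
    """Posterior probability P(S1 | X) by Bayes' rule.

    theta  : prior probability of the S1 status, theta = P(S1)
    lik_s0 : P(X | S0), likelihood of the site pattern under S0 (assumed given)
    lik_s1 : P(X | S1), likelihood of the site pattern under S1 (assumed given)
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    numerator = theta * lik_s1
    # Total probability of the pattern: mixture of the S1 and S0 components.
    denominator = numerator + (1.0 - theta) * lik_s0
    if denominator == 0.0:
        raise ValueError("site pattern has zero probability under both states")
    return numerator / denominator
```

When the two likelihoods are equal, the data carry no information about the site and the posterior equals the prior θ; a site scores high only when its pattern is much better explained under S1.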

The detailed statistical framework is discussed in Gu [5, 6] and Wang & Gu [7]. The computational package DIVERGE can be downloaded at http://xgu1.zool.iastate.edu/.

2.4. Dating gene duplication events

Linearized NJ trees were used to convert the (average) distance to the geological time scale. To date the gene duplication events, several speciation time points were used for calibration: primate-rodent (80 million years ago, mya), mammal-bird (310 mya), mammal-amphibian (350 mya), tetrapod-teleost (430 mya), and vertebrate-Drosophila (830 mya) [8].
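The conversion from distance to time can be illustrated with a small sketch. It assumes a constant rate along the linearized tree and fits that rate by least squares through the origin; the text does not say how the calibration points were combined, so this fitting choice is an assumption, and any distances passed in are hypothetical inputs rather than values from this study.

```python
def fit_rate(calibrations):
    """Least-squares slope through the origin for (time_mya, distance) pairs.

    Returns the distance accumulated per million years under a clock assumption.
    """
    numerator = sum(t * d for t, d in calibrations)
    denominator = sum(t * t for t, _ in calibrations)
    return numerator / denominator


def date_node(distance, rate):
    """Convert the distance at a duplication node into a time estimate (mya)."""
    return distance / rate


# Calibration times quoted in the text (mya); the distance for each split
# would be read from the linearized tree and is not given here.
CALIBRATION_TIMES_MYA = {
    "primate-rodent": 80,
    "mammal-bird": 310,
    "mammal-amphibian": 350,
    "tetrapod-teleost": 430,
    "vertebrate-Drosophila": 830,
}
```

A caller would pair each calibration time with the corresponding node distance, pass the pairs to `fit_rate`, and then apply `date_node` to the distance measured at the duplication node of interest.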

3. Results and Discussion

3.1. A scenario of the TGF-β pathway: Ligand-Receptor Signaling

The most striking feature of the TGF-β pathway is its fine modulation of a series of different factors through a tight ligand-receptor interaction. The ligands, TGF-β family members, bind to receptors (Type II and then Type I) that possess intrinsic serine/threonine kinase activity. Upon ligand binding, the receptors form an active heterotetrameric complex that initiates downstream intracellular signaling [1].

3.2. Functional Divergence in Ligands

A variety of TGF-β ligands with distinct signals have been identified [2]. They belong to three major families, BMP/GDF (bone morphogenetic proteins/growth and differentiation factors), TGF-β, and Activin/Inhibin, each of which has undergone substantial gene duplication. Table 1 shows the ML estimates of θ (the coefficient of functional divergence) for pairwise comparisons of the major member genes. Among the 10 estimates, all except one are significantly greater than 0, suggesting that altered functional constraint after gene

Table 1. Pairwise coefficients of functional divergence (θ) in the TGF-β superfamily.

Subfamily   Cluster 1     Cluster 2       θ ± s.e.
TGF-β       TGF-β1        TGF-β2          0.338 ± 0.123
            TGF-β1        TGF-β3          0.274 ± 0.098
            TGF-β2        TGF-β3          0.619 ± 0.126
Activin     Activin βA    Activin βB      0.164 ± 0.088
            Activin βA    Activin βC/E    0.287 ± 0.138
            Activin βB    Activin βC/E    0.115 ± 0.105*
BMP         BMP2          BMP4            0.250 ± 0.066
            BMP2/4        BMP7            0.482 ± 0.074
            BMP2/4        BMP9/10         0.330 ± 0.069
            BMP7          BMP9/10         0.513 ± 0.111

* We fail to reject the null hypothesis θ = 0.

duplication is a general pattern for achieving novel functionality. In particular, we have surveyed the pattern of functional divergence within BMP2/4 and the TGF-βs.
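As a rough check on the "all except one" statement, the estimates and standard errors in Table 1 can be compared with a simple one-sided normal approximation. This is only a sketch: the test actually used in the study follows refs. [5, 6] and may differ (for example, a likelihood ratio test), and the 0.05 level and the function names are assumptions made here.

```python
import math

# (cluster 1, cluster 2, theta, standard error), copied from Table 1.
TABLE_1 = [
    ("TGF-beta1", "TGF-beta2", 0.338, 0.123),
    ("TGF-beta1", "TGF-beta3", 0.274, 0.098),
    ("TGF-beta2", "TGF-beta3", 0.619, 0.126),
    ("Activin-betaA", "Activin-betaB", 0.164, 0.088),
    ("Activin-betaA", "Activin-betaC/E", 0.287, 0.138),
    ("Activin-betaB", "Activin-betaC/E", 0.115, 0.105),
    ("BMP2", "BMP4", 0.250, 0.066),
    ("BMP2/4", "BMP7", 0.482, 0.074),
    ("BMP2/4", "BMP9/10", 0.330, 0.069),
    ("BMP7", "BMP9/10", 0.513, 0.111),
]


def one_sided_p(theta, se):
    """Approximate P(Z >= theta / se) for a standard normal Z."""
    z = theta / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def not_significant(rows, alpha=0.05):
    """Return the cluster pairs whose theta is not significantly above 0."""
    return [(a, b) for a, b, theta, se in rows if one_sided_p(theta, se) >= alpha]
```

Under this approximation only the Activin βB versus βC/E comparison fails to reach the 0.05 level, which agrees with the asterisk in Table 1.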

3.2.1. BMP2/4

Both BMP2 and BMP4 are important factors for bone morphogenesis in vertebrates [9]. Tissue-specific expression has been found: in the early limb bud, BMP-4 is found in the mesenchyme and in the ectodermal ridge at the distal end, whereas BMP-2 is found only in the ridge and not in the mesenchyme [10]. The two closely related clades in the NJ tree suggest that BMP2 and BMP4 were likely generated by a gene duplication event in the very early vertebrate stage, i.e., after the emergence of amphioxus and before the divergence of tetrapods and bony fishes, approximately 588 mya (Fig. 1A). Extensive studies have accumulated evidence that this relatively short time span (430-700 mya) was important for the emergence of major molecular pathways, such as signal transduction, and of tissue-specific isoforms in vertebrates [11].

The overall functional divergence between BMP2 and BMP4 is indexed by the coefficient of functional divergence θ (0.250 ± 0.066). We statistically rejected the null hypothesis H0: θ = 0, which indicates that altered functional constraints occurred at least at some amino acid residues. Using the posterior analysis, we predicted that, among a total of 338 aligned sites, 26 sites are critical for the functional divergence, with the cut-off

[Figure 1. Panel (A): neighbor-joining tree of vertebrate BMP-2 and BMP-4 sequences (human, rabbit, deer, mouse, rat, chicken, clawed frog, western frog, zebrafish) with amphioxus BMP-2/4 and fly DPP. Panel (B): residues at the predicted sites in each species, grouped into Category I and Category II.]

Fig. 1 Phylogeny and functional divergence in BMP2/4. (A) Phylogenetic tree of BMP2/4 inferred by the neighbor-joining method. Bootstrap values greater than 50% are shown. (B) Sites predicted to be critical for functional divergence between BMP2 and BMP4.

value P(S1|X) = 0.51 (when these sites are removed from the alignment, θ drops to 0). They can be grouped into two categories. Category I sites are highly conserved in BMP2 but show considerable variation in BMP4; Category II sites, in contrast, are almost invariant in BMP4 but variable in BMP2 (Fig. 1B). Both categories show the typical pattern of functional divergence: altered evolutionary rates after gene duplication, in which one duplicate copy retains the conserved function and bears most of the selective pressure, while the other copy is free to accumulate amino acid changes that may lead to functional novelty.
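The two categories can be expressed as a simple rule on alignment columns. The sketch below is illustrative only: it treats a column as conserved when it contains at most one distinct residue, whereas the text allows "almost invariant" columns, so the `max_conserved` threshold is an assumption that a user would need to tune. The function names and the toy columns in the usage note are hypothetical.

```python
def n_distinct(column: str) -> int:
    """Number of distinct residues in an alignment column (gap characters ignored)."""
    return len(set(column) - {"-"})


def classify_site(bmp2_column: str, bmp4_column: str, max_conserved: int = 1) -> str:
    """Assign a predicted site to Category I, Category II, or neither.

    Category I : conserved in the BMP2 cluster, variable in the BMP4 cluster.
    Category II: conserved in the BMP4 cluster, variable in the BMP2 cluster.
    """
    conserved_in_bmp2 = n_distinct(bmp2_column) <= max_conserved
    conserved_in_bmp4 = n_distinct(bmp4_column) <= max_conserved
    if conserved_in_bmp2 and not conserved_in_bmp4:
        return "Category I"
    if conserved_in_bmp4 and not conserved_in_bmp2:
        return "Category II"
    return "unclassified"
```

For example, a toy column that reads `LLLLLLLL` across the BMP2 sequences and `AAAAATVA` across the BMP4 sequences would fall into Category I, while the reverse pattern would fall into Category II.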

3.2.2. TGF-β

TGF-β, whose major functional roles are in cell differentiation and tissue repair, comprises three endogenous growth factors: β1, β2, and β3. Despite their sequence and structural similarities [12, 13], these three isoforms show apparent differences in activity, such as ligand binding affinity [14].

[Figure 2. Panel (A): neighbor-joining tree of the TGFβ-1, TGFβ-2 and TGFβ-3 clusters. Panel (B): three site-specific profiles (TGFb 1 vs. 2, TGFb 1 vs. 3, TGFb 2 vs. 3); x-axis: alignment position.]

Fig. 2 Phylogeny and functional divergence in the TGF-β ligand family. (A) Neighbor-joining tree of the TGF-β ligands. Bootstrap values greater than 50% are shown. (B) Site-specific profiles for pairwise comparisons of the three TGF-β isoforms. P(S1|X) is the posterior probability that a site is related to functional divergence.

Phylogenetic analysis shows that the three TGF-β isoforms are monophyletic (Fig. 2A). Two rounds of gene duplication may have occurred to give rise to the three paralogous clusters after the divergence of tetrapods and fishes: the first round generated β1 and the common ancestor of β2 and β3, and the second round then produced β2 and β3. This suggests that the TGF-β ligands may belong to a large group of cytokines that emerged after large-scale or genome-wide duplications [11].

The functional divergence analysis reveals that the coefficients of functional divergence (θ) for the pairwise comparisons of TGF-βs are significant, suggesting that detectable functional alteration occurred in the lineages leading to them (Table 1). Furthermore, the posterior analysis indicates strong variation in the contribution of each site to the overall

functional divergence. Figure 2B presents the respective site-specific profiles, which share a common pattern. The baseline posterior probability that a site is related to functional divergence is about 0.2-0.3, implying that most amino acid residues do not have significant effects on long-term functional divergence, while several peaks show that only a small portion of sites is mainly responsible for the functional diversification of two paralogs. This preliminary targeting strategy, along with other available computational approaches, provides a starting point for future experimental detection of 'hot spots' in tissue-specific isoforms [15].

3.3. Functional Divergence in Receptors

Two types of receptors, Type II and Type I, have been identified that bind specific TGF-β ligands. They belong to a large serine/threonine kinase superfamily [1].

3.3.1. Type II receptor: structural divergence → functional divergence?

The Type II receptor family is divided into five subfamilies: Activin Receptor IIA (ActR-IIA), ActR-IIB, TGFβ-IIR, BMPR-II, and AMHR. The existence of invertebrate homologs (Drosophila, C. elegans, and sponge) suggests an ancient origin of this family: ActR-IIA and ActR-IIB form a closely related cluster, while the others might have diverged earlier (Fig. 3).

It has been recognized that structural differences may be functionally significant [16]. Our recent study of apoptotic caspases has shown a structural basis of functional divergence [7]. Analogously, we found that the functional divergence between two Type II receptor subfamilies, ActR-IIA and AMHR, may be directly associated with their structural alteration. ActR-IIA binds specifically to Activin, while AMHR recognizes only MIS. The 3-D structure of ActR-IIA has been solved and its geometrical differences from AMHR have been discussed [17]. We estimated the coefficient of functional divergence between them (θ = 0.709 ± 0.232). Moreover, 33 sites were predicted to be critical for their functional divergence and were then mapped onto the 3-D structure of ActR-IIA. Figure 4 shows the alignment of the extracellular ligand-binding domain. Notice that 13 predicted residues lie in important secondary structure elements (α helices, β sheets or strands). Similar to the pattern in the BMP2/4 subfamilies, these sites are conserved in only one cluster, either ActR-IIA or AMHR, and variable in the other. Although the intrinsic mechanism is unclear, the sequence

[Figure 3. Neighbor-joining tree of Type II receptors, with clades labeled ActR-IIA, ActR-IIB, TGFβR-II, BMPR-II and AMHR, plus invertebrate homologs (fly, C. elegans, sponge).]

Fig. 3 Phylogenetic tree of Type II receptors.

alteration may result in structural changes and, in turn, functional changes. Interestingly, the sites believed to play an important role in ligand recognition, sites 106, 108 and 109 (as shown in Fig. 4), show a noticeable pattern. In ActR-IIA these positions hold highly conserved Asp, Asp and Lys, while in AMHR they hold Arg, Lys and Tyr. The amino acid configurations are invariant within each cluster but differ markedly in biochemical properties between clusters, which suggests that these sites might determine the ligand specificity of the different receptors.

3.3.2. Type I receptor

Based on the identity of their physiological ligands, Type I receptors are divided into 7 major subfamilies: TβR-I, ActR-Iβ, ALK7, BMPR-IB, BMPR-IA, ALK1, and ActR-I. The NJ tree shows that they fall into separate clades with high bootstrap support (>95%) (Fig. 5). Based on kinase domain similarity and signaling activities, they can be clustered into three groups.

[Figure 4. Alignment of the extracellular ligand-binding domains of ActR-IIA (human, rat, mouse, bovine, sheep, chicken, frog) and AMHR (human, mouse, rat, rabbit), annotated with secondary structure elements and the site numbers of the predicted residues.]

Fig. 4 Sequence alignment of the extracellular ligand-binding domain of ActR-IIA and AMHR. Secondary structure elements refer to the sequence of human ActR-IIA. Predicted critical residues for functional divergence are marked (site numbers correspond to positions in the sequence alignment of the full-length genes).

One group includes TβR-I, ActR-Iβ and ALK7; another includes BMPR-IB and BMPR-IA; and the third includes ALK1 and ActR-I.

The functional divergence analysis of pairwise comparisons of Type I receptors showed statistically non-zero coefficients of functional divergence (θ) (data not shown). This result, analogous to those for the ligands and Type II receptors, suggests that altered functional constraint after gene duplication is a general pattern at different levels of the signaling pathway.

[Figure 5. Neighbor-joining tree of Type I receptors, with clades labeled ActR-Iβ, ALK7, BMPR-IB, BMPR-IA, ALK1 and ActR-I, plus fly and C. elegans homologs.]

Fig. 5 NJ tree of Type I TGF-β receptors.

4. Conclusions

In this paper, we have explored functional divergence in TGF-β signaling. Our findings are: (1) altered functional constraint is a common pattern in both ligand and receptor evolution; (2) gene duplication might be an evolutionary force for functional divergence, and the early vertebrate stage might have been an important time for the development of tissue specificity and signaling pathways; (3) there may be a correlation between structural and functional divergence.

Acknowledgement

This study was supported by NIH grant R01 GM62118 (to X. Gu).

References

1. J. Massague, TGF-beta signal transduction, Annu Rev Biochem. 67 (1998) 753-791.

2. S.J. Newfeld, R.G. Wisotzkey, S. Kumar, Molecular evolution of a developmental

pathway: phylogenetic analyses of transforming growth factor-beta family ligands, receptors

and Smad signal transducers, Genetics 152 (1999) 783-795.

3. W.-H. Li, Evolution of duplicate genes and pseudogenes, in: M. Nei and R.K. Koehn

(Eds.) Evolution of Genes and Proteins. Sinauer Associates, Sunderland, MA, 1983, pp. 14-37.

4. K. Wolfe, Yesterday's polyploids and the mystery of diploidization, Nat Rev Genet. 2

(2001) 333-341.

5. X. Gu, Statistical methods for testing functional divergence after gene duplication. Mol

Biol Evol. 16 (1999) 1664-1674.

6. X. Gu, Maximum likelihood approach for gene family evolution under functional

divergence. Mol Biol Evol. 18 (2001) 453-464.

7. Y. Wang, X. Gu, Functional divergence in the caspase gene family and altered functional

constraints: statistical analysis and prediction, Genetics 158 (2001) 1311-1320.

8. S. Kumar, S.B. Hedges, A molecular timescale for vertebrate evolution, Nature 392

(1998) 917-920.

9. R.M. Harland, The transforming growth factor beta family and induction of the vertebrate

mesoderm: bone morphogenetic proteins are ventral inducers, Proc Natl Acad Sci USA. 91

(1994) 10255-10259.

10. J.M. Wozney, et al. Novel regulators of bone formation: Molecular clones and activities,

Science 242 (1988) 1528-1534.

11. N. Iwabe, K. Kuma, T. Miyata, Evolution of gene families and relationship with

organismal evolution: rapid divergence of tissue-specific genes in the early evolution of

chordates, Mol Biol Evol. 13 (1996) 483-493.

12. A.P. Hinck, et al. Transforming growth factor beta 1: three-dimensional structure in

solution and comparison with the X-ray structure of transforming growth factor beta 2,

Biochemistry 35 (1996) 8517-8534.

13. P.R. Mittl, et al. The crystal structure of TGF-beta 3 and comparison to TGF-beta 2:

implications for receptor binding, Protein Sci. 5 (1996) 1261-1271.

14. J.C. Jennings, Comparison of the biological actions of TGF beta-1 and TGF beta-2:

differential activity in endothelial cells, J Cell Physiol. 137 (1988) 167-172.

15. E.A. Gaucher, M.M. Miyamoto, S.A. Benner, Function-structure analysis of proteins

using covarion-based evolutionary approaches: Elongation factors, Proc Natl Acad Sci USA.

98 (2001) 548-552.

16. G.B. Golding, A.M. Dean, The structural basis of molecular adaptation. Mol Biol Evol.

15 (1998) 355-369.

17. J. Greenwald, et al. Three-finger toxin fold for the extracellular ligand-binding domain

of the type II activin receptor serine kinase, Nat Struct Biol. 6 (1999) 18-22.

CHAPTER V: EVOLUTIONARY PATTERN OF PROTEIN KINASE EXPANSION

IN HUMAN GENOME

A paper to be submitted¹

Jianying Gu², Xun Gu²,³

Abstract

Kinases catalyze reversible protein phosphorylation and play a central role in fundamental biological functions and in advanced cell-cell co-regulation. A comprehensive set of kinases (the kinome), characterized by a conserved catalytic kinase domain, has recently been identified in the human genome through exhaustive bioinformatic searches. We conducted phylogenetic analyses of the major kinase gene families. The age distribution of kinase gene families shows that ancient, continuous domain shufflings (or duplications) during the period from the early stage of eukaryotes to metazoan evolution built up the major kinase-related, animal-specific signal transduction network. The subsequent large-scale gene duplications in the early stage of vertebrates contributed to tissue-specific signal transduction. Moreover, several kinase pseudogenes are likely to have been generated very recently through either segmental duplication or retrotransposition.

Introduction

Reversible protein phosphorylation plays a central role in regulating many functions of eukaryotes, such as cell cycle control, gene transcription and protein translation, apoptosis, cell differentiation, cell-cell communication, and the mediation of complex interactions with the external environment. Moreover, aberrant protein phosphorylation is commonly associated with cancer and other human diseases. A comprehensive knowledge of the key actors that regulate these functions can provide a basis for better understanding the mechanisms underlying cell life and for novel drug discovery strategies.

1 Manuscript to be submitted.
2 Graduate student and Associate Professor, respectively, Department of Zoology and Genetics, Center for Bioinformatics and Biological Statistics, Iowa State University.
3 Author for correspondence.

The completion of the human genome sequence provides an opportunity to decipher the molecular nature of the signal transduction machinery in humans. A total of 478 protein kinases have been identified in the human genome, constituting about 1.7% of all human genes (Manning et al., 2002). All known bona fide protein kinases share a conserved catalytic kinase domain of ~270 amino acids and form a superfamily, the eukaryotic protein kinases (ePKs). Two distinct major families, protein tyrosine kinases (PTKs) and protein serine/threonine kinases (PSKs), are defined based on comparisons of kinase domains. Another set of 40 atypical protein kinases (aPKs) has also been identified in the human genome, with demonstrated kinase activity despite the lack of sequence similarity to the ePK domain. Together, the 478 ePKs and 40 aPKs make up the 518 human protein kinases analyzed in this study.

It has been widely accepted that duplications of individual genes, chromosome segments, or whole genomes provide the primary material for the origin of functional novelties (Ohno, 1970). Recently, we have demonstrated that the age distribution of human gene families can be described as an ancient component, a peak in the early stage of vertebrates (wave II), and a recent peak during the mammalian radiation (wave I) (Gu et al., 2002). We further speculated that the major signal transduction pathways emerged through those duplication events. It is therefore intriguing to study the functional divergence and evolutionary patterns of the entire set of kinases (the kinome, as defined by Manning et al., 2002). Here we report the age distribution of human kinases and pseudogenes and show a complex history of domain shufflings, gene duplications, and gene losses in kinome expansion.

Data and Methods

Data collection

In order to identify the homologues of the 518 human protein kinases in vertebrates such as other mammals, birds, frogs, and fishes, we searched the HOVERGEN protein database (Homologous Vertebrate Genes Database, http://pbil.univ-lyon1.fr/databases/hovergen.html, release 42, updated on 26 April 2002) using BLASTP, followed by a pairwise BLAST to remove redundant sequences. Invertebrate homologous sequences were obtained from http://www.kinase.com and used as outgroups for each gene family. The domain organization of each kinase gene family was revealed by searching against the Pfam domain database (http://pfam.wustl.edu/).

The orthologous mouse DNA sequences for human kinase pseudogenes were obtained by BLASTN searches against the Ensembl mouse genome (http://www.ensembl.org). Three MARK pseudogenes (MARKps10, 13, 20) were not analyzed due to their short length (<200 nucleotides). With a cutoff E value of 1e-20, two pseudogenes, SAKps and SGK050ps, had no mouse hits and were excluded from further analysis.
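The two exclusion criteria for pseudogenes (minimum length and a BLASTN hit at or below the E-value cutoff) can be written as a small filter. This is a hedged sketch, not the pipeline used in the study: the `Pseudogene` record and its fields are hypothetical, and any lengths or E-values supplied to it are the caller's own inputs rather than values reported here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pseudogene:
    """Hypothetical record for one human kinase pseudogene."""
    name: str
    length_nt: int
    best_mouse_evalue: Optional[float]  # None when BLASTN returns no mouse hit


def passes_filters(p: Pseudogene, min_length: int = 200, max_evalue: float = 1e-20) -> bool:
    """Keep pseudogenes that are long enough and have a mouse hit within the cutoff."""
    if p.length_nt < min_length:
        return False  # too short to analyze
    if p.best_mouse_evalue is None:
        return False  # no mouse hit at all
    return p.best_mouse_evalue <= max_evalue


def select_for_analysis(candidates):
    """Return the names of the pseudogenes that pass both criteria."""
    return [p.name for p in candidates if passes_filters(p)]
```

The defaults mirror the thresholds quoted in the text (200 nucleotides and an E value of 1e-20); whether a hit exactly at the cutoff should be kept is not stated, and the sketch keeps it.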

Multiple alignments and phylogenetic inference

Multiple sequence alignment of each gene family was obtained by Clustal X

(Thompson et al., 1997), followed by manual editing. Phylogenetic trees were inferred by the

Neighbor-joining method (Saitou and Nei, 1987) using MEGA2.0 (http://megasoftware.net).

Maximum Parsimony (as implemented in PAUP4.0), and Maximum Likelihood (as

implemented in PHYLIP) were used to examine whether the inferred phylogeny is sensitive

to the tree-making method. Bootstrap resampling with 1,000 pseudo-replicates was

carried out to assess support for each individual branch.
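
The bootstrap step described above can be sketched as follows. This is a minimal illustration, not the MEGA, PAUP, or PHYLIP implementation: the function names, the toy alignment, and the `ab_closest` criterion (which stands in for re-inferring a tree from each pseudo-replicate) are all assumptions made for the example.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def bootstrap_replicate(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, grouping_holds, n_replicates=1000, seed=0):
    """Percentage of pseudo-replicates in which grouping_holds(...) is true."""
    rng = random.Random(seed)
    hits = sum(bool(grouping_holds(bootstrap_replicate(alignment, rng)))
               for _ in range(n_replicates))
    return 100.0 * hits / n_replicates


def ab_closest(aln):
    """Hypothetical stand-in for tree inference: A groups with B, not C."""
    return p_distance(aln["A"], aln["B"]) < p_distance(aln["A"], aln["C"])


# Toy alignment (invented data, for illustration only).
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "TCGAACCTTG"}
support = bootstrap_support(toy, ab_closest)
```

In a real analysis the `grouping_holds` callback would rebuild the tree from each pseudo-replicate and check whether the branch of interest is recovered; the value returned is the bootstrap percentage for that branch.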

Inference of duplication events

To identify the underlying evolutionary mechanism (i.e., domain or gene duplication)

of each duplication event, we have examined the domain structure of each human protein

kinase as follows. First, homologous kinase sequences in other vertebrate species (e.g., other

mammals, birds, frogs, and fish) were identified by blasting the 518 human kinase protein

sequences against the HOVERGEN protein database. A total of 221 HOVERGEN families

were retrieved and further analyzed. Second, since the HOVERGEN family is defined solely

based on the overall sequence similarity and may not reflect functionality, we adopted the

hierarchical kinome classification scheme proposed by Manning et al. (2002) which

incorporates sequence inference, structure comparison, and known biological functions.

Third, for multi-domain kinase genes, in addition to the kinase domain, additional domain

structure was revealed by using the HMM and profile-searching in the Pfam database

(http://pfam.wustl.edu/). Finally, combining the information from all the previous analyses,

we classified kinases with similar multi-domain structure into 256 gene families. Here, we

define a duplication event that occurred within a gene family as a gene duplication (Fig. 1),

and one that occurred among gene families as a domain duplication (Fig. 2). The remaining

single-domain kinases (e.g., the CDK family), which lack a hierarchical domain structure, were

analyzed separately.
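
The classification rule above can be written down as a small function. This is only a sketch of the stated definition: the function name, the family labels, and the clade memberships are hypothetical and are not taken from the thesis data.

```python
def classify_duplication(left_genes, right_genes, family_of):
    """Classify a duplication node from the genes in its two child clades.

    family_of maps gene name -> multi-domain family label, or None for
    single-domain kinases, which the text analyzes separately.
    """
    families = {family_of[g] for g in list(left_genes) + list(right_genes)}
    if None in families:
        return "unclassified (single-domain)"
    if len(families) == 1:
        return "gene duplication"
    return "domain duplication"


# Hypothetical family labels, loosely following Figs. 1 and 2.
family_of = {
    "AKT1": "AKT", "AKT2": "AKT", "AKT3": "AKT",
    "MERK": "AXL", "AXL": "AXL", "TYRO3": "AXL",
    "ROR1": "ROR", "ROR2": "ROR",
    "CDK2": None, "CDK3": None,
}

within = classify_duplication(["AKT1"], ["AKT2", "AKT3"], family_of)
between = classify_duplication(["MERK", "AXL", "TYRO3"], ["ROR1", "ROR2"], family_of)
single = classify_duplication(["CDK2"], ["CDK3"], family_of)
```

Here `within` is a gene duplication, `between` a domain duplication, and `single` falls into the group that cannot be assigned to either class.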

[Figure 1 image: (a) neighbor-joining tree and (b) linearized tree of the AKT1, AKT2, and AKT3 genes from human, mouse, rat, bovine, chicken, frog, and zebrafish, with fly and worm sequences; scale bars in units of Poisson distance.]

Fig. 1. Molecular evolutionary analysis of the Rac beta serine/threonine protein kinase

(AKT) gene family. In vertebrates, there are three member genes: AKT1, AKT2, and AKT3.

(a) The phylogenetic tree inferred by the neighbor-joining method with Poisson distance.

Bootstrap values of more than 50% are shown. T1 and T2 are the time points of the

first and second gene duplications, respectively. (b) The linearized neighbor-joining tree

used in converting evolutionary distance to the (relative) molecular time scale. By the

nearest-neighbor clock, we estimated T1 = 478 Mya and T2 = 522 Mya.


[Figure 2 image: tree of MERK, UFO (AXL), and TYRO3 (AXL family; kinase, FN III-like, and Ig-like domains) and of ROR1 and ROR2 (ROR family; kinase, Kringle, cysteine-rich, and Ig-like domains), with bootstrap values; scale bar 0.2.]

Fig. 2. Phylogenetic relationship and the domain structures of the protein tyrosine kinase AXL

and ROR gene families.

A linearized neighbor-joining tree provides an efficient way to convert the (average) Poisson

distance of protein sequences to the molecular time scale. Here the primate-rodent (80 Mya),

mammal-bird (310 Mya), mammal-amphibian (350 Mya), tetrapod-teleost (430 Mya), and

vertebrate-Drosophila (830 Mya) splits were used as calibration points in our study (Wang and Gu,

2001). Two steps were taken to adopt a nearest-neighbor clock (Gu et al., 2002). First,

under the tree, the phylogenetic interval of a duplicate event is determined by the nearest

speciation event(s). It can be two-sided (before, after), e.g., (teleost-tetrapod,

Drosophila-vertebrate), or one-sided, e.g., (teleost-tetrapod, ∞) or (M, teleost-tetrapod) for before or after

the speciation, respectively, where M (the mammalian radiation) is the lowest bound. Note

that the phylogenetic interval is robust against differential selective constraints among genes

and lineages. Second, the age of duplication event is estimated based on the nearest

speciation event(s). In practice, if the "before"-side is used, an average over two lineages is

taken when they have the same nearest speciation. If the phylogenetic interval is two-sided,

one may use two points for dating (Fig. 1).
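
The dating step can be sketched as below, using the calibration points quoted above. The node heights are hypothetical, and averaging the two single-point estimates for a two-sided interval is an assumption of this sketch; the text says only that two points may be used.

```python
# Calibration points quoted in the text (Mya).
CALIBRATION_MYA = {
    "primate-rodent": 80,
    "mammal-bird": 310,
    "mammal-amphibian": 350,
    "tetrapod-teleost": 430,
    "vertebrate-Drosophila": 830,
}


def date_duplication(dup_height, anchors):
    """Date a duplication node on a linearized tree.

    dup_height: height (distance from the tips) of the duplication node.
    anchors: list of (split_name, node_height) for the nearest speciation
             node(s); one entry for a one-sided interval, two for two-sided.
    Each anchor gives time = calibration * dup_height / node_height; the
    estimates are averaged (an assumption of this sketch).
    """
    estimates = [CALIBRATION_MYA[name] * dup_height / height
                 for name, height in anchors]
    return sum(estimates) / len(estimates)


# Hypothetical node heights, chosen only for illustration.
one_sided = date_duplication(0.12, [("tetrapod-teleost", 0.10)])
two_sided = date_duplication(0.12, [("tetrapod-teleost", 0.10),
                                    ("vertebrate-Drosophila", 0.20)])
```

Because the tree is linearized, node height is proportional to time, so each estimate is a simple rescaling of the calibration date by the ratio of node heights.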

To date the ancient duplication events that occurred in early metazoan evolution, a

hierarchical nearest-neighbor clock is applied. The dates of the nearest duplication events

inferred by the previous nearest-neighbor clock, or the nearest speciation points (chosen based on

their applicability), were used to infer the time points of the ancient duplication events.

Analysis of kinase pseudogenes

There are two major types of pseudogenes: processed pseudogenes and duplicated

pseudogenes. Processed pseudogenes result from retrotransposition, that is, reverse

transcription of an mRNA transcript followed by integration into genomic DNA. Duplicated

pseudogenes arise from duplication of genomic DNA, after which one duplicate gene copy

becomes a pseudogene. A neighbor-joining tree was inferred from the multiple alignment of

the DNA sequences of each kinase pseudogene, its parent human kinase gene, the homologous mouse

gene, and homologs in other vertebrate species (Fig. 3). A branch length in the NJ tree

represents the number of nucleotide substitutions per site (under the Kimura two-parameter model)

along that lineage. Owing to the relaxation of selective constraint, fast evolution occurred along

the pseudogene lineage. Thus we estimate the time of the events to generate pseudogene

assuming a local molecular clock between the functional human/mouse kinase genes.
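
The two ingredients of this procedure, the Kimura two-parameter distance and the local-clock conversion, can be sketched as follows. The function names are hypothetical. The assignment of the values 0.0091 and 0.0310 to the functional-lineage branches after and before the pseudogene-forming event is an assumed reading of Fig. 3, chosen because it reproduces the reported estimate of about 18 Mya; it is not a statement of the exact computation used in the study.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]  # skip gaps and ambiguity codes
    n = len(pairs)
    transitions = sum(a != b and (a in PURINES) == (b in PURINES)
                      for a, b in pairs)
    transversions = sum((a in PURINES) != (b in PURINES) for a, b in pairs)
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def local_clock_age(branch_after_event, branch_before_event, split_mya=80.0):
    """Age of an event on the functional-gene lineage under a local clock.

    The functional lineage since the calibrated split is divided into the
    part before the event and the part after it; the event age is the
    'after' share of the split time.
    """
    total = branch_after_event + branch_before_event
    return split_mya * branch_after_event / total


d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")   # one transition in ten sites
age = local_clock_age(0.0091, 0.0310)           # assumed reading of Fig. 3
```

Only branches of the functional genes enter the age calculation, which is the point of the local clock: the accelerated pseudogene branch is left out.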

Results

Protein kinases: An evolutionary overview

We have reconstructed phylogenetic trees for 518 human protein kinases and 106

human pseudogenes, which include their homologous genes in vertebrates (mammals, birds,

frogs, and fishes), invertebrates (largely fruit fly and worm), and other eukaryotes. Our

phylogenetic analysis is generally consistent with the human kinome classification (Manning

et al., 2002). We have confirmed that most protein kinase families existed before the

origin of vertebrates and that most subfamilies emerged in the early stage of vertebrates

(Manning et al., 2002b; Miyata and Suga, 2001). Protein tyrosine kinases (PTKs), one major

animal-specific subclass that includes 29 families, are likely monophyletic, derived from a

precursor protein serine-threonine kinase (PSK) during the course of animal evolution,

while the major types of PSKs already existed in the early stage of eukaryotes (Hanks et al.,

1988; Hanks and Hunter, 1995; Hunter, 1995).

[Figure 3 image: neighbor-joining tree of human PAK2, the pseudogene PAK2ps, and rabbit, mouse, rat, and frog homologs, with branch lengths and bootstrap values; T = 80 Mya marks the human-mouse split; scale bar 0.02.]

Fig. 3. Molecular phylogeny of the P21-activated kinase (PAK2) gene and pseudogene.

The branch length represents the number of nucleotide substitutions per site under a Kimura

two-parameter model. The human-mouse split T = 80 Mya is used to estimate T1 to be 18

Mya, the time point at which the PAK2 pseudogene was generated.

Comparison of the multiple domain structure of protein kinases showed that the

kinase domain is the only conserved region among most protein kinase families, indicating an

important role of domain shuffling/duplication in the emergence of major signal

transduction pathways featured by different protein kinases. In contrast, the multiple domain

structure of most kinase families remains conserved among member genes, suggesting the

impact of gene or genome duplications on the expansion of protein kinases toward

developmental-stage or tissue specificity. Moreover, only very few (~2) human protein kinases are


recently duplicated (after the human/mouse split), in spite of the recent finding that a significant

portion of the human genome has been duplicated within the last 40 million years (Lander et al., 2001).

To further characterize the relationship between underlying evolutionary

mechanisms, evolutionary time periods, and the evolution of signal-transduction pathway, we

have developed a robust procedure to infer the events of domain shuffling, gene duplication,

or pseudogene duplication (see Methods). The 524 duplication events identified all occurred

in the ancestral lineage of humans. Among them, 95 are pseudogene duplication/

retrotransposition events, 222 are gene duplications, and 187 are domain duplication events.

We are not able to classify the remaining 20 duplication events, which occurred among single-domain

kinases as domain or gene duplications.

Remarkably, the age distribution of protein kinases and pseudogenes has recaptured

the unique evolutionary features of human gene families as first described by Gu et al.

(2002). Clearly, the age distribution of human kinome shows a pattern characterized by three

modules (I, II, and III) (Fig. 4). Module I coincides with the processed pseudogene

explosion (Zhang et al., 2002) or abundant recent tandem or segmental duplications (Eichler,

2001). Module II indicates kinase gene family expansion in the early stage of vertebrate

evolution by genome (large-scale) duplication (Gu et al., 2002; McLysaght et al., 2002). The

ancient module (III) includes duplication events that took place during and before metazoan

evolution.

Ancient components: domain shuffling/duplication

Domain shuffling (or duplication), which can occur by transposition of gene

fragments or nonhomologous recombination, is thought to have been a major mechanism in

the evolution of new multiple-domain proteins (Doolittle, 1995). Like many other molecules

involved in signal transduction, most protein kinases are composed of multiple domains with

discrete structures and specific functions. Separate phylogenetic analysis of protein kinase

domain and their associated non-catalytic domains (SH2, SH3, PHD, etc.) has confirmed that

the kinase superfamily was indeed generated by shuffling of domains from a few more

ancient kinases (Krupa and Srinivasan, 2002).

It is known that time estimation of ancient evolutionary events (either duplication or

speciation) may have broad sampling variance or bias (Gu, 1998). Nevertheless, the age

distribution of domain shuffling events in protein kinases shows that the majority of domain

duplication events happened during a very ancient time period. As shown in Fig. 4, no domain

shuffling event is found after 550 Mya, and the majority occurred during the time period

from the early stage of eukaryotes to metazoan evolution. Conserved motifs shared with

eukaryotic protein kinases have been identified in bacteria as well as in archaea, which implies

the existence of an ancestral protein kinase prior to the divergence of eukaryotes, bacteria,

and archaea (Leonard et al., 1998). However, the biological reason why domain shuffling

virtually ceased after the origin of vertebrates remains unclear.

[Figure 4 image: two histograms of duplication-event ages, one on a molecular time scale in Byr ago (bins from <1.0 to >5.0 in steps of 0.5, with module III marked) and one on a molecular time scale in Myr ago.]

Fig. 4. Age distribution of human kinases. The molecular time scale is measured as Myr ago

(Mya). Each bin for the histogram is 50 Myr, except for the most ancient one (>1,500 Myr

ago). 1 Byr = 1 billion years (1,000 Myr).

On the basis of phylogenetic analysis of eight animal-specific gene families including

sponge protein tyrosine kinases, Suga et al. (2001) suggested that most domain shuffling

events date back to before the parazoan-eumetazoan split (~940 Mya), the

earliest divergence among extant animal phyla. Our comparative analysis, however, does not

support this notion. Rather, we provide evidence (Fig. 4) that during the time period from

the early stage of eukaryotes to metazoan evolution, roughly corresponding to 2,000 Mya (2.0

Bya, billion years ago) to 700 Mya, domain shuffling in protein kinases occurred as a

continuous flux, at an approximately constant rate of 0.7 x 10^-7 events per year per

genome. Before this time period (very early eukaryote stage, or prior to the eukaryote-archaea

split, > 2.0 Bya), the rate of domain shuffling was down to ~50%, while after this time period

(vertebrates), the rate is virtually zero.

Large scale duplication of human protein kinases

It was proposed long ago that genome duplications play an important role in

building up the complexity of the human genome (Ohno, 1970). Our recent analysis of the age

distribution of human gene families has confirmed a rapid explosion of genes in the early

stage of vertebrates (Gu et al., 2002), resulting in tissue-specific (or developmental

stage-specific) isoforms. Molecular phylogeny-based analysis of gene families involved in

cell-cell communication and developmental control (e.g., protein tyrosine kinases, phosphatases,

etc.) has shown that extensive gene duplications occurred in a period around or immediately

before the divergence of tetrapods and fishes (Miyata and Suga, 2001). The recent

completion of human kinome (Manning et al., 2002) provides us an opportunity to study the


evolutionary pattern of the complete set of human kinases, as a step toward a better

understanding of the whole set of human genes.

Based on sequence similarity (kinase domain similarity), domain organization and

known biological functions, we identified 81 gene families with two human kinase member

genes, 34 gene families with three human kinase member genes, and 14 gene families with

four human kinase member genes. Our classification generated slightly different kinase gene

families compared to the kinome classification scheme. We found no evidence for the 1:4 rule

(i.e., one gene in Drosophila but four homologous genes in human) in

human kinome evolution. We also examined the phylogeny of gene families with

four member genes. Two types of topology can be observed: ((A, B), (C, D)) or (((A, B), C),

D). There is no evidence that the topology ((A, B), (C, D)), as expected under the 2R genome

duplication hypothesis, occurred more frequently than the other topology.
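
The two topology classes can be told apart mechanically from the root split of each four-gene tree. The sketch below assumes rooted, bifurcating trees written as nested tuples; the representation, the function names, and the example trees are hypothetical.

```python
from collections import Counter


def count_tips(node):
    """Number of genes below a node; a gene is a string, a clade a tuple."""
    if isinstance(node, str):
        return 1
    return sum(count_tips(child) for child in node)


def four_gene_shape(tree):
    """Classify a rooted, bifurcating four-gene tree by its root split."""
    left, right = tree
    sizes = sorted((count_tips(left), count_tips(right)))
    if sizes == [2, 2]:
        return "symmetric ((A,B),(C,D))"
    if sizes == [1, 3]:
        return "asymmetric (((A,B),C),D)"
    raise ValueError("expected a rooted tree with exactly four genes")


# Hypothetical family trees (gene names are placeholders).
trees = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    ("D", ("C", ("A", "B"))),
]
tally = Counter(four_gene_shape(t) for t in trees)
```

Of the 15 rooted topologies for four labelled genes, 3 are symmetric and 12 are asymmetric, so a comparison of observed counts needs an explicit null model; which null is appropriate is left open here.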

Among the 222 estimated kinase gene duplication events that occurred within gene

families, 135 (~61%) occurred in the time period before the emergence of teleosts but after

the vertebrate-amphioxus split (430-750 Mya) (module II in Fig. 4). This rapid increase in

the number of paralogous human kinases observed in the early stage of vertebrate evolution

is as expected from the genome duplication hypothesis. Human tissue-specific isoforms of

protein kinases are involved extensively in signal transduction in eukaryotic cells, as well as

in controlling many other cellular processes. Thus genome duplication may have made a major

contribution to the tissue- (or developmental stage-) specific kinase isoforms after the split

between vertebrates and arthropods.

Evolution of human kinase pseudogenes

Of the 106 human kinase pseudogenes identified by Manning et al. (2002), 7

pseudogenes were excluded from further analysis because of their very short sequences. Phylogenetic

analysis shows that most human kinase pseudogenes (75/99) are generated after the

mammalian radiation. Because of the apparent fast evolution caused by the loss of functional

constraint in the pseudogene lineage, we use some special treatments to date the duplication

events which gave rise to the current pair of functional (parent) kinase gene and pseudogene

(see Methods).


Generally, pseudogenes are believed to be generated through two different

mechanisms: via retrotransposition resulting in an intronless processed pseudogene or via

duplication of genomic DNA resulting in a duplicated pseudogene. Of the 106 human kinase

pseudogenes, 75 lack introns and were therefore likely generated by retrotransposition

of a processed transcript. It has been reported that from a little more than half to as many as

eighty percent of pseudogenes are processed (Harrison et al., 2002; Mighell et al., 2000),

so it is not surprising that the majority of human kinase pseudogenes are processed. It has

been proposed that processed pseudogenes, as well as both the Alu repeats (a major class of

SINEs, short interspersed elements) and LINE repeats (long interspersed elements), arose through

the protein machinery encoded by LINE1 elements (Jurka, 1997; Esnault et al., 2000). The

age distribution of human kinase intronless pseudogenes shows that more than half of the

intronless pseudogenes were generated very recently (< 30 Mya) at a fairly constant rate (Fig.

5), while a considerable proportion of MARK intronless pseudogenes were generated after the

human-mouse split.

[Figure 5 image: histogram with vertical-axis ticks at 5, 10, 15, and 20 and a horizontal axis of molecular time scale (Myr ago) in 5-Myr bins from 5 to 80, plus > 80.]

Fig. 5. Age distribution of human processed pseudogenes.


Twenty-nine pseudogenes were identified that contain introns. They are more likely

to have been generated through gene/genome duplication events, most of which occurred on the

anthropoid lineage. Recent analysis of human genome sequence has revealed the fact that

about 5% of the human genome consists of interspersed segmental duplications that have

arisen over the past 35 million years (Eichler 2001, etc.). The age distribution of kinase

pseudogenes obviously resembles the pattern of recent segmental/tandem duplication (data

not shown). At this rate, we estimated that 478 x 5% ~ 24 kinase pseudogenes were

generated through segmental/small-scale duplications. According to these observations, we

propose a scenario for recent human protein kinase evolution: within 35 million years,

segmental duplications (small-scale mode, Gu et al. 2002) provided a continuous flux for

generating duplicate kinase genes, which evolved into protein kinase pseudogenes. For

example, two p70 ribosomal protein S6 kinase (p70S6K) pseudogenes appear to be part of a

large duplicon containing multiple duplicated genes, suggesting an intrachromosomal

duplication of the p70S6K locus (Manning et al., 2002). Apparently, these duplicates finally

became nonfunctional (pseudogenes) because there seems to be no need for new protein kinases. In

this regard, one may conclude that the protein kinase-related signal transduction pathways

have evolved slowly since the mammalian radiation, while regulatory motif changes may have become

the major force for functional/expressional divergence. This hypothesis should be tested by

further multi-genome analysis of protein kinases.

Of the 106 human kinase pseudogenes, 27 are found to have matching ESTs. The age

distribution of the pseudogene duplications that generated these "expressed" pseudogenes appears

random and uniform, indicating no obvious correlation between the age of pseudogene

duplication and change in gene transcriptional regulation. There are several possible

explanations: (i) a recently produced pseudogene is still decaying and

has not lost all its transcriptional capability; (ii) the accumulated mutations that resulted in

functional loss are likely to have occurred in the coding region rather than in the transcriptional

regulatory region; (iii) ESTs were misassigned to kinase pseudogenes.

Fates of duplicate genes and the gene-loss hypothesis

Gene duplication has generally been viewed as a necessary source for the origin of

novel functionality. In addition to igniting functional innovation, gene duplication may also

provide a redundant copy that can lose the original function. Because the majority of

mutations are deleterious and gene duplications are generally assumed to be functionally

redundant at the time of origin, the usual fate of a duplicate-gene pair is the

nonfunctionalization of one copy (Lynch and Conery, 2000). Recently, a less-is-more

hypothesis has been proposed to emphasize the importance of loss-of-function mutations in

recently evolved lineages such as the human lineage (Olson and Varki, 2003). In several

cases, genetic loss in human lineage has been shown to cause some differences between

human and chimpanzee.

As reported in Manning et al. (2002), three pseudogenes have no obvious parent

kinase in human but have functional orthologs in rodents. Our re-examination showed that the

parent kinase of the polo-like kinase pseudogene SGF384ps has been identified (EMBL Accession:

AK054808) in human. But for KSGC (kinase-like domain containing soluble guanylyl

cyclase) and CGD, no functional human orthologous kinase has been identified (Figure 6).

Phylogenetic analysis shows that they both belong to the retinal guanylyl cyclase family

(GTP pyrophosphate-lyase). These two pseudogenes both contain introns and conserved

canonical splice sites. Moreover, matching ESTs have been found. We infer that these two

genes lost their functions recently, so that few mutations have accumulated.

In rat, CGD is specifically expressed in a small, randomly dispersed subpopulation of

olfactory sensory neurons and may play a role in responding to a restricted group of odors,

since its expression pattern resembles that of the diverse

seven-transmembrane-domain odorant receptors (Fulle et al., 1995; Juilfs et al., 1997). KSGC, found in

rat kidney cells, appears to contain the kinase-like, dimerization, and catalytic domains but

no ligand-binding or transmembrane domains, indicating that KSGC is a

cytoplasmically localized GC (Kojima et al., 1995). The specific function and unique

domain structure of CGD and KSGC suggest the decay of previously functional genes in

the human lineage after its split from rodents. This hypothesis could be further tested when

genome sequences from other species such as chimpanzee are available.

[Figure 6 image: phylogenetic tree of guanylyl cyclase sequences from human, dog, bovine, rat, mouse, chicken, guinea pig, pig, frog, bullfrog, eel, dogfish, and medaka fish (OLGC genes), with bootstrap values; clade labels include GUCY2E, GUCY2D (CGD), GUCY2F, GUCY2C, ANPA, and KSGC, and the pseudogenes CGDps and KSGCps are marked; scale bar 0.1.]

Fig. 6. Molecular phylogenetic analysis of retinal guanylyl cyclase. Two human kinase

pseudogenes CGDps and KSGCps are indicated.


Discussion

The recently catalogued complete set of human protein kinases (kinome) allows us to

test an interesting hypothesis on the evolution of protein kinases at the genome level: (i) The

major kinase-related animal specific signal-transduction pathways have been generated

through domain shuffling during the time period from the early stage of eukaryotes to the

course of metazoan evolution; (ii) vertebrate tissue-specificity of signal-transduction is

facilitated by large-scale duplication event(s) in the early stage of vertebrates; and (iii)

continuous small-scale duplication events and retrotranspositions that occurred recently resulted

in the nonfunctional kinase pseudogenes.

Unlike the continuous mode of retrotransposition of kinase processed pseudogenes,

the distribution of ribosomal protein (RP) pseudogenes shows a peak at an evolutionary age

corresponding to 8%-10% sequence divergence, whereas Alus peak at 7% and LINE1 elements peak

at both 4% and 21% (Zhang et al., 2002). Considering the fast evolutionary rate of pseudogenes

(3.9 x 10^-9 substitutions per year) and the relatively slow evolutionary rate of functional genes (1.5 x

10^-9 substitutions per year) (Li, 1997), one percent of divergence between a pseudogene and its

functional gene converts to approximately 2.1 Myr. Thus the peak of generation of

processed RP pseudogenes appeared at 15-20 Mya, and the rate at which new processed RP

pseudogenes are generated has slowed down since then. The overall activity of transposons has

declined markedly over the past 35-50 Myr, with the possible exception of LINE1 (Lander

et al., 2001). Interestingly, both LINEs and processed pseudogenes are more

prevalent in relatively GC-poor regions, while Alus are predominantly found in GC-rich

regions (Pavlicek et al., 2001; Zhang et al., 2002). Notably, processed pseudogenes are generally

not expressed; in very rare cases, however, transcripts of some pseudogenes have been reported,

though the functional relevance of these pseudogene transcripts remains unclear. We

observed seventeen intronless pseudogenes with matching ESTs; one possible explanation

for this unusually high rate of transcribed processed pseudogenes is the high sequence

similarity between processed pseudogenes and ESTs (neither contains introns).

Acknowledgement: We are grateful to Yufeng Wang for constructive comments. This study

was supported by NIH grant R01 GM62118 (to X. G.).


References

Bailey, J. A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M. D.,

Myers, E.W., Li, P.W. & Eichler, E. E. (2002) Recent segmental duplications in the human

genome. Science 297: 1003-1007.

Doolittle, R.F. (1995) The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287-

314.

Eichler, E. E. (2001) Recent duplication, domain accretion and the dynamic mutation of the

human genome. Trends in Genetics 17: 661-669.

Esnault, C., Maestre, J. & Heidmann, T. (2000) Human LINE retrotransposons generate

processed pseudogenes. Nat. Genet. 24: 363-367.

Fulle, H.-J., Vassar, R., Foster, D. C., Yang, R., Axel, R. & Garbers, D. L. (1995) A receptor

guanylyl cyclase expressed specifically in olfactory sensory neurons. Proc. Natl. Acad. Sci.

USA 92: 3571-3575.

Gu, X., Wang, Y. & Gu, J. (2002) Age distribution of human gene families shows significant

roles of both large- and small-scale duplications in vertebrate evolution. Nature Genetics 31:

205-209.

Hanks, S. K., Quinn, A. M. & Hunter, T. (1988) The protein kinase family: conserved

features and deduced phylogeny of the catalytic domains. Science 241: 42-52.

Hanks, S. K. & Hunter, T. (1995) The eukaryotic protein kinase superfamily: kinase

(catalytic) domain structure and classification. FASEB J. 9: 576-596.

Harrison, P. M., Hegyi, H., Balasubramanian, S., Luscombe, N. M., Bertone, P., Echols, N.,

Johnson, T. & Gerstein, M. (2002) Molecular fossils in the human genome: identification and

analysis of the pseudogenes in chromosomes 21 and 22. Genome Research 12: 272-280.

Hunter, T. (1995) Protein kinases and phosphatases: the yin and yang of protein

phosphorylation and signaling. Cell 80: 225-236.

Juilfs, D. M., Fulle, H.-J., Zhao, A., Houslay, M. D., Garbers, D. L. & Beavo, J. A. (1997) A

subset of olfactory neurons that selectively express cGMP-stimulated phosphodiesterase

(PDE2) and guanylyl cyclase-D define a unique olfactory signal transduction pathway. Proc.

Natl. Acad. Sci. USA 94: 3388-3395.

Jurka, J. (1997) Sequence patterns indicate an enzymatic involvement in integration of

mammalian retroposons. Proc. Natl. Acad. Sci. 94: 1872-1877.


Kojima, M., Hisaki, K., Matsuo, H. & Kangawa, K. (1995) A new type soluble guanylyl

cyclase, which contains a kinase-like domain: its structure and expression. Biochem.

Biophys. Res. Commun. 217: 993-1000.

Krupa, A. & Srinivasan, N. (2002) The repertoire of protein kinases encoded in the draft

version of the human genome: atypical variations and uncommon domain combinations.

Genome Biol. 3(12): Research0066.1-0066.14.

Krylov, D. M., Nasmyth, K. & Koonin, E. V. (2003) Evolution of Eukaryotic Cell Cycle

Regulation: Stepwise Addition of Regulatory Kinases and Late Advent of the CDKs. Current

Biology 13: 173-177.

Lander, E. S. et al. (2001) The international human genome sequencing consortium. Initial

sequencing and analysis of the human genome. Nature 409: 869-921.

Leonard, C. J., Aravind, L. & Koonin, E. V. (1998) Novel families of putative protein

kinases in bacteria and archaea: evolution of the "eukaryotic" protein kinase superfamily.

Genome Research 8: 1038-1047.

Li, W. H. (1997) Molecular Evolution, Sinauer Associates, Inc., Sunderland, MA.

Lynch, M. & Conery, J. S. (2000) The evolutionary fate and consequences of duplicate

genes. Science 290: 1151-1155.

Manning, G., Whyte, D. B., Martinez, R., Hunter, T. & Sudarsanam, S. (2002) The protein

kinase complement of the human genome. Science 298: 1912-1934.

Manning, G., Plowman, G. D., Hunter, T. & Sudarsanam, S. (2002) Evolution of protein

kinase signaling from yeast to man. Trends Biochem Sci. 27: 514-520.

McLysaght, A., Hokamp, K. & Wolfe, K. H. (2002) Extensive genomic duplication during

early chordate evolution. Nature Genetics 31: 200-204.

Miyata, T. & Suga, H. (2001) Divergence pattern of animal gene families and relationship

with the Cambrian explosion. BioEssays 23: 1018-1027.

Ohno, S. (1970) Evolution by gene duplication. New York: Springer Verlag.

Olson, M. V. & Varki, A. (2003) Sequencing the chimpanzee genome: insights into human

evolution and disease. Nat Rev Genet. 4: 20-28.

Pavlicek, A., Jabbari, K., Paces, J., Paces, V., Hejnar, J. & Bernardi, G. (2001) Similar

integration but different stability of Alus and LINEs in the human genome. Gene 276: 39-45.


Samonte, R. V. & Eichler, E. E. (2002) Segmental duplications and the evolution of the

primate genome. Nat Rev Genet. 3: 65-72.

Smit, A. F. (1999) Interspersed repeats and other mementos of transposable elements in

mammalian genomes. Curr Opin Genet Dev. 9: 657-663.

Suga, H., Katoh, K. & Miyata, T. (2001) Sponge homologs of vertebrate protein tyrosine

kinases and frequent domain shufflings in the early evolution of animals before the parazoan-

eumetazoan split. Gene 280: 195-201.

Zhang, Z., Harrison, P. & Gerstein, M. (2002) Identification and analysis of over 2000

ribosomal protein pseudogenes in the human genome. Genome Research 12: 1466-1482.


CHAPTER VI: GENERAL CONCLUSION

This study is focused on the exploration of functional divergence and

evolutionary patterns in vertebrate kinomes and kinase-regulated signal transduction,

using a combinatorial statistical and evolutionary approach. After a general introduction in

Chapter I, the major findings are reported in Chapter II to Chapter V.

Summary of Findings

In the first study (Chapter II), we investigated the natural history and evolutionary patterns of the protein tyrosine kinase (PTK) supergene family. PTKs are key mediators of cellular signaling in metazoans and are directly associated with a variety of human diseases. Our analysis showed that site-specific shifts in evolutionary rate (altered functional constraints) are a common pattern in PTK gene family evolution. Our results are summarized as follows: (1) Both large-scale and small-scale gene duplications are major evolutionary forces that generated the contemporary PTK superfamily. (2) Substantial functional divergence occurred after gene duplication(s), characterized by significant shifts in evolutionary rates. (3) Evolutionary functional divergence is correlated with phenotypic functional divergence in paralogous genes. These results not only shed light on the role of gene duplication in the development of hierarchical PTK-mediated networks, but also provide impetus for a new approach to predicting functional divergence from evolutionary changes.

In the research presented in Chapter III, we conducted a statistical and phylogenetic analysis of the Jak (Janus kinase) gene family. In addition to a typical kinase domain (JH1), the Jak family possesses a unique noncatalytic pseudokinase domain (JH2). This dual kinase domain structure provides an ideal model for testing the impact of (internal) domain duplication on functional divergence. The impact of gene duplication, which may have given rise to four tissue-specific member genes (Jak1, Jak2, Jak3 and Tyk2) early in vertebrate evolution, is of particular interest in this study as well. Our results show that both domain and gene duplications are important for Jak gene family proliferation and evolution: (1) Shifted selective constraints (or shifted evolutionary rates) are statistically significant between JH1 and JH2. (2) The amino acid sites predicted by posterior analysis to be responsible for the differences between JH1 and JH2 can be classified into two groups: highly conserved in JH1 but highly variable in JH2, and vice versa. The JH2 domain appears to have acquired some new functions that may account for its long-term persistence during evolution, since a group of sites conserved in the JH2 domain are variable in the JH1 domain. (3) In the early vertebrate lineage, two rounds of gene duplication generated four tissue-specific vertebrate isoforms; after the first gene duplication, site-specific rate shifts between Jak2/Jak3 and Jak1/Tyk2 are significant. This divergence took place mainly in the JH1 domain, but not in the JH2 domain, indicating that the JH1 domain probably plays a relatively important role in functional specification among tissue-specific isoforms.
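The two groups of sites described in point (2) can be illustrated with a small sketch. This is not the posterior analysis used in Chapter III. It is a simplified stand-in that assumes per-site Shannon entropy as a rough proxy for variability, with arbitrary thresholds. The function names, the thresholds `low` and `high`, and the toy sequences are all hypothetical and were chosen only for illustration.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def classify_sites(cluster_a, cluster_b, low=0.5, high=1.0):
    """Label each aligned site by comparing variability in two clusters.

    cluster_a and cluster_b are lists of equal-length aligned sequences.
    A site is 'conserved_in_A' if its entropy is at most `low` in A and at
    least `high` in B, 'conserved_in_B' for the reverse, else 'no_shift'.
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    labels = []
    for col_a, col_b in zip(zip(*cluster_a), zip(*cluster_b)):
        h_a, h_b = site_entropy(col_a), site_entropy(col_b)
        if h_a <= low and h_b >= high:
            labels.append("conserved_in_A")
        elif h_b <= low and h_a >= high:
            labels.append("conserved_in_B")
        else:
            labels.append("no_shift")
    return labels


# Made-up toy alignments standing in for two domain clusters (not real JH1/JH2 data).
jh1_like = ["KLGA", "KLGS", "KLGT", "KLGV"]
jh2_like = ["KAGD", "KCGD", "KDGD", "KEGD"]

print(classify_sites(jh1_like, jh2_like))
# ['no_shift', 'conserved_in_A', 'no_shift', 'conserved_in_B']
```

In this toy input, the second site is invariant in the first cluster but fully variable in the second, and the fourth site shows the opposite pattern, which mirrors the two site categories described above. A real analysis would also need to account for the phylogeny and for statistical uncertainty, which this sketch ignores.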

In Chapter IV, we explored the functional divergence of major components of the kinase-mediated TGF-β signaling pathways from an evolutionary perspective. Members of the gene families involved in this complicated pathway are usually conserved in sequence but distinct in temporal and tissue-specific expression. Our major findings include: (1) Altered functional constraint is a common pattern in both ligand and receptor evolution in TGF-β signaling; (2) Gene duplication might be an evolutionary force for functional divergence; (3) The early vertebrate stage might be an important time span for the development of tissue specificity and signaling pathways; and (4) The correlation between structural and functional divergence seems to be evident.

After the analysis of an individual kinase gene family (Jak), the protein tyrosine kinase superfamily, and a kinase-mediated signal transduction pathway (TGF-β), we attempted to globally explore functional divergence in the whole set of vertebrate kinases (the kinome) (Chapter V). The age distribution of kinase gene families showed that (1) The major kinase-related, animal-specific signal-transduction pathways were generated through ancient, continuous domain shuffling (or duplication) during the period from the early stage of eukaryotes to metazoan evolution; (2) Vertebrate tissue-specificity of signal transduction was facilitated by large-scale duplication event(s) in the early stage of vertebrates; and (3) The kinase pseudogenes were generated very recently through either segmental duplication or retrotransposition. These findings are essentially congruent with our earlier report that both large-scale and small-scale gene duplications contributed significantly to vertebrate evolution.


Ongoing and Future Work

My long-term research goal is to integrate multilayer genomic information to study evolutionary/comparative genomics and computational biology. The completed research reported here represents my effort to systematically study the evolutionary mechanisms in complex biosystems, using the kinome as a paradigm. In the short term, I hope to develop research programs to:

• Exploit the temporal-spatial expression profile of the kinome, with a focus on statistical tests on publicly available microarray data (Gu and Gu 2003).

• Further investigate the effect of kinome proliferation on functional innovations, from tissue-specificity to molecular pathways, and reveal the joint distribution of age and chromosome location of duplicated genes.

• Extend the study of functional divergence to other important families in the kinase signaling network. The target families include transcription factors and G-protein coupled receptors.


ACKNOWLEDGEMENTS

A journey is easier when you travel together. Interdependence is certainly more valuable than independence. This thesis is the result of five years of work during which I have been accompanied and supported by many people. I am glad to have the opportunity to express my gratitude to all of them.

I am deeply indebted to my major advisor, Dr. Xun Gu, whose insight, stimulating suggestions and encouragement helped me throughout the research for this thesis. His enthusiasm and integral view of science have made a deep impression on me. I would also like to thank the other committee members, Dr. Dan Nettleton, Dr. Linda Ambrosio, Dr. Gavin Naylor and Dr. Daniel Voytas, for their support during my Ph.D., and especially Dr. Dan Nettleton for his guidance and advice during my study for the M.S. in Statistics.

My research work was also strongly supported by the members of the Computational Genomics and Evolution Lab. I want to thank them for all their help, support, interest and valuable hints. I am especially obliged to Dr. Yufeng Wang. Our discussions of coursework, research and life in general were always a welcome addition.

I would like to give my special thanks to my family. They have contributed so much

to my life that I could never acknowledge it all in this limited space.


VITA

NAME OF AUTHOR: Jianying Gu

DEGREES AWARDED:

B.S. in Microbiology, Fudan University, 1995

M.S. in Molecular Genetics, Shanghai Institute of Plant Physiology, 1998

M.S. in Statistics, Iowa State University, 2002

HONORS AND AWARDS:

James Cornette Research Excellence Award in Bioinformatics, Iowa State University, 2002

PROFESSIONAL PUBLICATIONS:

Gu J, Gu X (2003) Natural history and functional divergence in protein tyrosine kinases. Gene. In press.

Gu J, Gu X (2003) Induced gene expression in human brain after the split from chimpanzee. Trends in Genetics. 19(2): 63-65.

Gu J, Wang Y, Gu X (2002) Evolutionary perspectives for functional divergence of Jak protein kinase domains and tissue-specific genes. J. Mol. Evol. 54: 725-733.

Gu J, Gu X (2002) Evolutionary analysis of functional divergence in TGF-β signaling pathway. Information Sciences. 145(3-4): 195-204.

Gu J, Gu X (2002) Functional divergence in TGF-β signaling pathway. In: Proceedings of the 6th Joint Conference on Information Science. Pp. 1268-1273.

Gu X, Wang Y, Gu J (2002) Age-distribution of human gene families showing significant roles of both large and small-scale duplications in vertebrate evolution. Nature Genetics. 31: 205-209.

Gu X, Wang Y, Gu J, Vander Velden K (2002) Evolutionary perspective for functional divergence of gene family and applications in functional genomics. Current Genomics. 3: 201-211.

Gu J, Wang Y, Gu X (2001) Statistical testing for protein functional divergence after major evolutionary events. In: Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology. Pp. 176-180.

Wang Y, Gu J, Gu X (2001) Patterns of functional divergence after gene duplications in vertebrate gene families: altered functional constraints. In: Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology. Pp. 181-185.

Gu J, Yu G, Zhu J, Shen S (2000) The N-terminal domain of NifA determines the temperature sensitivity of NifA in Klebsiella pneumoniae and Enterobacter cloacae. Science in China (Series C) 43(1): 8-15.