detection and prediction of alternative splicing in...

Int. J. Computational Biology and Drug Design, Vol. 1, No. 1, 2008 39

Copyright © 2008 Inderscience Enterprises Ltd.

Detection and prediction of alternative splicing in Arabidopsis thaliana

M. Park* Department of Computer Science, University of Massachusetts, Lowell, MA, 01854, USA E-mail: [email protected] *Corresponding author

D.L. Falcone Department of Biological Sciences, University of Massachusetts, Lowell, MA, 01854, USA E-mail: [email protected]

K.M. Daniels Department of Computer Science, University of Massachusetts, Lowell, MA, 01854, USA E-mail: [email protected]

Abstract: Alternative splicing is an important process for increasing the diversity arising from a single gene. Presently, most studies aimed at detecting alternatively spliced genes use Expressed Sequence Tags (ESTs). However, the EST studies based on spliced transcripts analyse sequences by alignment rather than sequence patterns. Second, EST libraries can be of uncertain quality. To address these issues and to improve the quality of detection and prediction for alternative splicing, we propose a method that primarily uses pre-mRNAs. It is achieved by a decision tree algorithm using triplet nucleotides as attributes for each chromosome in Arabidopsis thaliana. In addition, we propose a novel algorithm for accurate prediction.

Keywords: alternative splicing; pre-mRNA; triplet nucleotides; detection and prediction; Arabidopsis thaliana.

Reference to this paper should be made as follows: Park, M., Falcone, D.L. and Daniels, K.M. (2008) ‘Detection and prediction of alternative splicing in Arabidopsis thaliana’, Int. J. Computational Biology and Drug Design, Vol. 1, No. 1, pp.39–58.

Biographical notes: Minseo Park is a ScD candidate in the Department of Computer Science at the University of Massachusetts, Lowell. She has been researching in bioinformatics.

40 M. Park et al.

Deane L. Falcone is an Assistant Professor in the Department of Biological Sciences at the University of Massachusetts, Lowell. He has been researching environmental stress of plants and manipulation of plant metabolic pathways for general plant improvement.

Karen M. Daniels is an Associate Professor in the Department of Computer Science at the University of Massachusetts, Lowell. She has been researching in bioinformatics and data mining.

1 Introduction

Alternative splicing occurs when converting pre-mRNA into mature mRNA. It is one of the most important processes for gene expression since it can provide various functional proteins from one pre-mRNA (Chuang et al., 2004; Sorek et al., 2004; Stamm et al., 2006). Aberrations in alternative splicing can also lead to mutations and diseases (NCBI).

Alternative splicing is classified into two types: one in which the protein coding region of a gene is eliminated or retained and the other in which the non-protein coding region is eliminated or retained. In the protein coding region, alternative splicing regulates protein functions, while the non-protein coding region can modulate gene expression levels and regulate mRNA stability (Haas et al., 2003; Sorek et al., 2004). This paper focuses on alternative splicing in acceptor/donor sites in the protein coding region since the majority of alternatively spliced genes are processed this way, and it significantly affects the encoding nucleotides in the flowering plant Arabidopsis thaliana (Campbell et al., 2006) studied in this paper.

In general, alternative splicing is classified into four patterns defined as ‘Alternative Acceptor Site’, ‘Cassette Exon’, ‘Alternative Donor Site’, and ‘Retained Intron’ (Nurtdinov et al., 2003). We chose Arabidopsis thaliana as the model organism since it has already been sequenced and published on The Arabidopsis Information Resource (TAIR) database. In Arabidopsis thaliana, The Institute for Genomic Research (TIGR) classifies alternative patterns as ‘Alternative Donors and/or Acceptors’, ‘Alternate Terminal Exons’, ‘Exon Skipping’, ‘Unspliced Introns’, and ‘Transcripts Beginning or Ending within Introns’. This paper concentrates on the ‘Alternative Acceptor Site’ and ‘Alternative Donor Site’ since nearly half of alternative splicing occurs in those areas in Arabidopsis thaliana (Black, 2003; Haas et al., 2003).

Even though many methods have been developed to detect or predict alternative splicing, these methods have not been able to completely accomplish their objectives using only genomic DNA sequences (Sorek et al., 2004; Stamm et al., 2006). Hence, most studies on detection of alternative splicing use ESTs. These studies have strengths in that splicing sites can be detected easily and conveniently. However, there are issues such as ESTs that are low in abundance and detection problems in areas that do not have ESTs. Therefore, we propose a method to detect or predict alternative splicing that does not directly depend on ESTs. The proposed method considers pre-mRNA within acceptor/donor sites in Arabidopsis thaliana. We also propose a new algorithm for prediction accuracy to compensate for the overfitting problem in a decision tree and to foster accuracy in prediction.

Detection and prediction of alternative splicing in Arabidopsis thaliana 41

For the purpose of training and testing the method we propose, we considered the three databases, TAIR, TIGR, and Alternative Splicing in Plants (ASIP), plus two unlabelled experimental data from our laboratory. The result of this study shows that alternatively spliced and normally spliced sites carry different patterns according to triplet nucleotides in each chromosome.

In Section 2, we introduce previously used alternative splicing approaches, an EST approach and a microarray approach, and discuss classification methods in data mining and conserved consensus for triplet nucleotides in acceptor/donor sites. Section 3 describes the proposed method to classify genes within pre-mRNA as alternatively spliced or normally spliced and describes a new algorithm for prediction accuracy. Section 4 shows the results. Finally, Section 5 draws conclusions and suggests directions for future research. This paper is an extended version of Park et al. (2007).

2 Related work

2.1 EST approach

Several alternative splicing studies use ESTs (Haas et al., 2003; Iida et al., 2004; Pertea et al., 2001; Wang and Brendel, 2006; Zhu et al., 2003). ESTs are small subsequences of DNA derived from mRNA (NCBI). They are used as markers of transcription of a particular gene and therefore can serve to indicate whether a gene is expressed or not, and can be used to establish intron-exon boundaries. ESTs are obtained by isolation of total mRNA then ‘reverse-transcribed’ to DNA, which are then sequenced. They do not contain intron sequences since they are derived from mRNA. Thus, we can detect mRNA through the ESTs that are continuous and connected to each other after locating and assembling them.

Use of ESTs in alternative splicing studies proceeds through two processes: gap-patching and redundancy-removing (Chuang et al., 2004). The gap-patching process is to correct gaps and mismatches between genomic sequences and ESTs or between ESTs and ESTs. Redundancy-removing filters out redundant ESTs. After analysis of entropy of ESTs, overlapped ESTs in the same region are reduced by reduction criteria, which are alternatives for ‘Alternative Acceptor Site’, ‘Cassette Exon’, ‘Alternative Donor Site’, and ‘Retained Intron’ patterns (Nurtdinov et al., 2003).

We can find ESTs rapidly and inexpensively by computer scanning (NCBI), and access them by scanning EST libraries. However, reliance on ESTs in alternative splicing studies has limitations. If no ESTs are present for some genes, then we cannot know whether alternative splicing exists or not within those genes (Kan et al., 2001). In addition, ESTs in most EST libraries are of relatively low quality (Iseli et al., 1999; Jongeneel, 2000; Wang and Brendel, 2006). There can be redundant and many erroneous sequences since EST libraries are often not clearly organised and annotated (Collins et al., 2003; Jongeneel, 2000; Sorek et al., 2004). Some libraries use a combination of ESTs and experimental validation to improve the quality of data (e.g., TIGR web-based library for alternative splicing of Arabidopsis thaliana (Pertea et al., 2001). Even though ESTs are useful for studies of alternative splicing, it is quite difficult to distinguish between true alternative splicing and aberrant splicing due to the reasons noted above (Chuang et al., 2004). The GenBank database somewhat improves the contaminant problems of ESTs, but there are still numerous errors and the issues of quality ESTs

42 M. Park et al.

remain (Jongeneel, 2000). Therefore, the EST approach can have limitations for detection of spliced sites (Collins et al., 2003; Nurtdinov et al., 2003).

2.2 cDNA microarray approach

A cDNA microarray is based on a base-pairing principle that allows probes to hybridise to cDNA’s fixed to a solid support (the microarray). The basic process of a cDNA microarray is as follows (Raghunandan et al., 2003):

• Extraction: RNA (target) is extracted from living cells.

• Labelling: The target is multiplied and labelled by incorporating fluorescence dyes.

• Hybridisation: Targets are added to the microarray to base-pair with the cDNA probes in the microarray. Targets that are hybridised to cDNA probes of the microarray are detected by their fluorescence.

• Scanning: A microarray scanner detects the fluorescence and inputs to a computer. The position of fluorescent probes identifies the genes that were expressed in the living cell.

Scientists also use an oligonucleotide microarray to detect alternative splicing. The AffymetrixGeneChipTM is the most widely used oligonucleotide microarray (Mehta, 2002). It works similarly to a cDNA microarray. However, it uses several short oligonucleotide sequences as probes unlike cDNA microarrays that use a long strand of DNA as a fixed probe. Thus, in an oligonucleotide array, the equivalent of a whole genome can be put on a single microarray. Signals arising from the chip can indicate gene expression levels and detect splice variants (Hu et al., 2001).

Presently, this approach is efficient to detect short splice products. Oligonucleotide microarrays mostly operate in the 3’ region of a gene because of limitations of probe labelling (Hu et al., 2001). To analyse alternative splicing across a whole gene by a microarray, probes should include a greater length of the transcript. Another limitation of the microarray approach is data size. The more data we use, the more splice variants are detected. However, because of the high cost of a microarray experiment large data sets are not as available. Accuracy in microarray analysis depends on consistency and hybridisation of data. Thus, analysis should be repeated to obtain accurate data (Hu et al., 2001; Mehta, 2002). Even though many laboratories use microarrays in their research, the results are often variable. Therefore, once genes are identified through microarray analysis, we must validate the list before deciding to use the data (Mehta, 2002).

2.3 Classification methods in data mining

Several classification approaches have been used for biological data, such as: ‘Nearest Centroid Classifier’, ‘Nearest-Neighbor Classifier’, ‘Discriminant Analysis’, ‘Support Vector Machine’, and ‘Decision Tree’ (Garrett-Mayer and Parmigiani, 2005). ‘Nearest Centroid Classifier’ computes a centroid of sample data in each class and then assigns new sample data to the class whose centroid is the nearest (Garrett-Mayer and Parmigiani, 2005). ‘Nearest-Neighbor Classifier’ is a method that finds the closest data from sample data sets being classified, and assigns new data to the same class as its neighbouring data (Cover and Hart, 1967). ‘Discriminant Analysis’ is a method that


classifies a data set into predefined classes with each having a different classification function (Fisher, 1936). New data is assigned to the class that produces the highest value for the data among all classification functions. A ‘Support Vector Machine’ classifies data using a separating geometric hyperplane in a high-dimensional feature space (Vapnik, 1998). Finally, a ‘Decision Tree’ recursively partitions data into subsets (Breiman et al., 1984), and its result is graphically represented (Garrett-Mayer and Parmigiani (2005)).

‘Nearest Centroid Classifier’, ‘Nearest-Neighbor Classifier’, ‘Discriminant Analysis’ and ‘Support Vector Machines’ are used for geometric data. Even though ‘Discriminant Analysis’ can be used for non-geometrical data, a different classification function is required for each class. We selected a decision tree approach because it is well suited to our problem, particularly because our data is non-geometric.

A decision tree is created through two processes: splitting and pruning. Splitting is the process which classifies data into subtrees using decision criteria. The criteria are defined by attributes. Internal nodes are labelled with attributes, a branch represents an outcome of a criterion, and leaves are labelled with data classification (refer to Figure 1). Pruning is the process of cutting branches.

Figure 1 Decision tree data sets are split by criteria (see online version for colours)

A decision tree has some weaknesses. It may not classify exactly since there can be missing data and overfitting problems (Witten and Frank, 2000; Duda et al., 2001). Overfitting means “fitting too much the training data and the model starts to degrade performance on test data” (Nabhan and Rafea, 2005). The problem occurs when noisy or erroneous data are added to a data set (refer to Figure 2). In this case, the decision tree branches can be unnecessarily long. To solve the problem, a pruning method can be used. In spite of these weaknesses, a decision tree can be useful to classify data (Brand and Gerritsen, 1988; Delisle, 2005; Friedl et al., 1999). Since a decision tree represents the outcome graphically using a graph or tree model (Garrett-Mayer and Parmigiani, 2005), it is easy to use and effective for identifying patterns or relationships of large data sets. Furthermore, since a decision tree can be applied to all data types, it can be used to analyse the structure of complicated proteins and gene expression data in biological studies (Burge and Karlin, 1997; Zhang and Yu, 2002). For splice site prediction, Pertea et al. uses Markov Modeling and a binary decision tree based on non-redundant protein and EST databases in Arabidopsis thaliana and humans. A decision tree can also be used to diagnose medical diseases, and can be applied in non-medical contexts.

44 M. Park et al.

Figure 2 Two data cause overfitting (see online version for colours)

2.4 Conserved consensus in acceptor/donor sites of Arabidopsis thaliana

Conserved consensus sequences, which represent the frequency results of sequences in genome sequences, can provide good evidence to predict the exon-intron boundary in the splice sites (Bonizzoni et al., 2005), and further be a functional motif in biological sciences. Actually, sites spliced incorrectly by weak consensus can lead to complicated junctions of splice sites. Therefore, the consensus analysis contributes to recognising the boundary between exon and intron and can help correct ambiguous sequence alignments (Bonizzoni et al., 2005). For example, Bonizzoni et al. used both sequence consensus and EST alignments to predict the exon-intron structure of a gene. We analyse and detect consensus rates for triplet nucleotides in acceptor/donor sites based on published literature. Almost all introns proceed from ‘gt’ and end at ‘ag’ (Mount et al., 2001; Pertea et al., 2001). In particular, the exon of acceptor sites starts with ‘CAG’, ‘UAG’. In general, splice site selection has similarity at the last three exon nucleotides and the first seven intron nucleotides in the donor sites and has the conserved nucleotides which are from the last four intron nucleotides to one exon nucleotide in the acceptor sites (Mount et al., 2001). In addition, Pertea et al. has considered 16 nucleotides and 29 nucleotides from the boundary between exon and intron as donor sites and acceptor sites in Arabidopsis thaliana. Weak consensus occurs at the branch point, 30 nucleotides upstream from the acceptor sites (Pertea et al., 2001).


3 Detection and prediction of alternative splicing using pre-mRNA

This section describes our proposed method to classify genes within pre-mRNA as alternatively spliced or normally spliced. Section 3.1 introduces the dominant characteristics of our proposed method. Section 3.2 provides a procedure of our method that uses triplet nucleotides for the detection of alternative splicing in pre-mRNA of Arabidopsis thaliana. Key aspects of our decision tree approach are presented in Section 3.3. An algorithm for accurate prediction is presented in Section 3.4. Section 3.5 discusses data collection.

3.1 Characteristics of the proposed method

The objective is to improve the quality of detection and prediction for alternative splicing and normal splicing on pre-mRNA. To meet our objective, we employ the following:

• choose Arabidopsis thaliana as the model organism

• primarily use pre-mRNA instead of ESTs as observational data

• use decision trees as the classification method

• propose a novel algorithm for accurate prediction.

We chose Arabidopsis thaliana as the eukaryotic model organism. Its genome has been completely sequenced, is heavily annotated and is available on the TAIR web site, making gene sequences and their annotations easily accessible. Another reason for focusing on Arabidopsis thaliana is that alternative splicing studies in plants are not as thoroughly investigated as they have been in mammalian systems (Wang and Brendel, 2006). Furthermore, since genomes of plants and mammalian systems are similar, the results of a plant study might yield insights into mammalian systems. The Arabidopsis thaliana genome consists of five chromosomes. We considered 483 genes encoded among the five chromosomes with the exception of special cases as training data (see Section 3.5).

Secondly, we do not use microarrays or directly use ESTs due to the disadvantages mentioned in Sections 2.1 and 2.2. To solve these problems, we propose a method that uses pre-mRNA.

Thirdly, of all the existing classification methods, we select a decision tree as the most appropriate for our needs. Since a decision tree can be a very effective tool for prediction and classification of data (Twa et al., 2003), it helps obtain the best results in deciding whether genes carry alternative splicing or normal splicing signals on the basis of pre-mRNA.

Finally, we propose a new algorithm for accurate prediction. It is achieved by considering both a decision tree level and 20 triplet nucleotides’ consensus positions, which are ten positions grouped by triplet nucleotides prior to and after the boundary between exon and intron inside one splicing site. With the algorithm, we are able to judge how accurate the predicted result is, which then helps in the accurate detection and prediction of whether or not a gene carries alternatively spliced sites.

46 M. Park et al.

3.2 Procedure for detection and prediction

In this section, we provide a high-level overview of how patterns for alternative splicing and normal splicing are obtained for each chromosome in Arabidopsis thaliana.

• Detection of pre-mRNA and acceptor/donor sites. Pre-mRNA sequences are obtained from TAIR. We can detect acceptor/donor sites by recognising the borders between exons and introns in each pre-mRNA. We ignore the first start position of the first exon and the stop position of the terminal exon. This is because alternative splicing occurs mostly at the boundary between introns and exons. The first start position lacks introns on its left-hand (5’) side, but has a 5’ UTR (Untranslated Region). The stop position also lacks introns on its right-hand (3’) side, but contains a 3’ UTR.

• Length of classification region. Based on the conserved consensus mentioned in Section 2.4, we use 30 nucleotides prior to and after the boundary between each intron and exon.

• Conversion of pre-mRNA to triplet nucleotides. Codons (triplet nucleotides) specify amino acids and are the basis by which the exons within a mRNA sequence are converted to the amino acid sequence of protein during translation. Thus, in the 30-base region of each exon, we convert each triplet nucleotide into a codon and choose amino acid labels as exon attributes. In the intronic region, we experiment with groups of bases of variable size: one, two, three, and four bases. Another experiment ignores intronic regions. For notational ease we refer to each intron base group as a triplet nucleotide. Thus, we use triplet nucleotides in the analysis of both the exons and introns.

• Analysis of acceptor/donor sites with triplet nucleotides. This study assumes that each chromosome has a different pattern in normal splicing and alternative splicing within acceptor/donor sites. After investigating codons converted by grouped nucleotides in each chromosome, we determined that each chromosome consists of different triplet nucleotide. Thus, a different decision tree is made for each chromosome.

• Detection of patterns. Alternative splicing and normal splicing patterns are deduced through the rules from branches of each decision tree. Classification of each sequence as normally or alternatively spliced is based on TIGR outputs.

• Prediction or validation of each splicing. Using patterns from the rules in a decision tree, we can validate known splicing sites or predict unknown splicing sites.

• Evaluation for accuracy of prediction. With a new proposed algorithm based on the triplet nucleotide consensus and a decision tree level to evaluate accuracy of prediction, we can evaluate and judge more accurately whether spliced sites are alternatively or normally spliced.

3.3 Separate-and-conquer binary decision trees

A decision tree needs attributes upon which to split pre-mRNA into alternatively spliced and normally spliced cases. We propose two classification criteria with attributes


(triplet nucleotides) for acceptor/donor sites in Arabidopsis thaliana that are applicable to all of its chromosomes. The criteria are as follows:

• Classification Criterion 1 (CC1) Do acceptor/donor sites have some triplet nucleotides that only exist in normal splicing?

• Classification Criterion 2 (CC2) Do acceptor/donor sites have some triplet nucleotides that only exist in alternative splicing?

We apply these criteria sequentially (and sometimes alternatingly) in each of our pre-mRNA binary decision trees (see Section 4.2).

We also propose using the ‘separate-and-conquer binary decision tree’. This is a specialisation of the divide-and-conquer decision tree because separate-and-conquer rules are applied. A separate-and-conquer rule identifies a classifier covering many instances in the class, and separates the covered instances (Witten and Frank, 2000). In the separate-and-conquer decision tree, the ‘separate’ process leads to one leaf in each step and then separates the leaf from a decision tree. Therefore, the separate-and-conquer decision tree can increase the efficiency of classification.

In our separate-and-conquer binary decision tree, splicing sites satisfied with one classification criterion are separated and filtered out from the original acceptor/donor sites. The tree structure assists in finding a target value faster because the amount of data that is considered in subsequent steps is reduced. The decision tree process repeats on the remaining acceptor/donor sites until most of the pre-mRNA are classified as alternative or normal splicing. The goal is to classify data well while avoiding overfitting.

3.4 Evaluating prediction accuracy

This section describes a new algorithm developed to judge the accuracy of prediction results. The objective is to evaluate accuracy of prediction for splice sites with consensus of triplet nucleotide positions and a score according to a decision tree level. The twenty consensus positions for triplet nucleotides in each acceptor/donor site are used together with the decision tree level, which is associated with classification of spliced sites. That is why we consider ten triplet nucleotides prior to and after the boundary between exons and introns and use the separate-and-conquer binary decision tree algorithm as a classification method.

The lower the level in a decision tree is associated with detection, the higher the chance of encountering an error. Thus, we assign low values to acceptor/donor sites that are classified in the low decision tree levels. Through this method, we can compensate for a weak point such as an overfitting problem in the decision tree.

Values for triplet nucleotide positions inside one acceptor/donor site are decided with the consensus rates (see Section 2.4) of triplet nucleotides since conserved consensus can become further evidence for splice junctions (Bonizzoni et al., 2005). The consensus rates are based on published work (Pertea et al., 2001; Mount et al., 2001).

Consider ‘Prediction Accuracy (P)’ in acceptor/donor sites. Let Z be the set of values which are assigned according to the consensus (refer to Figures 3 and 4). The consensus is divided into four levels based on the published literature (Pertea et al., 2001; Mount et al., 2001), so that

48 M. Park et al.

1 2 3 43 2 11, , ,4 4 4

Z Z Z Z Z = = = = =

(1)

where Z1 indicates the strongest consensus and Z4 indicates the weakest consensus. Let W be the set of values associated with a decision tree level in each chromosome,

so that

1 2 3 11 2 11, , , , t

t tW W W W Wt t t−− − = = = = =

… (2)

where t indicates the number of total levels in each chromosome’s decision tree. It can be different according to a chromosome since each chromosome can have different decision tree levels (see Section 4.2).

Let f(x) be a function for triplet nucleotide consensus in intronic regions in one acceptor site. Let x(1 ≤ x ≤ 10) indicate the triplet nucleotide position inside intronic regions with a triplet nucleotide separated from the decision tree, so that

4

3

1

6 10, ( )2 5, ( )

1, ( ) .

If x then f x Zelse if x then f x Zelse if x then f x Z

≤ ≤ =≤ ≤ == =

(3)

Let g(y) be a function for triplet nucleotide consensus in exon regions of one acceptor site. Let y(1 ≤ y ≤ 10) indicate the triplet nucleotide position inside exon regions with a triplet nucleotide separated from the decision tree, so that

4

2

1

6 10, ( )2 5, ( )

1, ( ) .

If y then g y Zelse if y then g y Zelse if y then g y Z

≤ ≤ =≤ ≤ == =

(4)

The prediction accuracy (P) is as follows:

_ _k k intron k exonP P P= + (5)

at 2 ≤ k ≤ the number of exons in one gene, where 10

_1

( , )k intron nx

P W f x n=

= ∑ (6)

10

_1

( , )k exon ny

P W g y n=

= ∑ (7)

where f(x, n) is f(x) at the decision tree level n, g(y, n) is also g(y) at the decision tree level n, Wn is the value inside the set W at the decision tree level n, k indicates the order of acceptor sites inside one gene, and n indicates the tree level where a particular triplet nucleotide is separated from a decision tree.

The Pk can be applied to splice sites of all chromosomes by only changing t in equation (2). However, in donor sites, k is changed to (1 ≤ k ≤ (the number of exons in one gene – 1)) in equation (5) since k indicates the order of donor sites. Also, f(x) and g(x) should also be applied based on equations (8) and (9).


4

2

1

4 10, ( )2 3, ( )

1, ( )

If x then f x Zelse if x then f x Zelse if x then f x Z

≤ ≤ =≤ ≤ == =

(8)

4

3

2

4 10, ( )2 3, ( )

1, ( ) .

If y then g y Zelse if y then g y Zelse if y then g y Z

≤ ≤ =≤ ≤ == =

(9)

The higher the value of P, the higher the chance of predicting accurately spliced sites. Therefore, P can help predict accurately whether spliced sites are normally or alternatively spliced within acceptor/donor sites.

Figure 3 A consensus inside one acceptor site. The red colour indicates ‘High conserved consensus’, light red indicates ‘Conserved consensus’, green indicates ‘Consensus’, and purple indicates ‘Weak consensus’. This figure illustrates Z in Formula (3) and (4) (see online version for colours)

Figure 4 A consensus inside one donor site. The red colour indicates ‘High conserved consensus’, light red indicates ‘Conserved consensus’, green indicates ‘Consensus’, and purple indicates ‘Weak consensus’. This figure illustrates Z in Formula (8) and (9) (see online version for colours)

3.5 Data collection

The Arabidopsis thaliana genomic DNA database used for evaluating our method was constructed by TAIR. The web database provides genomic and molecular biological data for Arabidopsis thaliana. Published gene sequences were used for training and testing our algorithm. We also collected alternative splicing information for Arabidopsis thaliana from the TIGR web site. Although the TIGR data are based on ESTs, they were created using a combination of methods in order to achieve a high degree of confidence in assigning alternatively spliced genes. The site lists 547 genes’ alternative splicing data in its ‘Alternative Donors and/or Acceptors’ section. The data were used as observational data for building our decision trees. Table 1 shows the number of genes and acceptor/donor sites considered in our decision tree algorithms. We exclude some genes having special cases to avoid overfitting. The four special cases are as follows. First, some genes have only one exon. In the second case, the first exon is comprised of too

50 M. Park et al.

few, only consisting of ATG or Agt. In case three, genomic DNA sequences have been translated into UTR in the middle of the last or the first exon. The fourth case arises when it is difficult to decide whether acceptor/donor sites represent alternatively spliced or normally spliced sites since genome sequences in TAIR and TIGR are different.

Table 1 The number of Arabidopsis thaliana alternatively spliced genes published in TIGR employed in this study. The data is used as training data in our study

Chromosome Genes Used genes Used acceptor sites Used donor sites

At1g 119 107 838 838 At2g 99 86 533 533 At3g 96 87 612 612 At4g 94 85 557 557 At5g 139 118 931 945

In addition, we evaluate acceptor/donor sites published in the ASIP database (refer to Table 2). The ASIP has published many new sequences including all TIGR genes with the exception of 55 genes (Wang and Brendel, 2006). We use them as testing data with the TIGR data. Two experimental genes (At1g27450 and At1g30460) in our laboratory that are unlabelled in other systems are also used for validating prediction.

Table 2 The number of Arabidopsis thaliana alternatively spliced genes published in ASIP. The data is used as testing data in our study

Chromosome Tested genes for

acceptor sites Tested genes for

donor sites Tested acceptor

sites Tested donor

sites

At1g 269 145 1963 1062

At2g 201 85 1488 659

At3g 134 69 1001 812

At4g 176 68 1213 485

At5g 265 107 1937 758

4 Results

Basing our decision trees on classification criteria CC1 and CC2 for triplet nucleotides interpreted as amino acids produced very positive results. We found that the best size of an intron nucleotide group is three; this is discussed in Section 4.1. Our decision trees are described in Section 4.2. Our new method validates the data that have already been labelled as alternatively spliced or normally spliced sites on the ‘Alternative Donors and/or Acceptors’ section of TIGR and ASIP; this is described in Section 4.3. Also, to evaluate the prediction of splicing sites, we ran tests for experimental data obtained from our laboratory; see Section 4.4. Finally, in order to assess the accuracy of prediction for splice sites, we applied our new algorithm mentioned in Section 3.4.


4.1 Nucleotide grouping in intronic regions

In order to define the attributes that influence alternative splicing in the intron sequences, we first fixed triplet nucleotides in the exon and then ran tests with nucleotide groups of sizes one through four in the intronic region. Also, we experimented without considering introns. The results for one, two, and four group sizes were not as satisfactory as triplet intron grouping. Furthermore, the results of the experiment without considering introns were similar to outputs of one-based groups. Thus, we conclude that a triplet nucleotide is also an important discriminant to detect alternatively spliced sites in the intronic region. Exon and intron regions by triplet nucleotide are numbered 1 through 10, with 1 being closest to the border between an exon and an intron.

4.2 Decision trees

When building our decision trees, we tried several methods. First, we built a decision tree based on nucleotides and consensus in pre-mRNA without distinguishing among chromosomes. Next, we tried to build a different decision tree with the previous method for each chromosome. However, we did not obtain good results. Therefore, a different method was attempted. We picked triplet nucleotides as attributes, repeatedly applied criteria CC1 and CC2 (see Section 3.3) to all intron/exon positions of acceptor/donor sites, and built a binary decision tree for each chromosome. Chromosome At1g’s decision tree had an acceptably small number of levels and leaves: 6 levels/7 leaves in acceptor sites. The decision trees for the other chromosomes were also not excessively high and did not have many leaves for the acceptor/donor sites (refer to Table 3 and Figure 5). Figure 5 shows chromosome At2g’s decision tree. The first application of CC1 identified 426 normally spliced cases using amino acid codons such as threonine at intron position two (6 nucleotides upstream of the intron/exon border). The first application of CC2 which is the criterion of the second level of the decision tree revealed 37 alternatively spliced cases with amino acid codons such as lysine at exon position two. The second application of CC1, which is the third decision tree level, showed 61 normally spliced cases using amino acid codons such as lysine at intron position one. The sequence of CC1, CC2 and CC1 was sufficient to classify all the data for chromosome At2g.

Figure 5 Chromosome At2g decision tree has three levels in acceptor sites (see online version for colours)

52 M. Park et al.

Table 3 The number of decision tree levels and leaves for each chromosome for acceptor and donor sites in Arabidopsis thaliana

Chromosome (levels/leaves) At1g At2g At3g At4g At5g

Acceptor site 6/7 3/4 3/4 5/6 11/12 Donor site 2/3 2/3 2/3 1/2 2/3

The acceptor site found in chromosome At5g is a particularly challenging case. Criteria CC1 and CC2 failed after the third and sixth tree levels. Thus, we applied a modified criterion at those two levels consisting of three steps:

1 Selection. We selected the codon position that contained the most frequent amino acid codon in all pre-mRNA after the application of CC1 and CC2 at previous decision tree levels. This approach was inspired by the idea of using consensus in (Pertea et al., 2001).

2 Filtering. We filtered observation data based on the most frequent amino acid codon in the position selected in step 1.

3 Comparison. After filtering, alternatively and normally spliced sites were compared based on amino acid codon types at position k0, chosen as follows. Let m(i, k) = the number of amino acid codon types considered in the criterion at level i ≥ 2 for position k, and

0 [ 10, 10]

( , ) ( , ) ( 1, )arg max ( , ).

k intron exon

h i k m i k m i kk h i k

=

= + −=

Due to the two problematic levels, the chromosome At5g decision tree has more levels (11 levels) than the other four chromosomes’ decision trees. However, sensitivity analysis indicated that we should not use a pruning method (see Section 2.3) for reducing the height of chromosome At5g’s decision tree. In addition, the result for chromosome At5g is appropriate since more splice sites exist than for the other chromosomes.

As we know, the higher the height of a binary tree is, the higher the chance of introducing an error. However, the error probability is close to zero in chromosome At5g’s decision tree of acceptor sites (see Section 4.3). Therefore, the results show that our attributes and criteria are well-defined in the decision trees.

4.3 Detection of Splice Sites on the Arabidopsis thaliana genome

The accuracy of our patterns was computed on all five chromosomes used in our decision trees. In order to validate our proposed separate-and-conquer binary decision tree algorithms, we compared our results with labelled data in TIGR and ASIP (see Section 3.5). We then computed the ‘Correct Classification Rate’ to measure classification accuracy by TIGR database and ASIP database outputs for our patterns. The correct classification indicates the ratio of incorrectly predicted or classified splice sites among total splice sites. For the comparison with the output of training data TIGR, the correct classification showed 100%, with the exception of donor sites for chromosome At2g, with produced 99.8% correct classification. For the ASIP testing output, acceptor sites produced around 85% for the correct classification: chromosome


At1g: 86.6%; chromosome At2g: 85.2%; chromosome At3g: 84.8%; chromosome At4g: 83.8%; chromosome At5g: 87.0%. Donor sites resulted in around 90%: chromosome At1g: 87.4%; chromosome At2g: 94.4%; chromosome At3g: 90.8%; chromosome At4g: 90.1%; chromosome At5g: 89.2%.

We also validated our method with blended testing. The blended testing is proceeded as follows: we divided data into t groups (t folds) according to the gene number label in each chromosome. The t – 1 partitions are reserved for training data (TIGR), and testing data (ASIP) are allotted to the remainder in turn. The procedure repeats t times in turn, and then the results are averaged. For example, the acceptor sites of chromosome At4g have four folds by the gene number label. Thus, training data (TIGR) are reserved for three folds and testing data (ASIP) are allotted to one fold in turn. The procedure is repeated four times and the classification rates are then averaged (refer to Figure 6). The four-fold blended testing for chromosome At4g has 93.4% correct classification rate.

The other chromosomes’ results are shown in Table 4. Therefore, the results show that our method is very positive.

Figure 6 Blended testing for chromosome At4g acceptor sites repeated four times. The number inside ( ) indicates the number of acceptor sites (see online version for colours)

Table 4 The correct classification rate for the blended testing based on TIGR and ASIP data

Chromosome No. of folds Correct classification rate (%) Acceptor Nine-fold 96.8 At1g Donor Nine-fold 98.4 Acceptor Five-fold 93.6 At2g Donor Five-fold 97.3 Acceptor Seven-fold 95.9 At3g Donor Seven-fold 98.3 Acceptor Four-fold 93.4 At4g Donor Four-fold 97.8 Acceptor Seven-fold 96.7 At5g Donor Seven-fold 98.7

4.4 Prediction of splice sites in Arabidopsis thaliana

The method we propose can predict whether genes that are unlabelled on other systems have alternatively spliced sites or not. It can also detect genomic DNA nucleotides that are associated with alternative splicing. Here we discuss experimental genes ‘At1g27450’ and ‘At1g30460’ from our laboratory.

54 M. Park et al.

The acceptor site in the genetic sequence of ‘At1g27450’ has one alternatively spliced site in the 4th exon predicted by the patterns from our decision tree algorithms. The triplet nucleotide ‘tgg’, positioned 12 to 10 bases away in the backward direction from the border between exon and intron, specifies tryptophan by the genetic code. The tryptophan codon can be associated with alternative splicing (refer to Figure 7). Another triplet nucleotide, ‘aat’, positioned 28–30 bases away in the backward direction from the border between exon and intron (position intron 10) in the 2nd exon, is classified as alternative splicing by experimental results in our laboratory, yet it is classified as normal splicing by our decision tree (refer to Figure 7 and Section 3.2). This discrepancy can be explained by the fact that the intron 10 position is in an area of weak consensus. ‘At1g27450’ has no alternatively spliced sites in donor sites. Another experimentally examined gene, ‘At1g30460’, has no alternatively spliced sites in any acceptor site. All acceptor sites of the gene are classified as normally spliced sites with both the patterns of our decision trees and biological experiments. The results for donor sites also correspond to the experimental results from our laboratory. ‘At1g30460’ indicates alternative splicing in the donor site of the 2nd exon (refer to Figure 8). It is detected by three patterns (refer to Figure 9). These patterns are considered in the first level of chromosome At1g’s decision tree with CC1.

Figure 7 The triplet nucleotides ‘aat’ and ‘tgg’ can bring about alternative splicing in the acceptor site of the 2nd exon and the 4th exon in the At1g 27450 gene, respectively. The red colour indicates UTR, yellow indicates exon, and purple indicates intron (see online version for colours)

Figure 8 The triplet nucleotides ‘GCC’, ‘AGG’, ‘gag’, and ‘ggt’ are detected by three patterns. The yellow colour indicates exon, purple indicates intron, and black indicates intergenic region of gene At1g30460 (see online version for colours)


Figure 9 The three patterns for alternative splicing in donor sites of gene ‘At1g30460’ of chromosome At1g. Triplet Nucleotides A, R, E, and G are abbreviated by the genetic code (see online version for colours)

4.5 Evaluation of classification accuracy

In order to evaluate the accuracy of prediction for the splice sites that we annotate, here we discuss the result (see Section 4.4) for one acceptor site of gene ‘At1g27450’ which shows a discrepancy with experimental data in our laboratory and the results (see Section 3.4) that are different from ASIP output. With the score for ‘Prediction Accuracy (P)’ mentioned in Section 3.4, we evaluate the accuracy of prediction. Clearly, a high score for P can lead to more exact prediction. In the second exon’s acceptor site of our experimental gene ‘At1g27450’, an alternatively spliced site is detected in the tenth triplet nucleotide position of intronic regions and at the fifth level in the chromosome At1g’s decision tree. Thus, it leads to a very small score, 0.083, for P. Other results from a comparison with ASIP output for two factors associated with P score appear in Tables 5 and 6. As we mentioned in Section 3.4, the closer to the boundary between exon and intron and the root node of a binary decision tree, the higher is P. Through the P score, we can predict and judge more accurately whether it is alternatively or normally spliced, and then molecular biologists can reduce time and more precisely perform experiments since genes can be predicted as having alternatively spliced sites prior to experiments.

Table 5 The results for consensus positions. The percentages indicate the ratio of the number of splice sites detected in each consensus position among total splice sites unmatched with ASIP output. Consensus is defined in Sections 2.4 and 3.4

Chromosome High conserved consensus (%)

Conserved consensus (%)

Consensus (%)

Weak consensus (%)

Acceptor 0.4 11.6 50.9 37.1 At1g Donor 0.0 45.9 16.5 37.6 Acceptor 0.5 15.1 63.0 21.0 At2g Donor 0.0 50.0 21.6 28.4 Acceptor 0.5 2.8 65.7 28.6 At3g Donor 0.0 64.4 48.9 51.1 Acceptor 0.0 8.4 50.8 39.3 At4g Donor 0.0 72.9 20.8 6.3 Acceptor 0.4 9.7 49.6 40.3 At5g Donor 0.0 43.9 23.2 32.9

56 M. Park et al.

Table 6 The results for decision tree levels. The percentages indicate the ratio of the number of splice sites detected in a low level decision tree among total splice sites unmatched with ASIP output. In donor sites, a decision tree level cannot be an important factor for the classification since all their trees are small (refer to Table 3)

Chromosome Low decision tree levels with the exception of first and

second levels (%)

At1g Acceptor 33.5 At2g Acceptor 33.8 At3g Acceptor 5.2 At4g Acceptor 44.5 At5g Acceptor 48.4

5 Conclusion and future work

In the study, we designed separate-and-conquer binary decision trees based on pre-mRNA of Arabidopsis thaliana for the analysis of splice sites and created a novel algorithm for accurate prediction by combining two factors, which are triplet nucleotide consensus and decision tree levels. The method we propose is validated through the comparison with TIGR and ASIP outputs and unlabelled experimental genes in our laboratory. In particular, the method we propose has significant advantages by not using ESTs directly. The method can therefore compensate for the issue of EST quality. Also, it avoids the problems of detection of rare EST transcripts due to low expression levels and cells where ESTs are not available. Previous studies did not include detailed analysis on the position and outcome of alternative splicing (Wang and Brendel, 2006). However, the method we propose produces the positions associated with alternatively or normally spliced sites. Furthermore, using a separate-and-conquer binary decision tree algorithm, pre-mRNA sequences can be classified simply as alternatively and normally spliced sites without high computing cost. Finally, this method can predict splicing sites even for unknown genes or unpublished splice sites in other organisms.

In the future, this method can be applied to detect and predict alternative splicing not only in acceptor/donor sites but also for other alternative splicing patterns. If the patterns of splicing sites from our decision trees are more accurately identified and supported through laboratory experiments, the method may be applied to the study of human genes since plant and mammalian genomes have many common features. The method might therefore be useful in predicting and treating human diseases.

Acknowledgements

Thanks to Dr. Kil-Young Yun for helpful discussions on general biological sciences.


References Black, D. (2003) ‘Mechanisms of alternative pre-messengerRNA splicing’, Annual Review of

Biochemistry Journal, Vol. 72, pp.291–336. Bonizzoni, P., Rizzi, R. and Pesole, G. (2005) ‘ASPIC: a novel method to predict the exon-intron

structure of a gene that is optimally compatible to a set of transcript sequences’, BMC Genomics, Vol. 6, p.244.

Brand, E. and Gerritsen, R. (1988) ‘Decision trees’, Journal DBMS Online Journal, available at http://www.dbmsmag.com/9807m05.html

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Wadsworth International Group, Belmont, CA.

Burge, C. and Karlin, S. (1997) ‘Prediction of complete gene structures in human genomic DNA’, Journal of Molecular Biology, Vol. 268, pp.78–94.

Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M. and Buell, C.R. (2006) ‘Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis’, BMC Genomics, Vol. 7, p.327.

Chuang, T.J., Chen, F.C. and Chou, M.Y. (2004) ‘A comparative method for identification of gene structures and alternatively spliced variant’, Bioinformatics, Vol. 20, pp.3064–3079.

Collins, J.E., Goward, M.E., Cole, C.G., Smink, L.J., Huckle, E.J., Knowles, S., Bye, J.M., Beare, D.M. and Dunham, I. (2003) ‘Reevaluating human gene annotation: a second-generation analysis of chromosome 22’, Genome Research, Vol. 13, pp.27–36.

Cover, T.M. and Hart, P. (1967) ‘Nearest neighbor pattern classification’, Proceedings of IEEE Transaction on Information Theory, pp.21–27.

Delisle, K. (2005) Decision Trees and Evolutionary Programming, Artificial Intelligence Depot., http://ai-depot.com/Tutorial/DecisionTrees.html

Duda, R.O., Peter, E.H. and Stork, D.G. (2001) Pattern Clssification, A Wiley-Interaction Publication John Wiley & Sons, Inc. New York, NY.

Fisher, R.A. (1936) ‘The use of multiple measurements in taxonomic problems’, Annals of Eugenics, Vol. 7, pp.178–188.

Friedl, M.A., Brodley, C.E. and Strahler, A.H. (1999) ‘Maximizing land cover classification accuracies produced by decision trees at continental to global scales’, Proceedings of IEEE Transaction on Geoscience and Remote Sensing, pp.969–977.

Garrett-Mayer, E. and Parmigiani, G. (2005) ‘Clustering and classification methods for gene expression data analysis’, in Nuber, U. (Ed.): In DNA Microarrays (Advanced Methods), Taylor and Francis Group, New York.

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr., R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., Salzberg, S. and White, O. (2003) ‘Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies’, Nucleic Acids Research, Vol. 31, pp.5654–5666.

Hu, G.K., Madore, S.J., Moldever, B., Jatkoe, T., Balaban, D., Thomas, J. and Want, Y. (2001) ‘Predicting splice variant from DNA chip expression data’, Genome Research, Vol. 11, pp.1237–1245.

Iida, K., Seki, M., Sakurai, T., Satou, M., Akiyama, K., Toyoda, T., Konagaya, A. and Shinozaki, K. (2004) ‘Genome-wide analysis of alternative premRNA splicing in Arabidopsis Thaliana based on full-length cDNA sequences’, Nucleic Acids Research, Vol. 32, pp.5096–5103.

Iseli, C., Jongeneel, V. and Bucher, P. (1999) ‘ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences’, Proceedings of the Seventh ISMB, pp.138–148.

Jongeneel, C.V. (2000) ‘Searching the expressed sequence tag (EST) databases: panning for genes’, Corporate Environmental Strategy, Vol. 8, No. 3, pp.217–222.

58 M. Park et al.

Kan, Z., Rouchka, E.C., Gish, W.R. and States, D.K. (2001) ‘Gene structure prediction and alternative splicing analysis using genomically aligned ESTs’, Genome Research, Vol. 11, pp.889–900.

Mehta, S. (2002) DNA Microarrays in Health Care & Drug Discovery, http://plasticdog.cheme. columbia.edu/

Mount, S.M., Salzberg, S. and Chang, C. (2001) ‘Pre-mRNA splicing signals in Arabidopsis’, NSF-01-162 Project Description, http://www.life.umd.edu/labs/mount/20-10-splicing/ Grant.pdf

Nabhan, A.R. and Rafea, A.A. (2005) ‘Tuning statistical machine translation parameters using perplexity’, Proceedings of the 2005 IEEE International Conference on Information Reuse and Integration, pp.338–343.

Nurtdinov, R.N., Artamonova, I.I., Mironov, A.A. and Gelfand, M.S. (2003) ‘Low conservation of alternative splicing patterns in the human and mouse genomes’, Human Molecular Genetic, Vol. 12, pp.1313–1320.

Park, M., Falcone, D.L., Yun, K. and Daniels, K.M. (2007) ‘Detection and prediction of alternative splicing within acceptor/donor sites in pre-mRNA of Arabidopsis thaliana’, Proceedings of IEEE 7th International Conference on BioInformatics and BioEngineering, pp.180–186.

Pertea, M., Lin, X. and Salzberg, S.L. (2001) ‘GeneSplicer: a new computational method for splice site prediction’, Nucleic Acids Research, Vol. 29, pp.1185–1190.

Raghunandan, D.S., Guglielmo, L., Darja, K. and Animesh, A.S. (2003) ‘Clinical applications of DNA microarray analysis’, Journal of Experimental Therapeutics and Oncology, Vol. 3, pp.297–304.

Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G. and Shamir, R. (2004) ‘A non-EST-based method for Exon-Skipping prediction’, Genome Research, Vol. 14, pp.1617–1623.

Stamm, S., Riethoven, J.J., Le Texier, V., Gopalakrishnan, C., Kumanduri, V., Tang, Y., Barbosa-Morais, N.L. and Thanaraj, T.A. (2006) ‘ASD: a bioinformatics resource on alternative splicing’, Nucleic Acids Research, Vol. 34, pp.D546–D555.

Twa, M.D., Parthasarathy, S., Raasch, T.W. and Bullimore, M.A. (2003) ‘Decision tree classification of spatial data patterns from VideokeratographyUsing Zernike polynomials’, Proceedings of the Third SIAM International Conference on Data Mining, pp.1–10.

Vapnik, V. (1998) Statistical Learning Theory, John Wiley & Sons, New York, NY. Wang, B.B. and Brendel, V. (2006) ‘Genomewide comparative analysis of alternative splicing in

plants’, Proceedings of the National Academy of Science of the United States of America, pp.7175–7180.

Witten, I.H. and Frank, E. (2000) Tree-based Analysis of Microarray Data for Classifying Breast Cancer, Academic Press, San Diego, CA.

Zhang, H.P. and Yu, C.Y. (2002) ‘Corporate social responsibility and the identification of stakeholders’, Frontiers in Bioscience, Vol. 7, pp.C63–C67.

Zhu,W., Schlueter, S.D. and Brendel, V. (2003) ‘Refined annotation of the Arabidopsis Thaliana genome by complete EST mapping’, Plant Physiology, Vol. 132, pp.469–484.

Websites Alternative Splicing in Plants, http://www.plantgdb.org/ASIP/ National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov The Arabidopsis Information Resource, http://www.arabidopsis.org The Institute for Genomic Research, http://www.tigr.org/

detection and prediction of alternative splicing in...

Documents