a markov classification model for metabolic pathways

RESEARCH Open Access

A markov classification model for metabolicpathwaysTimothy Hancock*, Hiroshi Mamitsuka

Abstract

Background: This paper considers the problem of identifying pathways through metabolic networks that relate toa specific biological response. Our proposed model, HME3M, first identifies frequently traversed network pathsusing a Markov mixture model. Then by employing a hierarchical mixture of experts, separate classifiers are builtusing information specific to each path and combined into an ensemble prediction for the response.

Results: We compared the performance of HME3M with logistic regression and support vector machines (SVM) forboth simulated pathways and on two metabolic networks, glycolysis and the pentose phosphate pathway forArabidopsis thaliana. We use AltGenExpress microarray data and focus on the pathway differences in thedevelopmental stages and stress responses of Arabidopsis. The results clearly show that HME3M outperformed thecomparison methods in the presence of increasing network complexity and pathway noise. Furthermore ananalysis of the paths identified by HME3M for each metabolic network confirmed known biological responses ofArabidopsis.

Conclusions: This paper clearly shows HME3M to be an accurate and robust method for classifying metabolicpathways. HME3M is shown to outperform all comparison methods and further is capable of identifying knownbiologically active pathways within microarray data.

BackgroundNetworks are a natural way of understanding complexprocesses involving interactions between many variables.Visualizing a process as a network allows the researcherto form an intuitive understanding of complex phenom-ena. A clear example of the effective use of networks isthe visualization of metabolic networks to provide adetailed map of key chemical reactions and their geneticdependencies that occur within a cell. However the sizeand complexity of metabolic networks has increased tothe point where the ability to understand the entire net-work is lost. Researchers must now rely on models ofthe network structure to capture the key functionalcomponents that relate to an observed response. In thispaper we propose a model capable of identifying the keypathways through metabolic networks that are related toa specific biological response.Metabolic networks, as described in databases such as

KEGG [1], can be represented as directed graphs, with

the vertices denoting the compounds and the edgeslabeled by the reactions. The reactions within metabolicnetworks are catalyzed by specific genes. If a gene isactive, then it is possible for the corresponding reactionto occur. If a reaction is active then a pathway is createdbetween two metabolic compounds that is labeled bythe gene that catalyzed the reaction. Information aboutthe activity of genes within metabolic networks can bereadily obtained from microarray experiments. Microar-ray experiments are then used to view differences ingene activity under varying experimental conditionssuch as (y = 1) patients treated with drug A and (y = 2)patients treated with drug B. The question asked bysuch experiments is: are there any gene pathways thatare differentially expressed when patients are given drugA or B? The abundance of publicly available microarrayexpression observations found in databases such asArrayExpress [2] along with the detailed biologicalknowledge contained within pathway databases likeKEGG, has spurred biologists to want to combine thesetwo sources of information and model the metabolic* Correspondence: [email protected]

Bioinformatics Center, Institute for Chemical Research, Kyoto University,Japan

Hancock and Mamitsuka Algorithms for Molecular Biology 2010, 5:10http://www.almob.org/content/5/1/10

© 2010 Hancock and Mamitsuka; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of theCreative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly cited.

mailto:[email protected]

http://creativecommons.org/licenses/by/2.0

network dynamics under different experimentalconditions.This paper proposes a novel classification model for

identifying frequently observed paths within a specifiednetwork structure that can be used to classify knownresponse classes. Our proposed model is a probabilisticcombination of a Markov mixture model which identi-fies frequently observed pathway clusters and an ensem-ble of supervised techniques each trained locally withineach pathway cluster to classify the response. Werequire the prior specification of the metabolic network,gene expression data and response variable that labelsthe experimental conditions of interest.To construct our model we consider the network to

be a directed graph and pathways through the networkto be binary strings. For example there are 4 possiblepaths between nodes A and D in the network describedin Figure 1. In Figure 1 the binary representation of thepath between A and D that traverses edges [1,3,4] is [1,0,1, 1, 0]. If we interpret Figure 1 to be a metabolic net-work where the edges are the genes and the nodes arethe compounds, then which paths are taken at any giventime can be seen to be dependent on the activity of spe-cific genes. If a gene is active, then it is possible to pro-ceed along that edge within the network. In ourexperiments we extract all valid pathways from eachmicroarray experiment that are observed between pre-specified start and end compounds. To do this we treateach microarray experiment, xi as a single observationof the activity of all genes within a network. For each xiwe also have a response label yi denoting the experi-mental conditions. Then defining an active edge to bean over-expressed gene observation within xi we extractall possible paths from the start node to the end nodeand label each path with yi. The resulting pathway data-set then consists of N observed paths from each micro-array experiment each with a response label indicatingthe observed experimental group. Common bioinfor-matics solutions to this problem include using datamining techniques to classify the response based on thegene expression information and then overlay the find-ing on the metabolic pathway [3]. Although thisapproach can classify the response accurately, they useno knowledge of the network structure. Network struc-tures can be incorporated into standard methods bydefining an appropriate similarity measure betweensequences and then employ a kernel technique, such asSupport Vector Machines (SVM) [4] to classify theresponse. However, the specification of a similarity mea-sure or kernel removes any ability to observe individualpathways and determine if the model identifies a

meaningful biological result. An accurate classifier withthe capability to extract the dominant pathways isrequired for a complete solution.Graphical methods such as Bayesian networks present

a framework capable of modeling a network structureimposed upon a dataset [5]. Bayesian networks searchfor the most likely network configuration by drawingedges connecting dependent variables. However, whenconsidering mining the dominant paths within a knownnetwork such an approach may not be the most directsolution. For example constructing a Bayesian networkof a metabolic pathway will join related genes by assum-ing a conditional dependence between each gene and itsparent genes within the network. This dependency isvalid when considering problems concerning the predic-tion of unknown structure [6,7] though may be inap-propriate for the prediction of frequently observed pathsthrough a known network structure. To predict fre-quently observed paths, a more natural assumption isaccommodated by Markov methods which assume thatthe decision on the next step taken along a path onlyrequires information on the current and next set ofgenes within the network.Hidden Markov Models (HMM) are commonly used

for identifying structure within sequence information[8]. HMMs assume that the nodes of the network areunknown and the observed sequences are a direct resultof transition between these hidden states. However, ifthe network structure is known, a more direct approachis available through a mixture of Markov chains. Markovmixture models such as 3M [9] directly search for domi-nant pathways within sequence data by assuming eachmixture component is a Markov chain through a knownnetwork structure. For metabolic networks, Markovmixture models, such as 3M, have been shown to pro-vide an accurate and highly interpretable model ofdominant pathways throughout a known network struc-ture. However, both HMM and 3M are unsupervisedmodels and therefore are not able to direct their searchto explicitly uncover pathways that relate to specificexperimental conditions.The creation of a supervised classification technique

that exploits the intuitive nature of Markov mixturemodels would be a powerful interpretable tool for biolo-gists to analyze network pathways. In this paper we pro-pose a supervised version of the 3M model using theHierarchical Mixture of Experts (HME) framework [10].We choose the mixture of experts framework as oursupervised model because it provides a complete prob-abilistic framework for localizing a classification modelto specific clusters within a dataset. Our proposed


of 9

model, called HME3M employs a HME to combine the3M with penalized logistic regressions classifiers as theexperts within each cluster to classify the response.

ExperimentsOur problem has the following inputs: the networkstructure, microarray observations and a response vari-able. A pathway through the network, xi, is assumed tobe a binary vector, where a 1 indicates a traversed edgeand 0 represents a non-traversed edge. The decision onwhich edges can be traversed is made for each microar-ray observation based on the expression of each gene.Once the set of valid edges have been defined, for eachmicroarray observation all valid pathways are extracted.After extracting all observed pathways we label eachpath with the response label of the original microarrayexperiment. Once this is completed for all observationsit is possible to set up a supervised classification pro-blem where the response vector y denotes the responselabel of each pathway, and the predictor matrix X is anN × P binary matrix of pathways, where N is the num-ber of pathways and P is the number of edges withinthe network. The binary predictor matrix, X and itsresponse y can now be directly analyzed by our pro-posed pathway classifier, HME3M, and also with stan-dard supervised techniques. We assess the performanceof HME3M in both simulated and real data environ-ments and compare it to PLR and Support VectorMachines (SVM) with three types of kernels, linear,polynomial (degree = 3) and radial basis. The implemen-tation of SVM used for these experiments is sourcedfrom the R package e1071 [11].We point out here that the predictor matrix X is a list

of all pathways through the network observed withinthe original dataset. Therefore X contains all availableinformation on the given network structure containedwithin the original dataset. Using this information asinput into the PLR and SVM models is supplying thesemethods with the same network information that is pro-vided to the HME3M model. As the supplied informa-tion is the same for all models the comparison is fair.The performance of the models are expected to differbecause SVM and PLR do not consider the Markov nat-ure of the input pathways whereas HME3M explicitlymodels this property with a first order Markov mixturemodel.Experiments comparing HME3M to standard classifi-

cation techniques are performed first on simulated net-work pathways and then on real metabolic pathwaysand microarray expression data. We now describe thedetails of each experiment.

Synthetic DataTo construct the simulation experiments we assumethat the dataset is comprised of dominant pathways thatdefine the groups and random noise pathways. Toensure that the pathway structure is the major informa-tion within the dataset, we specify the network structureand simulate only the binary pathway information. Adominant pathway is defined as a frequently observedpath within a response class. The level of expression ofa dominant pathway is defined to be the number oftimes it is observed within a group. A noise pathway isdefined to be a valid pathway within the network thatleads from the start to the end compounds but is notany of the specified dominant pathways. As the percentof noise increases, the relative expression of the domi-nant paths decreases, making correct classificationharder.We run the simulation experiments on three graphs

with the same structure but with increasing complexitiesas shown in Figure 2. For each network we define twodominant pathways for each response label, y = 0 and y= 1 and give each dominant pathway equal pathwayexpression levels. We simulate a total of 200 pathwaysper response label which includes observations from thetwo dominant pathways and noise pathways. Separatesimulations are then performed for the specified noisepathway percentages [10, 20, 30, 40, 50]. The perfor-mance of each method is evaluated with 10 runs of 10-fold cross-validation. The performance differencesbetween HME3M compared to SVM and PLR are thentested with paired sample t-tests using the test set per-formances from the cross-validation. We set theHME3M parameters to be M = [2,3], l = 1, a = 0.5.KEGG NetworksTo assess the performance of HME3M in a realistic weuse two different metabolic networks both extractedfrom KEGG [1] for the Arabidopsis thaliana plant. Thenetworks are selected for their differing structure andcomplexity. We deliberately use Arabidopsis as it hasbecome a benchmark organism and it is well knownthat during the developmental stages and under stressconditions, different components of core metabolicpathways are activated. The first is glycoloysis (Figure 3)which is a simple left to right style network and the sec-ond is the pentose phosphate pathway (Figure 4) whichis a simple directed cycle. Due to the large number ofpaths extracted for the KEGG networks to assess theperformance of HME3M we conduct 20-fold inversecross-validation for model sizes M = 2 to M = 10.Inverse 20-fold cross-validation firstly divides the obser-vations randomly into 20 groups and then for each


of 9

group trains using only observations from one groupand tests the performance on the observations from theother 19. The performance of HME3M for 20-foldinverse cross-validation is compared to PLR and theSVM models.KEGG Arabidopsis Glycolysis PathwayIn Figure 3 we extract from KEGG the core componentof the glycolysis network for Arabidopsis betweenC00668 (Alpha-D-Glucose) and C00022 (Pyruvate). Theextracted network in Figure 3 is a significantly morecomplex graph than our simulated designs and has103680 possible pathways between C00668 and C00022.We extract the gene expression observations for allgenes on this pathway from the AltGenExpress develop-ment series microarray expression data [12] downloadedfrom the ArrayExpress database [2]. The AltGenExpressdevelopment database [12] is a microarray expressionrecord of each stage within the growth cycle of Arabi-dopsis and contains expression observations of 22814genes over 79 replicated conditions. For our purposeswe extract observations for “rosette leaf” (n = 21) and“flower” (n = 15) and specify “flower” to be target class(y = 1) and “rosette leaf” to be the comparison class (y= 0). For the glycolysis experiment we set the HME3Mparameters to be: l = 1 and a = 0.7.To extract binary instances of the glycolysis pathway

within our extracted data we scale the observations tohave a mean of zero and standard deviation of 1. Afterscaling the expression denote active genes within thenetwork using three tolerances [-0.1, 0, 0.1] and con-struct three separate datasets. Within each dataset weset any gene expression observation that is above thespecified tolerance to be “1” or overexpressed, otherwisewe set its value to “0” or underexpressed. The structureof each pathway dataset is presented in Table 1. This isa simple discretization as it requires no additional infor-mation from the response or external conditions thatmight limit the number of paths selected. We deliber-ately choose this simple discretization of the geneexpressions as it provides a highly noisy scenario to testthe performance of HME3M.KEGG Arabidopsis Pentose Phosphate PathwayIn Figure 4 we extract from KEGG the core componentof the pentose phosphate network for Arabidopsisbetween C00668 (Alpha-D-Glucose) and C00118 (D-Gly-ceraldehyde 3-Phosphate). The extracted network ismore complex again than the glycolysis network andhas 1305924 possible pathways between C00668 andC00118. We extract the gene expression observationsfor all genes on this pathway from the AltGenExpressabiotic stress microarray expression data [13].

The AltGenExpress abiotic stress database [12] con-tains gene expression measurements on the responses ofthe “Shoots” or “Roots” of Arabidopsis to various stressstimuli. For our purposes we extract observations forArabidopsis “Shoots” in both the oxidative stress andcontrol groups for all observed times from 0.25 to 3hours. This results in six experiments from the “Oxida-tive” (n = 6) and 10 experiments from the “Control” (n= 10) and we specify “Oxidative” to be target class (y =1) and “Control” to be the comparison class (y = 0).We select this particular subset of the AltGenExpress

abiotic stress as observations on the metabolite abun-dance for the pentose phosphate pathway [14] clearlyshow that within the first 3 hours of exposure to oxida-tive stress a significant increase in the abundance ofC00117 (D-Ribose 5-phosphate) is observed. In [14] itwas suggested that this increase was a result of anincrease in the flux through the oxidative branch of thepentose phosphate pathway (Figure 4). In this paper wetry to confirm this observation within the AltGenEx-press abiotic stress with HME3M.To extract binary instances of the pentose phosphate

network within our extracted data we scale the observa-tions to have a mean of zero and standard deviation of1. After scaling the expression denote active geneswithin the network using three tolerances [0, 0.05, 0.1]and construct three separate datasets. The structure ofeach pathway dataset is presented in Table 2. We usedifferent tolerances to the glycolysis pathway experi-ments due to the excessively large number of pathwaysextracted for negative tolerance values Table 2. For thepentose phosphate experiment we set the HME3M para-meters to be: l = 2 and a = 1.

Results and DiscussionSynthetic DataFor the synthetic data the correct classification rate(CCR) percentages, ranges and paired sample t-testresults for simulated graphs are shown in Table 3. Allexperiments show HME3M outperforming the trialledSVM kernels and a single PLR model. In fact, the onlytimes when the performances of SVM and HME3M areequivalent (P-value < 0.05) is with the small or mediumgraph with high levels of within group noise. Of particu-lar note is the observation that for the medium andlarge graphs the median performance for HME3M isalways superior to SVM. Furthermore, as the graphcomplexity increases it is clearly seen that HME3M con-sistently outperforms SVM and this performance ismaintained despite the increases in the percent of noisepathways.


of 9

The performance of PLR for the simulated pathways isparticularly poor because the dataset is noisy and binary.PLR can only optimize on these noisy binary variablesand is supplied with no additional information such asthe kernels of the SVM models and the pathway infor-mation of HME3M. Additionally, the L2 ridge penalty isnot a severe regularization and will estimate coefficientsfor pure noise pathway edges. Combining the lack ofinformation within the raw binary variables with thenature of L2 regularization, it is clear in this case thatPLR will overfit and lead to poor performance.Table 3 also demonstrates that as you increase the

number of mixture components in the HME3M model,M, the model’s resistance to noise increases. Theincreased robustness of HME3M is observed in theincrease in median performance from M = 2 to M = 3when the noise levels are 30% or more (≥ 0.3). A sup-porting observation of particular note is that when theperformances of HME3M with M = 2 is compared withthe linear kernel SVM on the medium graph and 50%noise there is no significant difference between themodel’s performances. However, by increasing M to 3,HME3M is observed to significantly outperform linearkernel SVM. Further, in a similar but less significantcase, for the small graph with 50% added noise, byincreasing M from 2 to 3 the median performance ofHME3M becomes greater than that of linear kernelSVM. Although this increase did not prove to be signifi-cant the observed increasing trend within the medianperformance is clearly driving the results of the t-test.It is noticeable in Table 3 that the HME3M perfor-

mance can be less precise than SVM or PLR models.However the larger range of CCR performances is notlarge enough to affect the significance of the perfor-mance gains made by HME3M. The imprecision ofHME3M in this case is most likely due to the constantspecification of l, a and M over the course of the simu-lations. In the microarray data experiments we showthat careful choice of M produces stable model perfor-mances with a comparable CCR range than the nearestSVM competitor.KEGG Arabidopsis Glycolysis PathwayThe glycolysis experiment results are displayed in Figure5. Figure 5 presents the mean correct classification rates(CCR) for HME3M and comparison methods for eachpathway dataset built from the three trailed gene activitytolerances. The number of mixture components M isvaried from 2 to 10. It is clear from Figure 5 that for alltolerances the mean CCR for HME3M after M = 2 isconsistently greater than all other methods and the opti-mal performance being observed at M = 4. An

interesting feature of Figure 5 is that after the optimalperformance has been reached, the addition of morecomponents seems to not affect the overall classificationaccuracy. This shows HME3M to be resistant to overfit-ting and complements the results of the noise simula-tion experiments in Table 3.The ROC curves for each HME3M component are

presented in Figure 6 and clearly show that the thirdcomponent is the most important with an AUC of0.752, whereas the other three components seem tohold limited or no predictive power. A bar plot of theHME3M transition probabilities (θm) for the third (m =3) component is presented in Figure 7. Overlaying thetransition probabilities from Figure 7 onto the full net-work in Figure 3 it is found that for three transitionsonly single genes are required for the reaction to pro-ceed:

• C CAT G00111 001182 21180 • C C CAT G AT G00197 00631 000741 09780 1 74030

A further analysis of the genes identified reveals theinteraction between AT1G09780 (θ = 1) andAT1G74030 (θ = 0.969) is of particular importance instress response of Arabidopsis. A literature search onthese genes identified both AT1G09780 andAT1G74030 as important in the response of Arabidopsisto environmental stresses such as cold exposure, saltand osmotic stress [15,16]. However, AT2G21180, apartfrom being involved in glycolysis, has not previouslybeen found to be strongly involved in any specific biolo-gical function. Interestingly however, a search of TAIR[17] revealed that AT2G21180 is found to be expressedin the same growth and developmental stages as well asin the same plant structure categories as bothAT1G09780 and AT1G74030. These findings are indica-tive of a possible relationship between these three genesin particular in the response to environmental stress.The second path connecting compounds C00197

through C00631 to C00074 is found by HME3M tohave a high probability of being differently expressedwhen comparing glycolysis in flowers and rosette leaves.The branching of glycolysis at Glycerate-3P (C00197)through to Phosphoenol-Pyruvate (C00074) correspondsknown variants of the glycolysis pathway in Arabidopis;the glycolysis I pathway located in the cytosol and theglycolysis II pathway located in the plastids [17]. Thekey precursor that leads to the branching within cytosolvariant by the reactions to convert Beta-D-Fructose-6P(C05378) to Beta-D-Fructose-1,6P (C05378) usingdiphosphate rather than ATP [17]. Referencing the


of 9

included pathway genes in Figure 7 within the referenceArabidopsis database TAIR [17] we observe that thegenes specific to the percursor reactions for the cytosolvariant of glycolysis are included within the pathway, i.e.the genes [AT1G12000, AT1G20950, AT4G0404] forconverting beta-D-fructose-6P (C005345) into beta-D-fructose-1,6P2 (C005378) utilizing diphosphate ratherthan ATP. HME3M’s identification of the plant cytosolvariant of the glycolysis pathway confirms this pathwayas a flower specific, because the plastids variant is clearlymore specific to rosette leaves due to their role inphotosynthesis.KEGG Arabidopsis Pentose Phosphate PathwayThe classification performance rates for all methods toclassify oxidative stress and control pathways within thepentose phosphate pathway for each tolerance level arepresented in Figure 8. It is clearly observed from Figure8 for tolerance levels 0.05 and 0.1 HME3M is outper-forming all comparison models for all values of M.However for tolerance 0 we initially observe the polyno-mial and radial SVM kernels outperforming bothHME3M and linear SVM. However as M increases weobserve the performance of HME3M to steadily increaseand finally after M = 9 HME3M is slightly outperform-ing both radial and polynomial SVM. This performanceprofile is an indication of the degree of noise within thedataset. The number of pathways identified for a toler-ance of 0 is quite large, 63002 (Table 2), and decreasingslightly this tolerance level to -0.05 is seen to double thenumber of pathways extracted. Therefore it is reason-able to suggest that setting a tolerance of 0 is just at theedge of the pathway structure distribution below whichexcessive amounts of noise pathways are extracted.In contrast increasing the tolerance level to 0.1 we

observe a decrease in the performance of HME3M as Mis increased from M = 2 to M = 4 (Figure 8). Thisuncharacteristic drop in performance of HME3M is theresult of insufficient variation within the pathway dataset.This assertion is supported by HME3M finding the opti-mum model over all datasets at tolerance of 0.05. How-ever when the gene activity tolerance is increased to 0.1the optimal performance observed at a tolerance of 0.05is never reached. Therefore increasing the tolerance to0.1 is removing important pathways are required to pro-duce the optimal model. HME3M then attempts to com-pensate for this lack of variation within the pathwaysobserved at a tolerance of 0.1 by overfitting. This overfit-ting then leads to the decrease in performance observedas the model complexity of HME3M is increased.From Figure 9 we observe that the ROC curves for the

optimal HME3M model (M = 2 tolerance = 0.05) clearly

indicate one path for the oxidative label and anotherpath for the control label. An interesting property of theROC curves of each path is that the structure of m = 1is almost exactly opposite to m = 2. The cause of thisinverse similarity between the ROC curves is that asimilar path is identified by each 3M component (θm = 1

and θm = 2 are correlated at r = 0.52) for both m = 1and m = 2 but the signs of the PLR coefficients withineach expert are flipped. In Table 4 we show the distri-bution of signs of the PLR coefficients for each of thetwo components. From Table 4 we see that for all caseswhen bm = 1 < 0 there is a 45% chance that the sign ofthe PLR coefficent is positive in path m = 2. The highcorrelation between the estimated pathway structureindicates that the same path is being found for both m= 1 and m = 2. However the flipping of the signs withinthe PLR coefficients changes the structure of m = 1 topredict the control label when the oxidative path incomponent m = 2 is not observed. The pathway dupli-cation indicates that the main structure within the data-set is the activated oxidative pathway observed whenArabidopsis is under stress and the control group con-tains mainly noise pathways with little unique structure.To visualize the oxidative class pathway we overlay the

transition probabilities onto the pentose phosphate net-work (Figure 4) and clearly see the oxidative branchfrom C00668 to C00117 (D-Ribose-5P) is highlighted(Figure 10). The transition probabilities estimated byHME3M confirm the observations of [14] and show thatwhen Arabidopsis is under oxidative stress the pentosephostphate pathway is clearly coordinated to produceD-Ribose-5P. However we observe that no single genetransitions can define the pathway but a coordinated setof genes that determine the path taken when the pen-tose phosphate cycle is subjected to oxidative stress.

ConclusionsIn this paper we have presented a novel approach for thedetection of dominant pathways within a network struc-ture for binary classification using the Markov mixture ofexperts model, HME3M. Simulations clearly showHME3M to outperform both PLR and SVM with linear,polynomial and radial basis kernels. When applied toactual metabolic networks with real microarray dataHME3M not only maintained its superior performancebut also produced biologically meaningful results.Naturally it would be interesting to explore the perfor-

mance of HME3M in other contexts where the proper-ties of the datasets and networks are different. Futurework on HME3M could be to assess the performance ofdifferent pathway activity definitions, other than simply


of 9

over expressed genes. Furthermore, the 3M componentof HME3M is also able to be extended to include othergene information such as protein class and function.Incorporating additional information on specific genefunctions or using different pathway definitions wouldallow HME3M to examine metabolic pathways at severalresolutions and help improve the understanding of theunderlying dynamics of the metabolic network.

MethodsHierarchical Mixture of Experts (HME)A HME is an ensemble method for predicting theresponse where each model in the ensemble is weightedby probabilities estimated from a hierarchical frameworkof mixture models [18]. Our model is the simplest twolevel HME, where at the top is a mixture model to findclusters within the dataset, and at the bottom are theexperts, weighted in the direction of each mixing com-ponent, used to classify a response. Given a responsevariable y and predictor variables x, a 2-layer HME hasthe following form,

p y x p m x p y xm m m m

m

M

( | , , , , , , ) ( | , ) ( | , ). 1 1

1

(1)

where bm are the parameters of each expert and θmare the parameters of mixture component m. A HMEdoes not restrict the source of the mixture weights p(m|x, θm) and as such can be generated from any modelthat returns posterior component probabilities for theobservations. Taking advantage of this flexibility we pro-pose a HME as a method to supervise the Markov mix-ture model for metabolic pathways 3M [9]. CombiningHME with a Markov mixture model first employs theMarkov mixture to find dominant pathways. Posteriorprobabilities are then assigned to each sequence basedon its similarity to the dominant pathway. These arethen passed as input weights into the parameter estima-tion procedure within the supervised technique. Usingthe posterior probabilities of 3M to weight the para-meter estimation of each supervised technique is ineffect localizing each expert to summarize the predictivecapability of each dominant pathway. Therefore incor-porating the 3M Markov mixture model within a HMEis creating a method capable of combining networkstructures with standard data table information. Wenow formally state the base 3M model and provide thedetail of our proposed model, Hierarchical MixtureExperts 3M (HME3M) classifier.

3M Mixture of Markov ChainsThe 3M Markov mixture model assumes that pathwaysequences can be represented with a mixture of firstorder Markov chains [9]. The full model form spanningM components estimating the probabilities of T transi-tions is,

p x p m x

p c p c x c

m

m

m m

m

M

t t t tm

t

T

( ) ( | , )

( | ) ( , | ; )

1

1 1

1

1

2

(2)

where πm is the mixture model component probabil-ity, p(c1|θ1m) is the probability of the initial state c1, andp(ct, xt|ct-1, θtm) is the probability of a path traversingthe edge xt linking states ct-1 and ct. The 3M model issimply a mixture model and as such its parameters areconveniently estimated by an EM algorithm [9]. Theresult of 3M is M mixture components, where eachcomponent, m, corresponds to a first order Markovmodel defined by θm = {θ1m, [θ2m, ..., θtm, ..., θTm]}which are the estimated probabilities for each transitionalong the mth dominant path.HME3MThe HME model combining 3M and a supervised tech-nique for predicting a response vector y can be achievedby using the 3M mixture probabilities p(m|x, θm) (2),for the HME mixture component probabilities in (1).This yields the HME3M likelihood,

p y x p m x p y x

p y x p c p

m m

m

M

m m m

m

M

( | ) ( | , ) ( | , )

( | , ) ( | ) (

1

1 1

1

cc x ct t t tm

t

T

, | ; ) 1

2

(3)

The parameters of (3) can be estimated using the EMalgorithm by defining the esponsibilities variable him tobe the probability that a sequence i belongs to compo-nent m, given x, θm, bm and y. These parameters areiteratively optimized with the following E and M steps:E-Step: Define the responsibilities him:

h mp m xi m p yi xi m

mp m xi m p yi xi mmMim

( | , ) ( | , )

( | , ) ( | , )1(4)

M-Step: Estimate the Markov mixture and expertmodel parameters:


of 9

(1) Estimate the mixture parameters

m tm

himiN

himiN

mM

xit himiN

himiN

1

11

11

1and

( )(5)

where δ (xit = 1) denotes whether a transition t isactive within observation i, or xit = 1. This conditionenforces the constraint that the probabilities of each setof transitions between any two states must sum to one.Additionally it can be shown that for this model allinitial state probabilities p(c1|θ1m) = 1.(2) Estimate the expert parametersUsing a weighted logistic regression for each expert,

l h h y x log em im im i mT

ix

i

N

m

mT

i( | ) arg max ( )

1

1 (6)

The original implementation of HME estimates theexpert parameters, bm, with the Iterative ReweightedLeast Squares (IRLS) algorithm, where the HMEweights, him are included multiplicatively by furtherreweighting the standard IRLS weights [10]. The IRLSiterations are Newton-Raphson steps with normal equa-tions defined by,

mnew T

mT

m mX W X X W z ( ) 1 (7)

where y is the vector of probabilities p x mold( ; ) and

Wm is a diagonal matrix of weights such thatw h y ymii im i i ˆ ( ˆ )1 and zm is the working response forthe IRLS algorithm z X W y ym m

oldm ( ( )) 1 . How-

ever, in this setting, X is a sparse matrix of binary path-ways where we expect and are explicitly looking fordominant pathways. Thus, simple IRLS maximization of(6) is likely to be inaccurate. Furthermore, the severityof the sparsity within X is compounded by the addi-tional weighting required by the experts’ inclusion intothe HME architecture. These conditions will manifestthemselves in duplicate rows within X, causing rankdeficiency and results in unstable estimates for the para-meters of a logistic regression model. Therefore the sim-ple IRLS scheme proposed by [10] is inappropriate foruse in this case. To overcome the rank deficiency issuewe propose using a regularized form of logistic regres-sion [19].Penalized logistic regression (PLR)Penalized Logistic Regression (PLR) uses a penalty [20]to allow for the coefficients of logistic regression to berun over a sparse or large dataset. In this paper the useof PLR is necessary to overcome the rank deficient nat-ure of the data matrix and allow for stable estimation of

the HME3M parameters. PLR maximizes bm subject toa ridge penalization |bm|2 controlled by l � [0, 2],

( | ) arg max ( ) | |

m im im i m

Ti

xm

i

N

h h y x log em

mT

i

1

22

1

(8)

The size of l directly affects the size of the estimatesfor bm. As l approaches 2 the estimates for bm willbecome more sparse, and as l approaches 0 the esti-mates for bm approach the IRLS estimates. In this casewe choose the ridge penalty for reasons of computa-tional simplicity. The ridge penalty allows the regulariza-tion to be easily included within the estimation by asimple modification to the Netwon-Raphson steps (7).The Iterative Reweighted Ridge Regression (IRRR) equa-tions are given by,

mnew T

mT

m mX W X X W z ( ) 1 (9)

where Λ is a P × P diagonal matrix with l along thediagonal where P is the number of variables in X and zmis the working response as specified in (7).However, another issue is that the Iterative

Reweighted Least Squares algorithm (IRLS) used forestimating the parameters of a PLR is known to beunstable and not guaranteed to converge [20].Furthermore our personal experience of IRLS in the

HME context indicates the need for additional controlover the rate of learning of the experts. This experiencesuggests that if the PLR iterations converge too quicklythe estimates of bm reach a local optimum. A subse-quent effect is the HME likelihood in the followingiterations becomes erratic as the EM responsibilities (4)are dominated by the PLR probabilities p(y|x, bm) whichdo not necessarily reflect the structure within the 3Mparameters. The different rates of convergence betweenthe 3M and PLR parameters can cause instabilities inthe HME3M likelihood. This problem has been notedby [18] and a solution is proposed by the imposition ofa learning rate on the gradient descent form of the IRLSalgorithm. This gradient descent method ensures that ateach iteration, a step will be taken to maximize bm, asufficient condition for the EM algorithm. However thismethod allows for control of the learning rate of theexperts by the imposition of a learning penalty a � [0, 1]on the coefficient updates. The parameter update forgradient descent PLR regularization is then computedby:

mnew

mold T

mT

imX W X X h y y ( ) ( ( )) 1 (10)


of 9

where Λ is a diagonal matrix with the regularizationparameter l along the diagonal and Wm is a diagonalmatrix of observation weights combining informationfrom the IRLS algorithm and the HME architecture.The observation weights are defined to be

W h y ym imii ˆ( ˆ)1 , where ˆ( ˆ)y y1 weights the observa-

tions to optimally predict y by ˆ( )

ye m

T X

1

1 sourced

from the IRLS algorithm, and him are the EM responsi-bilities (4). This update for bm gives control over thesize of the coefficients through l and speed in whichthese parameters are learned through a. It is noted by[18] that this method will converge to the same solutionas the IRLS method, however the effect of a willincrease the number of iterations for convergence. In(10) the action of l is to control the size of each bm byartificially inflating their variance.

AcknowledgementsTimothy Hancock was supported by a Japan Society for the Promotion ofScience (JSPS) fellowship and BIRD. Hiroshi Mamitsuka was supported in partby BIRD of Japan Science and Technology Agency (JST).

Authors’ contributionsTH and HM developed the method and conceived the experimentaldesigns. TH implemented the method and performed the experiments. Allauthors read and approved the final manuscript.

Competing interestsThe authors declare that they have no competing interests.

Received: 12 August 2009Accepted: 4 January 2010 Published: 4 January 2010

References1. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes.

Nucleic Acids Res 2000, 28:27-30.2. Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J,

Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG,Oezcimen A, Rocca-Serra P, Sansone S: ArrayExpress-a public repositoryfor microarray gene expression data at the EBI. Nucl Acids Res 2003,31:68-71.

3. Pang H, Lin A, Holford M, Enerson B, Lu B, Lawton MP, Floyd E, Zhao H:Pathway analysis using random forests classification and regression.Bioinformatics 2006, 22(16):2028-36.

4. Pireddu L, Poulin B, Szafron D, Lu P, Wishart DS: Pathway Analyst -Automated Metabolic Pathway Prediction. Proceedings of the 2005 IEEESymposium on Computational Intelligence 2005http://metabolomics.ca/News/publications/2005cibcb-path.pdf.

5. Jordan M: Learning in Graphical Models Norwell, MD: Kluwer AcademicPublishers 1998.

6. Imoto S, Goto T, Miyano S: Estimation of genetic networks and functionalstructures between genes by using Bayesian networks andnonparametric regression. Proc Pac Symp on Biocomputing 2002, 7:175-186.

7. Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks toanalyze expression data. RECOMB 2000, 127-135.

8. Evans WJ, Grant GR: Statistical methods in bioinformatics: An introductionNew York: Springer, 2 2005.

9. Mamitsuka H, Okuno Y, Yamaguchi A: Mining biologically active patternsin metabolic pathways using microarray expression profiles. SIGKDDExplorations 2003, 5(2):113-121.

10. Jordan M, Jacobs R: Hierarchical mixtures of experts and the EMalgorithm. Neural Computation 1994, 6(2):181-214.

11. Dimitdadou E, Hornik K, Leisch F, Meyer D, Weingessel A: e1071 - miscfunctions of the Department of Statistics. 2002http://cran.r-project.org/.

12. Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M,Schölkopf B, Weigel D, Lohmann JU: A gene expression map ofArabidopsis thaliana development. Nature Genetics 2005, 37(5):501-506.

13. Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D’Angelo C,Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stressexpression data set:protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. The Plant Journal 2007, 50:347-363.

14. Baxter C, Redestig H, Schauer N, Repsilber D, Patil K, Nielsen J, Selbig J,Liu J, Fernie A, Sweetlove L: The metabolic response of heterotrophicArabidopsis cells to oxidative stress. Plant physiology 2007, 143:312.

15. Chawade A, Bräutigam M, Lindlöf A, Olsson O, Olsson B: Putative coldacclimation pathways in Arabidopsis thaliana identified by a combinedanalysis of mRNA co-expression patterns, promoter motifs andtranscription factors. BMC Genomics 2007, 8:304.

16. Ndimba BK, Chivasa S, Simon WJ, Slabas AR: Identification of Arabidopsissalt and osmotic stress responsive proteins using two-dimensionaldifference gel electrophoresis and mass spectrometry. Proteomics 2005,5(16):4185-4196.

17. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M,Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S,Swing V, Tissier C, Zhang P, Huala E: The Arabidopsis InformationResource (TAIR): gene structure and function annotation. Nucl Acids Res2007, 36:D1009-14.

18. Waterhouse SR, Robinson AJ: Classification Using Mixtures of Experts. IEEEWorkshop on Neural Networks for Signal Processing 1994, , IV: 177-186.

19. Park MY, Hastie T: Penalized logistic regression for detecting geneinteractions. Biostatistics 2008, 9(1):30-50.

20. Hastie T, Tibshirani R, Friedman J: Elements of Statistical Learning New York:Springer 2001.

doi:10.1186/1748-7188-5-10Cite this article as: Hancock and Mamitsuka: A markov classificationmodel for metabolic pathways. Algorithms for Molecular Biology 2010 5:10.

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral


of 9

http://www.ncbi.nlm.nih.gov/pubmed/10592173?dopt=Abstract




http://metabolomics.ca/News/publications/2005cibcb-path.pdf

http://metabolomics.ca/News/publications/2005cibcb-path.pdf

http://cran.r-project.org/



















http://www.biomedcentral.com/

http://www.biomedcentral.com/info/publishing_adv.asp

http://www.biomedcentral.com/

a markov classification model for metabolic pathways

Documents