
Finding Maximum Colorful Subtrees in practice

Imran Rauf1,∗, Florian Rasche2, François Nicolas2, and Sebastian Böcker2

1 Department of Computer Science, University of Karachi, Karachi, Pakistan [email protected]

2 Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Jena, Germany {florian.rasche,francois.nicolas,sebastian.boecker}@uni-jena.de

Abstract. In metabolomics and other fields dealing with small compounds, mass spectrometry is applied as a sensitive high-throughput technique. Recently, fragmentation trees have been proposed to automatically analyze the fragmentation mass spectra recorded by such instruments. Computationally, this leads to the problem of finding a maximum weight subtree in an edge weighted and vertex colored graph, such that every color appears at most once in the solution. We introduce new heuristics and an exact algorithm for this Maximum Colorful Subtree problem, and evaluate them against existing algorithms on real-world datasets. Our tree completion heuristic consistently scores better than other heuristics, while the integer programming-based algorithm produces optimal trees with modest running times. Our fast and accurate heuristic can help to determine molecular formulas based on fragmentation trees. On the other hand, optimal trees from the integer linear program are useful if structure is relevant, e.g., for tree alignments.

1 Introduction

Mass spectrometry is one of the prevalent technologies for the analysis of metabolites and other small compounds with mass below 1000 Da. The study of such compounds is, for example, relevant in drug design and the search for new signaling molecules and biomarkers [11]. Typically, the analyte is fragmented in the mass spectrometer such that fragment masses and their intensities can be recorded in a fragmentation spectrum. Usually, this spectrum is then compared to a database of reference spectra [13]. Unfortunately, such databases are vastly incomplete [6]. Thus, de novo approaches, which try to identify the compound solely based on its spectrum, are sought. Whereas de novo interpretation of protein spectra makes use of their specific structure, this is not possible for metabolites, as they are not structurally restricted.

∗ This work was carried out while the author was a member of the Chair of Bioinformatics at the Friedrich-Schiller-University Jena.


In 2008, Böcker and Rasche [2] suggested fragmentation trees for the de novo interpretation of metabolite tandem mass spectrometry data. Fragmentation trees can be used to identify the molecular formula of a compound [2], as well as to describe the fragmentation process [15]. Fragmentation trees can also be used for the interpretation of multiple mass spectrometry (MSn) [16]. The automated alignment of such fragmentation trees [14] seems to be a particularly promising approach, allowing a fully automated computational identification of small molecules that cannot be found in any database, and paving the way towards natural product discovery, searching for signaling molecules, biomarkers, novel drugs or illegal ones, or other "interesting" organic compounds.

Depending on the application at hand, different aspects of fragmentation tree computation are most relevant: If we want to identify the molecular formula of an unknown [2], then swift computations are mandatory, as we might have to compute hundreds of trees for a single compound. But if we want to (manually or automatically) interpret the structure of fragmentation trees [14, 15], then it is presumably beneficial to use optimal solutions.

Calculating fragmentation trees leads to the Maximum Colorful Subtree problem [2]:

Maximum Colorful Subtree problem. Given a vertex-colored DAG $G = (V, E)$ with colors $C$ and weights $w : E \to \mathbb{R}$. Find the induced colorful subtree $T = (V_T, E_T)$ of $G$ of maximum weight $w(T) := \sum_{e \in E_T} w(e)$.

This is a special case of the edge-weighted Graph Motif problem, see [18] for an overview. Scheubert et al. [16] present the related Colorful Subtree Closure problem for analyzing multiple mass spectrometry data. Ljubić et al. [12] presented an Integer Linear Program for the related Prize-Collecting Steiner Tree problem. The Maximum Colorful Subtree problem is NP-hard [5] as well as APX-hard [3] even on binary trees. Furthermore, on general trees it has no constant factor approximation [3, 19].

This paper presents an experimental study of various heuristics for the Maximum Colorful Subtree problem. We implement a new heuristic along with naive greedy approaches, and three exact algorithms for the problem. The various algorithms are then compared with respect to their running times and quality of the generated trees. As a side result, we prove that even after relaxing the color constraints, the resulting Maximum Subtree problem remains inapproximable within a factor of $O(|V|^{1-\varepsilon})$ for any $\varepsilon > 0$, unless P = NP. Our proof is simpler and improves upon the previous APX-hardness result due to [3] under the same complexity hypothesis. Sikora showed a slightly weaker inapproximability result for unweighted graphs, but maintaining the color constraints [19].

The rest of the paper is organized as follows: The next section defines the Maximum Colorful Subtree problem and describes how it is applied to analyze tandem mass spectra. In Section 3, we discuss the computational complexity of the closely related Maximum Subtree problem, and present exact and heuristic algorithms for Maximum Colorful Subtree. In Section 4, we evaluate these algorithms on three real-world datasets.

2 Computing Fragmentation Trees

Here we outline our method of computing the fragmentation tree of a metabolite compound from its tandem mass spectrum. This leads to a formal definition of the Maximum Colorful Subtree problem. Let $m_1, \dots, m_p$ be the peak masses of a spectrum consisting of $p$ peaks. We first compute a list of possible molecular formulas $\{F_i^1, \dots, F_i^{n_i}\}$ with mass close to $m_i$, for each $1 \le i \le p$. We then construct a directed acyclic graph $G$ whose vertices are the molecular formulas $F_i^j$ of the $i$-th peak, for $1 \le j \le n_i$ and $1 \le i \le p$. Next, we label the vertices with colors from $\{1, \dots, p\}$, where all vertices corresponding to the $i$-th peak are assigned color $i$. We connect two vertices $u, v$ with a directed edge $uv$ if the molecular formula of $v$ is a sub-formula of the formula of $u$, i.e., $v$ contains at most as many atoms of every chemical element as $u$. The edge $uv$ thus represents a possible fragmentation step of the molecule associated with $u$. Note that $G$ is acyclic, since the sub-formula relation used to define edges is a partial order.

Let us assume that the largest peak mass is $m_1$, which corresponds to the unfragmented compound whose fragmentation tree we want to compute. We further assume without loss of generality that $n_1 = 1$; otherwise we can consider a separate graph for each molecular formula explaining the peak mass $m_1$. So, we may assume that there is only a single source vertex $r$ associated with the unfragmented compound. We delete all other vertices without incoming edges from the graph, since they cannot be part of the fragmentation process. Consequently, $G$ is a directed acyclic graph (DAG) with unique source $r$ such that every other node is reachable from $r$.
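The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: formulas are represented as hypothetical element-to-count dictionaries, the names `is_subformula` and `build_fragmentation_graph` are invented for this sketch, peak 0 is assumed to be the unfragmented compound with a single candidate formula, and the repeated deletion of source vertices is replaced by the equivalent step of keeping only vertices reachable from the root.

```python
def is_subformula(sub, sup):
    """True if `sub` has at most as many atoms of every element as `sup`."""
    return all(n <= sup.get(element, 0) for element, n in sub.items())


def build_fragmentation_graph(candidates):
    """Build the colored DAG from candidate formulas.

    candidates[i] is the list of candidate formulas (dict: element -> count)
    for peak i. Vertex ids are (peak index, candidate index) pairs and the
    color of a vertex is its peak index.
    """
    formula = {(i, j): f
               for i, forms in enumerate(candidates)
               for j, f in enumerate(forms)}
    root = (0, 0)
    # Edge u -> v whenever the formula of v is a proper sub-formula of u.
    succ = {u: [v for v in formula
                if formula[v] != formula[u]
                and is_subformula(formula[v], formula[u])]
            for u in formula}
    # Keep only the vertices reachable from the root (depth-first search).
    reachable, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in reachable:
                reachable.add(v)
                stack.append(v)
    color = {v: v[0] for v in reachable}
    edges = {(u, v) for u in reachable for v in succ[u]}
    return reachable, color, edges
```

Edge weights are deliberately left out here, since the scoring of fragmentation steps is taken from [15] and is not specified in this section.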

A subtree of G is colorful if any two vertices in the tree have different colors. A colorful subtree of G rooted at r explains each peak in the input mass spectrum at most once. The approach in [15] works by scoring edges by log likelihoods of the fragmentation steps they correspond to. Thus, edges may have negative weights. The maximum colorful subtree rooted at r will then be the most likely explanation of the given mass spectrum.

Fig. 1. Input graph calculated from the spectrum of Reserpine (Micromass dataset). Formulas of identically colored nodes have nearly identical masses. Edges are drawn between vertices if the molecular formula of one vertex is a sub-formula of the molecular formula of the other.

3 Complexity results and algorithms

We now consider a relaxation of the Maximum Colorful Subtree problem obtained by dropping the color constraints. The resulting Maximum Subtree problem asks to find a maximum weight tree rooted at a particular vertex in the given edge-weighted DAG $G = (V, E)$. It turns out that this simpler problem is still NP-hard. In fact, this problem is NP-hard to approximate within a factor of $O(|V|^{1-\varepsilon})$ for any $\varepsilon > 0$.

Theorem 1. There is no $O(|V|^{1-\varepsilon})$ approximation algorithm for the Maximum Subtree problem for any $\varepsilon > 0$, unless P = NP.

Note that the inapproximability result from Theorem 1 also holds for the Maximum Colorful Subtree problem, since it is a generalization of the Maximum Subtree problem. We defer the proof of Theorem 1 to the full version of this paper.

Now, we briefly review exact algorithms and heuristics for the Maximum Colorful Subtree problem. For vertices $u$ and $v$, let $c(v)$ be the color assigned to $v$ and $w(u, v) \in \mathbb{R}$ the weight of the edge $uv$. Throughout the rest of the paper we denote the number of colors in $G = (V, E)$ by $k$. Note that $k \le p$ can be as large as the number of peaks in the spectrum; but we can also choose a smaller $k$ to decrease running times, limiting our attention to, say, the $k$ most intense peaks.

3.1 Exact Methods

The problem can be solved exactly using dynamic programming over vertices and color subsets [4]. Let $W(v, S)$ be the maximal score of a colorful tree with root $v$ and color set $S \subseteq C$. Now, table $W$ can be computed by the following recurrence [2]:

$$W(v, S) = \max \begin{cases} \max\limits_{u \,:\, c(u) \in S \setminus \{c(v)\},\ vu \in E} W(u, S \setminus \{c(v)\}) + w(v, u) \\ \max\limits_{(S_1, S_2) \,:\, S_1 \cap S_2 = \{c(v)\},\ S_1 \cup S_2 = S} W(v, S_1) + W(v, S_2) \end{cases}$$

where, obviously, we have to exclude the cases $S_1 = \{c(v)\}$ and $S_2 = \{c(v)\}$ from the computation of the second maximum. Using the above recurrence with the initial condition $W(v, \{c(v)\}) = 0$, we can compute a maximum colorful tree in $O(3^k k |E|)$ time and $O(2^k |V|)$ space. The exponential running time and space make the algorithm useful only for small instances. The running time can be somewhat improved to $O(2^k \cdot \mathrm{poly}(|V|, k))$ by using the Möbius transform and the inversion technique of Björklund et al. [1]. However, the technique only works for suitably small integer weights.
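To make the recurrence concrete, the following Python sketch evaluates $W(v, S)$ top-down with memoization, using integers as bitmasks for color sets. It is a minimal illustration under assumptions: colors are taken to be the integers $0, \dots, k-1$, the graph is given as a hypothetical adjacency dictionary `out_edges`, and only the optimal score is returned (recovering the tree itself would need an extra traceback).

```python
from functools import lru_cache

NEG_INF = float("-inf")


def max_colorful_subtree_score(root, color, out_edges):
    """Score of a maximum colorful subtree rooted at `root`.

    color[v] is an integer in 0..k-1; out_edges[v] is a list of
    (u, weight) pairs, one for every edge v -> u of the DAG.
    """
    k = max(color.values()) + 1

    @lru_cache(maxsize=None)
    def W(v, S):
        cv = 1 << color[v]
        if S == cv:                      # initial condition W(v, {c(v)}) = 0
            return 0.0
        rest = S & ~cv                   # S \ {c(v)}
        best = NEG_INF
        # First case: v has one child u whose subtree uses exactly `rest`.
        for u, weight in out_edges.get(v, ()):
            if (rest >> color[u]) & 1:
                best = max(best, W(u, rest) + weight)
        # Second case: glue two trees rooted at v; S1 and S2 share only c(v).
        sub = (rest - 1) & rest          # largest proper submask of `rest`
        while sub:
            best = max(best, W(v, sub | cv) + W(v, (rest ^ sub) | cv))
            sub = (sub - 1) & rest
        return best

    root_bit = 1 << color[root]
    return max(W(root, S) for S in range(1 << k) if S & root_bit)
```

Infeasible table entries evaluate to negative infinity, so they never win a maximum. Enumerating proper non-empty submasks of `rest` automatically excludes the cases $S_1 = \{c(v)\}$ and $S_2 = \{c(v)\}$.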

Guillemot and Sikora [7] suggest a different approach using multilinear detection [10] for input graphs with unit weights. Their algorithm requires $O(2^k \cdot \mathrm{poly}(|E|, k))$ time and only polynomial space. The algorithm can be adapted to graphs with integer weights in a straightforward manner, but the resulting algorithm would be pseudo-polynomial, i.e., its running time would depend polynomially on the integer weights, thus making it impractical for our purposes. To the best of our knowledge, neither the above algorithm nor the dynamic programming with Möbius transform of the previous paragraph has been used in implementations.

For small instances a brute-force approach is suggested in [2]. The idea is to find a maximum subtree for each possible combination of vertices forming a colorful set. We then search for a maximum subtree in a colorful DAG. Clearly, when all edge weights are positive, the maximum subtree is a spanning tree. This can be found by a simple greedy algorithm, choosing the maximum weight incoming edge for each vertex but the root. With arbitrary edge weights, the problem becomes NP-hard, see Theorem 1. We solve the problem naively by iterating over all combinations of vertices whose best incoming edge has a negative weight. The brute-force approach is obviously not practical when either the number of combinations is large or when there are many vertices whose maximum incoming edge has negative weight.
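The easy case mentioned above, a colorful DAG with only positive edge weights, can be sketched as follows. The function name and the edge dictionary are assumptions made for this illustration; the enumeration over vertices with negative best incoming edge is not shown. The sketch relies on the DAG property stated in Section 2: every non-root vertex has at least one incoming edge, so picking one incoming edge per vertex yields a tree rooted at r.

```python
def max_spanning_tree_in_dag(root, edges):
    """Pick the heaviest incoming edge for every vertex except the root.

    edges maps (u, v) to the weight of the edge u -> v of a DAG in which
    every non-root vertex has an incoming edge. Returns a child -> parent
    map and the total weight of the chosen edges.
    """
    best = {}                                  # v -> (parent, weight)
    for (u, v), weight in edges.items():
        if v != root and (v not in best or weight > best[v][1]):
            best[v] = (u, weight)
    parent = {v: u for v, (u, _) in best.items()}
    total = sum(weight for _, weight in best.values())
    return parent, total
```

Because the graph is acyclic, following parent pointers from any vertex must end at the only vertex without a parent, which is the root, so no cycle check is needed.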

Integer Linear Programming. Let us define a binary variable $x_{uv}$ for each edge $uv$ of the input graph. For each color $c \in C$ let $V(c)$ be the set of all vertices in $G = (V, E)$ which are colored with $c$. Then the following simple Integer Linear Program (ILP) captures the Maximum Colorful Subtree problem:

$$\begin{aligned}
\max\ & \sum_{uv \in E} w(u, v) \cdot x_{uv} & & & (1) \\
\text{s.t.}\ & \sum_{u \text{ with } uv \in E} x_{uv} \le 1 & & \text{for all } v \in V \setminus \{r\}, & (2) \\
& x_{vw} \le \sum_{u \text{ with } uv \in E} x_{uv} & & \text{for all } vw \in E \text{ with } v \ne r, & (3) \\
& \sum_{uv \in E \text{ with } v \in V(c)} x_{uv} \le 1 & & \text{for all } c \in C, & (4) \\
& x_{uv} \in \{0, 1\} & & \text{for all } uv \in E. & (5)
\end{aligned}$$

Constraint set (2) ensures that the feasible solution is a forest, whereas constraint set (4) makes sure that there is at most one vertex of each color present in the solution. Finally, (3) requires the solution to be connected. Note that in general graphs, we would have to ensure that every cut of the graph is connected to some parent vertex. That would require an exponential number of constraints [12]. But since our graph is directed and acyclic, a linear number of constraints suffices.
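As a reading aid for constraints (2) to (4), the sketch below checks whether a given 0/1 assignment, represented as the set of edges with $x_{uv} = 1$, is feasible, and evaluates objective (1). It does not solve the ILP and makes no use of any solver API; the function names and data layout are assumptions of this illustration.

```python
def is_feasible(selected, edges, color, root):
    """Check constraints (2)-(4) for the edge set with x_uv = 1."""
    selected = set(selected)
    if not selected <= set(edges):
        return False
    indegree, per_color = {}, {}
    for (u, v) in selected:
        indegree[v] = indegree.get(v, 0) + 1
        per_color[color[v]] = per_color.get(color[v], 0) + 1
    # (2): every vertex other than the root has at most one selected parent.
    if any(n > 1 for v, n in indegree.items() if v != root):
        return False
    # (3): an edge leaving v != root needs a selected edge entering v.
    if any(u != root and indegree.get(u, 0) == 0 for (u, _) in selected):
        return False
    # (4): at most one selected incoming edge per color class.
    if any(n > 1 for n in per_color.values()):
        return False
    return True


def objective(selected, edges):
    """Objective (1): total weight of the selected edges."""
    return sum(edges[e] for e in selected)
```

For example, selecting only an edge that leaves a non-root vertex with no selected incoming edge violates (3), which is how the formulation rules out pieces that are disconnected from the root.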

3.2 Heuristics

A simple greedy heuristic has been proposed in [2]. It works by considering the edges according to their weights in descending order. The edge being considered is added to the result if it does not conflict with the previously picked edges. The algorithm continues until all positive edges have been considered and the resulting graph is connected. Note that an edge conflicts with another if they either are incoming edges to the same vertex or are incident to different vertices of the same color. Finally, we prune the leaves which are attached by negative weight edges in the resulting spanning tree. We refer to the above heuristic as greedy in the rest of the paper.
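One possible reading of the greedy heuristic is sketched below in Python. Several details are assumptions of this sketch rather than statements from [2]: the function name, the edge dictionary, the rule that edges between two vertices of the same color are skipped, and the fallback of discarding any piece that is still not attached to the root once all edges have been scanned.

```python
def greedy_colorful_tree(root, color, edges):
    """Greedy heuristic sketch; returns a child -> parent map.

    edges maps (u, v) to the weight of the DAG edge u -> v.
    """
    parent = {}
    chosen = {color[root]: root}        # the vertex representing each color

    def compatible(x):
        return chosen.get(color[x], x) == x

    def connected():
        return all(v in parent for v in chosen.values() if v != root)

    for (u, v), weight in sorted(edges.items(), key=lambda item: -item[1]):
        if weight <= 0 and connected():
            break                       # positive edges done, tree connected
        if v == root or v in parent or color[u] == color[v]:
            continue
        if compatible(u) and compatible(v):
            parent[v] = u
            chosen[color[u]] = u
            chosen[color[v]] = v

    def attached(v):                    # does the parent chain reach the root?
        while v != root:
            if v not in parent:
                return False
            v = parent[v]
        return True

    parent = {v: u for v, u in parent.items() if attached(v)}
    while True:                         # prune leaves hanging on negative edges
        inner = set(parent.values())
        bad = [v for v, u in parent.items()
               if v not in inner and edges[(u, v)] < 0]
        if not bad:
            return parent
        for v in bad:
            del parent[v]
```

On a small example where the heaviest edge leads to a vertex whose color blocks a better two-edge path, this sketch returns a suboptimal tree, which illustrates why the performance ratio of the greedy heuristic can fall below 1.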

Another greedy strategy is to consider colors in some ordering and, for the current color, add a vertex of that color that promises the maximum increase of the score and attach it to the already calculated tree. The resulting heuristic, called insertion heuristic in the rest of the paper, begins with only the root as the current partial solution. The heuristic greedily attaches vertices labeled with unused colors. For every vertex u with unused color, and every vertex v already part of the solution, we calculate how much we gain by attaching u to v. To calculate the gain of attaching u to v, we take into account the score of the edge vu, as well as the possibility of rerouting other outgoing edges of v through u. The vertex with maximum gain is then attached to the solution, and edges are rerouted as required. See Algorithm 1 for details. In our implementation, we choose the colors in Line 2 of the algorithm in descending order of the intensities of their corresponding peaks in the real-world datasets.

Algorithm 1 Vertex insertion heuristic
Input: An edge weighted and vertex colored DAG G, r ∈ V(G) and color set C.
Output: A colorful subtree rooted at r.
1: T := {r}.
2: for all c ∈ C such that c does not appear in T do
3:    Let v be a c-colored vertex which maximizes the quantity
         $w(u, v) + \sum_{w(v, x) > w(u, x)} \left( w(v, x) - w(u, x) \right)$,
      where u, x are vertices in T.
4:    Add uv to T.
5:    Add vx and remove ux for all x ∈ T for which w(v, x) > w(u, x) holds.
6: return T.
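Algorithm 1 can be turned into a short Python sketch. The interpretation used here is an assumption: the vertices x considered for rerouting are the current children of u in T for which an edge vx exists in G, and a color for which no vertex can be attached is skipped. The function name, the edge dictionary and the `color_order` parameter (for instance, colors sorted by descending peak intensity) are illustrative choices.

```python
def insertion_heuristic(root, color, edges, color_order):
    """Vertex insertion heuristic sketch; returns a child -> parent map.

    edges maps (u, v) to the weight of the DAG edge u -> v; color_order
    lists the colors in the order in which they are to be inserted.
    """
    parent = {}
    in_tree = {root}
    used = {color[root]}
    for c in color_order:
        if c in used:
            continue
        best = None                     # (gain, u, v, children to reroute)
        for (u, v), weight in edges.items():
            if u not in in_tree or color[v] != c:
                continue
            # Children x of u that are better attached below the new vertex v.
            reroute = [x for x, p in parent.items()
                       if p == u
                       and edges.get((v, x), float("-inf")) > edges[(u, x)]]
            gain = weight + sum(edges[(v, x)] - edges[(u, x)] for x in reroute)
            if best is None or gain > best[0]:
                best = (gain, u, v, reroute)
        if best is None:
            continue                    # no vertex of color c can be attached
        _, u, v, reroute = best
        parent[v] = u                   # Line 4: add uv
        in_tree.add(v)
        used.add(c)
        for x in reroute:               # Line 5: replace ux by vx
            parent[x] = v
    return parent
```

The result depends on the color order: on the same small graph, one order can reproduce the greedy tree while another finds a better tree through rerouting.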

Tree Completion Heuristic. As noted before, the dynamic programming approach of Section 3.1 works only for small inputs. We now present a heuristic that combines DP with the greedy approaches. For a small enough constant b, the heuristic works by first computing the maximum colorful subtree consisting of at most b vertices, which we call the backbone of a candidate solution. Next, we complete the backbone by using one of the greedy heuristics discussed above.

The tree completion heuristic works with Algorithm 1 by initializing T in Line 1 with the computed backbone. Similarly, the greedy heuristic can be used to complete the tree by starting with the backbone and applying the greedy heuristic on the remaining edges. In our experiments, we use the insertion heuristic for tree completion since it achieved consistently better scores. This heuristic is referred to as DPb, where b is the size of the backbone computed exactly.

4 Evaluation results

In our study, we analyze spectra from three real-world datasets. The analysis of spectra from two artificial datasets (one random, one constructed "computationally hard") is deferred to the full version of this paper. The Orbitrap dataset consists of mass spectra of 38 compounds with a mass accuracy of 10 ppm. It includes the 37 compounds used for evaluating fragmentation trees in [15]. The Micromass dataset [8] contains spectra of 100 compounds with an accuracy of 50 ppm, while the QSTAR dataset [2] consists of 36 mass spectra with 20 ppm accuracy. Note that the latter dataset contains only 14.3 peaks per compound on average, whereas the averages for the Orbitrap and Micromass datasets are 75.4 and 51.6 peaks per compound, respectively. The Orbitrap dataset contains 1.03 vertices per color, the Micromass dataset 1.7, and the QSTAR dataset 1.1 vertices per color. Despite these low ratios, the number of colorful vertex sets grows as large as $10^{55}$ for certain instances, due to the combinatorial explosion. Experimental details can be found in the corresponding publications.

For each compound in the above three datasets, we assume that we know the correct molecular formula and construct a directed acyclic graph as described in Section 2. We use the scoring from [15] to weight the edges.

We implemented the exact algorithms based on the dynamic program, the integer linear program and the brute-force approach of Section 3.1. We also implemented the greedy heuristic and the tree completion heuristic with backbone size 10 and 15 (DP10, DP15) that use insertion to complete the backbone. To evaluate a heuristic on an instance of the problem, we consider its performance ratio, i.e., the ratio of the weight of the generated solution to the weight of the optimal solution.

The algorithms are implemented in Java 1.6 using an adjacency list representation for graphs. In the DP algorithm, we use the Java long data type to represent sets of colors as bitsets. This limits the maximum possible size of the color set to 64. Memory usage of this algorithm becomes prohibitive long before this number is reached. The experiments were run on a Lenovo T400 laptop powered by a dual core Intel P8600 at 2.40 GHz with 2 GB of RAM and running Ubuntu Lucid Lynx as the operating system. Our implementation, however, is single threaded and does not exploit the availability of multiple cores in the system. The integer linear programming solver, however, uses multiple cores. We run the experiments with default heap size on a Sun Java server virtual machine and use the Gurobi Optimizer for solving the Integer Linear Programs.³
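The bitset representation of color sets mentioned above can be illustrated with plain integers. The sketch below is written in Python rather than Java and is not the authors' code; the 64-color limit is imposed artificially to mirror a 64-bit long, and all names are hypothetical. It also counts the (set, subset) pairs over k colors, which equals $3^k$ and is consistent with the $3^k$ factor in the running time of the dynamic program.

```python
MAX_COLORS = 64  # mirrors the width of a 64-bit long


def bit(c):
    """Bitmask with only color c set."""
    if not 0 <= c < MAX_COLORS:
        raise ValueError("color index does not fit into 64 bits")
    return 1 << c


def submasks(S):
    """Yield every subset of the color set S, from S down to the empty set."""
    sub = S
    while True:
        yield sub
        if sub == 0:
            break
        sub = (sub - 1) & S             # next smaller subset of S


def count_set_subset_pairs(k):
    """Number of pairs (S, S') with S' a subset of S, over k colors."""
    return sum(1 for S in range(1 << k) for _ in submasks(S))
```

Set union, intersection and difference become the bit operations `|`, `&` and `& ~`, so a table indexed by (vertex, color set) only needs one machine word per color set.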

For our applications, an algorithm is sufficiently fast if it runs in less than ten seconds, since this is usually faster than the data can be acquired. Among the exact algorithms, only the ILP managed to solve all instances of our datasets. Using the ILP, the running time stayed under 5.6 minutes per instance, while for about 95% of the instances it terminated in at most 5 seconds.

The brute force algorithm mentioned in Section 3.1 runs fast on most instances of the Orbitrap and QSTAR datasets. Due to the high mass accuracy and the small compound sizes in these datasets there are only few explanations per peak and thus few vertices with the same color in the input graph. Edges with negative weights are rare. Thus, the algorithm terminates in under a second for all but three instances in the Orbitrap and QSTAR datasets. But for two compounds from the Orbitrap dataset and 37 compounds in the Micromass dataset, the algorithm does not terminate in 12 hours and a week, respectively. The DP algorithm (Section 3.1) was able to solve the QSTAR instances exactly, since the number of colors and vertices is small for this dataset.

In Figure 2, we present performance ratios achieved by several heuristics. The tree completion heuristics (DP10 and DP15) work very well with an output tree of weight at least 80 percent of the optimal for the DP15 variant. On the other hand, the greedy heuristic performs worse than both DP10 and DP15. The insertion heuristic performs better than the greedy heuristic on real datasets. We also observe improved performance for the tree completion heuristic as we increase the parameter, i.e., the size of the backbone computed exactly. But the performance increase is only marginal for real-world datasets, as can be seen in Figure 2. The tree completion heuristic becomes infeasible when the size of the backbone is ≥ 25. In this case, more than half of the instances from the Orbitrap and Micromass datasets fail to terminate in less than a week.

³ Gurobi Optimizer 4.5. Houston, Texas: Gurobi Optimization Inc., April 2011.

[Figure 2: stacked bar chart. Vertical axis: percent of instances (0% to 100%). Bars: Insertion, DP10, DP15, Greedy, each for Micromass and Orbitrap. Legend (performance ratio): 0.9 to 1.0, 0.8 to 0.9, 0.7 to 0.8, 0.6 to 0.7, 0.4 to 0.6, 0.2 to 0.4, 0.0 to 0.2, ≤ 0.]

Fig. 2. Performance ratios achieved by different heuristics on Micromass and Orbitrap datasets. Results on the QSTAR dataset were rather uninformative, as nearly all instances reached a performance ratio above 0.9.

The insertion, greedy and DP10 heuristics are fast with running times well under a second, whereas the DP15 heuristic terminates in less than 8 seconds for all instances. The algorithm based on integer programming also finished in at most 16 seconds while it was actually faster on most of the instances. Figure 3 presents the breakdown of datasets depending on how much time it took to solve them using different algorithms. Running times exclude construction of the graph representations from MS data.

5 Conclusions

We have discussed several exact and heuristic algorithms for the Maximum Colorful Subtree problem, and also shown its inapproximability even for colorful input graphs. The Maximum Colorful Subtree problem is relevant for the de novo analysis of metabolite mass spectra.

Experiments on five different sets of graphs from actual metabolite spectra reveal that for smaller test instances, the brute force and dynamic programming based algorithms can calculate the optimal solution quickly. When the structure of the fragmentation tree is of interest, for example for tree alignment [14], it is most probably beneficial to find exact solutions. Our tests show that the integer linear program performs best on this task. When determining the sum formula, hundreds of instances have to be solved for a single compound, but only the scores are relevant. In this case, the tree completion heuristic with parameter 10 provides a good performance ratio of 95% on average, and very fast running times. Increasing the parameter did increase the quality of results, but at the price of highly increased running times.

[Figure 3: stacked bar chart. Vertical axis: percent of instances (0% to 100%). Bars: DP10, DP15, Greedy, ILP, each for Micromass and Orbitrap. Legend (time): ≤ 1 sec, 1 to 2 sec, 2 to 5 sec, 5 to 10 sec, 10 to 15 sec, > 15 sec.]

Fig. 3. Running times taken by different heuristics on Micromass and Orbitrap datasets. Nearly all calculations on the QSTAR dataset finished in less than a second.

Expert evaluations in [15] indicate that trees with optimal objective function are structurally reasonable. Thus, we conjecture that this is generally the case, but, of course, we cannot guarantee that the tree with optimal objective function is the "true" tree. This is true for most problems in computational biology, however, so we deliberately ignored this fact.

Fast calculation of fragmentation trees is a prerequisite for the identification of unknown metabolites from fragmentation mass spectra. It is required for molecular formula identification [15] and the classification of unknowns using fragmentation tree alignment [14]. The algorithms presented here can play an integral role in this fully-automated pipeline.

In the future, we intend to test whether "advanced" ILP techniques, such as branch-and-cut or column generation, will further decrease running times. It also remains open whether a modified version of the ILP can solve the (presumably even harder) Colorful Subtree Closure instances [17] in reasonable time.

Acknowledgments.

IR thanks Tsuyoshi Ito for giving the first idea of the proof of Theorem 1 [9]. Kerstin Scheubert suggested a simplification in the proof. We thank Aleš Svatoš (Max Planck Institute for Chemical Ecology in Jena, Germany), Christoph Böttcher (Institute for Plant Biochemistry in Halle, Germany), and David Grant (University of Connecticut in Storrs, Connecticut) for supplying us with the test data.

References

1. A. Björklund, T. Husfeldt, P. Kaski, and M. Koivisto. Fourier meets Möbius: fast subset convolution. In Proc. of ACM Symposium on Theory of Computing (STOC 2007), pages 67–74. ACM Press New York, 2007.

2. S. Böcker and F. Rasche. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics, 24:I49–I55, 2008.

3. R. Dondi, G. Fertin, and S. Vialette. Maximum motif problem in vertex-colored graphs. In Proc. of Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of Lect. Notes Comput. Sci., pages 221–235. Springer, Berlin, 2009.

4. S. E. Dreyfus and R. A. Wagner. The Steiner problem in graphs. Networks, 1(3):195–207, 1972.

5. M. R. Fellows, J. Gramm, and R. Niedermeier. On the parameterized intractability of motif search problems. Combinatorica, 26(2):141–167, 2006.

6. A. R. Fernie, R. N. Trethewey, A. J. Krotzky, and L. Willmitzer. Metabolite profiling: from diagnostics to systems biology. Nat. Rev. Mol. Cell Biol., 5(9):763–769, 2004.

7. S. Guillemot and F. Sikora. Finding and counting vertex-colored subtrees. In Proc. of Symposium on Mathematical Foundations of Computer Science (MFCS 2010), volume 6281 of Lect. Notes Comput. Sci., pages 405–416. Springer, Berlin, 2010.

8. D. W. Hill, T. M. Kertesz, D. Fontaine, R. Friedman, and D. F. Grant. Mass spectral metabonomics beyond elemental formula: Chemical database querying by matching experimental with computational fragmentation spectra. Anal. Chem., 80(14):5574–5582, 2008.

9. T. Ito. Finding maximum weight arborescence in an edge-weighted DAG. Theoretical Computer Science – Stack Exchange. http://cstheory.stackexchange.com/q/4088/189, Retrieved: Oct 12, 2011.

10. I. Koutis and R. Williams. Limits and applications of group algebras for parameterized problems. In Proc. of International Colloquium on Automata, Languages and Programming (ICALP 2009), volume 5555 of Lect. Notes Comput. Sci., pages 653–664. Springer, Berlin, 2009.

11. J. W.-H. Li and J. C. Vederas. Drug discovery and natural products: end of an era or an endless frontier? Science, 325(5937):161–165, 2009.

12. I. Ljubić, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, and M. Fischetti. Solving the prize-collecting Steiner tree problem to optimality. In Proc. of Algorithm Engineering and Experiments (ALENEX 2005), pages 68–76. SIAM, 2005.

13. H. Oberacher, M. Pavlic, K. Libiseller, B. Schubert, M. Sulyok, R. Schuhmacher, E. Csaszar, and H. C. Köfeler. On the inter-instrument and inter-laboratory transferability of a tandem mass spectral reference library: 1. results of an Austrian multicenter study. J. Mass Spectrom., 44(4):485–493, 2009.

14. F. Rasche, K. Scheubert, F. Hufsky, T. Zichner, M. Kai, A. Svatoš, and S. Böcker. Identifying the unknowns by aligning fragmentation trees. Manuscript, Oct 2011.

15. F. Rasche, A. Svatoš, R. K. Maddula, C. Böttcher, and S. Böcker. Computing fragmentation trees from tandem mass spectrometry data. Anal. Chem., 83:1243–1251, 2011.

16. K. Scheubert, F. Hufsky, F. Rasche, and S. Böcker. Computing fragmentation trees from metabolite multiple mass spectrometry data. J. Comput. Biol., 18(11):1383–1397, 2011.

17. K. Scheubert, F. Hufsky, F. Rasche, and S. Böcker. Computing fragmentation trees from metabolite multiple mass spectrometry data. In Proc. of Research in Computational Molecular Biology (RECOMB 2011), volume 6577 of Lect. Notes Comput. Sci., pages 377–391. Springer, Berlin, 2011.

18. F. Sikora. An (almost complete) state of the art around the graph motif problem. Technical report, Université Paris-Est, France, 2010. Available from http://www-igm.univ-mlv.fr/~fsikora/pub/GraphMotif-Resume.pdf.

19. F. Sikora. Aspects algorithmiques de la comparaison d'éléments biologiques. PhD thesis, Université Paris-Est, 2011.

20. K. Xu and W. Li. Many hard examples in exact phase transitions. Theor. Comput. Sci., 355(3):291–302, 2006.