xtree

CS618: Group 13
Automatic X-tree from R-tree

Mohit Kumar Garg Shivanshu Agarwal
Student Id : 27 Student Id : 37

Roll No : 11431 Roll No. : [email protected] [email protected]

Dept. of CSE Dept. of CSE
Indian Institute of Technology, Kanpur

Final report
26th April, 2015

Abstract

Index structures are generally built assuming a uniform query distribution. We propose an index structure that dynamically modifies itself according to the query distribution. In this project we build a variation of the X-tree that is constructed from an R-tree at query time. It modifies itself on the basis of the queries it receives in order to give better performance for future queries.

1 Introduction and Problem Statement

When query operations are performed on high dimensional data, the probability of accessing any given node is very high, so a large number of internal and leaf nodes are accessed in every query (the curse of dimensionality). Since the R-tree is a disk based structure, it performs a large number of random I/O operations per query. This results in poor performance of the R-tree on high dimensional data. The X-tree was proposed to tackle this problem. It introduced the concept of a supernode: if a node overflows, then rather than splitting it, a new bigger node may be created. When two nodes are accessed together most of the time, it is a good idea to merge them in order to save one random I/O operation. The criterion for forming a supernode takes into consideration the overlap volume of the split nodes. This intrinsically assumes that if the queries are uniformly distributed, then the probability of accessing two nodes together is proportional to their overlap volume. In AutoXtree, we instead estimate the probability of accessing two nodes together from the knowledge of previous queries. This is an interesting problem because the result is a dynamic structure that gets better as more queries are performed on the tree.

2 Approach

To keep count of the number of accesses of a particular node, every internal node stores a data structure commonAccessArray that records the number of instances in which a pair of its child nodes were accessed together. It is an array of β(β − 1) short integers. We also keep an array accessArray of size β to store the number of accesses of every child node. These arrays consume space in a node, but for high dimensional data this generally does not affect β. With every query we update the counts in commonAccessArray and accessArray. Using these access counts from previous queries, we can compute the probability of two child nodes being accessed together, and we base the decision to form a supernode on this probability of common access. If the probability of joint access is high, we can save one random I/O operation by merging the two nodes into a supernode. If we assume that future queries will follow a similar pattern, the probability of two nodes being accessed together is estimated as

$$P(\mathrm{togetherAccess}(A,B)) = \frac{n(A,B)}{n(A) + n(B)}$$

Here n(A) denotes the number of times node A was accessed, and n(A,B) denotes the number of times node A and node B were accessed together.
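As a small worked example of this estimate, the sketch below computes the ratio from raw counts. The function name `together_access_probability` is an assumption made for illustration and does not come from the implementation. Note that if n(A) and n(B) each count every access of the node, then n(A,B) cannot exceed either of them, so the ratio lies between 0 and 0.5.

```python
def together_access_probability(n_a: int, n_b: int, n_ab: int) -> float:
    """Estimate n(A,B) / (n(A) + n(B)) from access counts."""
    total = n_a + n_b
    if total == 0:
        return 0.0  # no accesses recorded yet, so there is no evidence for merging
    return n_ab / total


# Example: A was accessed 40 times, B 60 times, and 30 queries touched both.
p = together_access_probability(40, 60, 30)
print(p)  # 0.3
```

A threshold χ is then compared against this value, so useful thresholds under this counting scheme are below 0.5.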

Algorithm 1 Accessing an internal node
Input: node v, query (q, r), threshold χ

if v.isInternalNode then
    childrenAccessed ← {}
    for child in v.children do
        if minDistance(child, q) < r then
            childrenAccessed.append(child)
        end if
    end for
    for (c1, c2) in childrenAccessed do
        accessArray[c1]++
        accessArray[c2]++
        commonAccessArray[c1][c2]++
        if commonAccessArray[c1][c2] / (accessArray[c1] + accessArray[c2]) > χ then
            toBeMerged.append(v, c1, c2)
        end if
    end for
end if
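The counting step of Algorithm 1 can be sketched in Python as follows. This is a minimal in-memory illustration, not the project's code: the class `AccessStats`, the helper `min_distance`, and the representation of an MBR as a `(low, high)` pair of coordinate tuples are all assumptions. Unlike the pseudocode, which increments accessArray inside the pair loop, this sketch counts each accessed child once per query so that the counter matches the definition of n(A); that choice is also an assumption.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def min_distance(mbr, q):
    """Minimum Euclidean distance from point q to a box given as (low, high)."""
    low, high = mbr
    return sqrt(sum(max(l - x, 0.0, x - h) ** 2 for l, h, x in zip(low, high, q)))


class AccessStats:
    """Per-node access counters (hypothetical names, kept in memory)."""

    def __init__(self):
        self.access = defaultdict(int)  # n(A), keyed by child index
        self.common = defaultdict(int)  # n(A,B), keyed by an ordered index pair

    def record(self, child_mbrs, q, r, chi):
        """Update counters for one range query; return child pairs above threshold chi."""
        hit = [i for i, mbr in enumerate(child_mbrs) if min_distance(mbr, q) < r]
        for c in hit:
            self.access[c] += 1  # assumption: one increment per child per query
        to_merge = []
        for c1, c2 in combinations(hit, 2):
            self.common[(c1, c2)] += 1
            ratio = self.common[(c1, c2)] / (self.access[c1] + self.access[c2])
            if ratio > chi:
                to_merge.append((c1, c2))
        return to_merge


# Two overlapping children and one far away; the query point lies in the overlap.
children = [((0.0, 0.0), (1.0, 1.0)), ((0.5, 0.5), (1.5, 1.5)), ((5.0, 5.0), (6.0, 6.0))]
stats = AccessStats()
print(stats.record(children, q=(0.75, 0.75), r=0.1, chi=0.4))  # [(0, 1)]
```

With so few observations a single query is enough to cross the threshold; an implementation might want to require a minimum number of accesses before trusting the ratio, but that is a design choice outside what the report specifies.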

We update the commonAccessArray of every node in the above manner and keep a list of the nodes that should be merged. After the query we access these nodes in bottom-up fashion and merge them pairwise, updating the MBR and pointer in their parent node.

Algorithm 2 Maintenance part
Input: toBeMerged

while !toBeMerged.isEmpty() do
    (node, child1, child2) ← toBeMerged.pop()
    Access node
    Create a newNode
    merge child1 and child2 into it
    update the MBR entry in node
    delete child1 and child2
end while
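The merge step of Algorithm 2 amounts to combining the entries of two children and replacing their two MBR entries in the parent with one enclosing box. The sketch below illustrates this with plain dictionaries standing in for nodes; the dictionary layout and the names `union_mbr` and `merge_children` are assumptions for illustration, and disk I/O and the access counters of the merged pair are left out.

```python
def union_mbr(a, b):
    """Smallest box enclosing boxes a and b, each given as (low, high)."""
    low = tuple(min(x, y) for x, y in zip(a[0], b[0]))
    high = tuple(max(x, y) for x, y in zip(a[1], b[1]))
    return (low, high)


def merge_children(node, i, j):
    """Replace children i and j of `node` by one supernode holding both entry lists."""
    a, b = node["children"][i], node["children"][j]
    supernode = {
        "mbr": union_mbr(a["mbr"], b["mbr"]),
        "entries": a["entries"] + b["entries"],
    }
    # Drop the two old children and add the supernode with its updated MBR.
    node["children"] = [c for k, c in enumerate(node["children"]) if k not in (i, j)]
    node["children"].append(supernode)
    return supernode


node = {"children": [
    {"mbr": ((0, 0), (1, 1)), "entries": ["p1", "p2"]},
    {"mbr": ((0.5, 0.5), (2, 2)), "entries": ["p3"]},
]}
merged = merge_children(node, 0, 1)
print(merged["mbr"])          # ((0, 0), (2, 2))
print(len(node["children"]))  # 1
```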

Initially, we tried this approach and found that AutoXtree performed very badly. After analysis we concluded that, because our implementation updated the commonAccessArray of internal nodes at query time, every accessed node had to be written back to disk. The overhead of these write operations was being added to the query time. To handle this problem we created an in-memory data structure to keep the access counts and common access counts of the nodes, and we now update the counts on disk at the end of the query. These operations are therefore part of the maintenance time.
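The idea of deferring counter writes can be illustrated with a small buffer that collects increments in memory and applies them once at the end of a query. This is only a sketch under assumed names: `CounterBuffer`, its `bump` and `flush` methods, and the dictionary that stands in for the on-disk counters are hypothetical, and the `writes` field merely simulates the number of node write-backs.

```python
from collections import Counter


class CounterBuffer:
    """Collects count updates in memory and writes them out once per query."""

    def __init__(self, store):
        self.store = store   # stand-in for on-disk counters: {node_id: Counter}
        self.pending = {}    # node_id -> Counter of increments not yet written
        self.writes = 0      # number of simulated node write-backs

    def bump(self, node_id, key):
        """Record one increment for `key` (a child or a pair of children) of a node."""
        self.pending.setdefault(node_id, Counter())[key] += 1

    def flush(self):
        """Apply all pending increments, one write-back per touched node."""
        for node_id, delta in self.pending.items():
            self.store.setdefault(node_id, Counter()).update(delta)
            self.writes += 1
        self.pending.clear()


disk = {}
buf = CounterBuffer(disk)
buf.bump("n1", ("c0", "c1"))
buf.bump("n1", ("c0", "c1"))
buf.bump("n2", "c3")
buf.flush()  # called at the end of the query, so it counts as maintenance time
print(disk["n1"][("c0", "c1")], buf.writes)  # 2 2
```

Three increments result in two write-backs here because the two updates to node `n1` are coalesced.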

3 Results

We tested AutoXtree on 3 different types of data distributions: Uniform, Exponential, Gaussian.
We tested for different dimensions: 2, 5, 7, 9, 10, 20, 30, 50.
We tried different values of the threshold (χ) for merging the nodes.
We compared the performance with R-tree and X-tree.

Initially, our implementation updated the access values in the node at query time, and the write overhead made the performance very poor. We then introduced a new in-memory data structure and now update the commonAccessArray in the nodes after the query. This significantly improved the query time, as can be seen in Figure 1.

Figure 1: Difference after keeping in-memory data structure

We tested AutoXtree on synthetic datasets of various dimensions. When the dimensionality is low, the chance of two nodes being accessed together is small, so fewer supernodes are formed. Figure 2 plots the ratio of the final number of nodes to the initial number of nodes against dimensionality. For 2 dimensional data, there were 338 nodes before the queries started. After 2000 queries there were 324 nodes, i.e. only 14 supernodes were formed. On low dimensional data AutoXtree performs like an R-tree, but on higher dimensional data, where the probability of common access is high, more supernodes are formed, so far fewer nodes remain after the queries.


Figure 2: Ratio of ‘number of nodes after merging’ to ‘number of nodes before merging’ vs dimension

During query processing we mark the pairs of nodes that cross the threshold and keep them in a list. After the query we merge them. The time taken to merge these nodes and to maintain the index structure by writing back the accessArray is termed the maintenance time. When the dimensionality is low, few supernodes are formed, so the maintenance time is small. As the dimensionality grows, the common access probability rises and the maintenance time increases with it. Figure 3 shows that both the query time and the maintenance time increase with dimensionality, and that the percentage increase in maintenance time is larger than that in query time.

Figure 3: Query time and Maintenance time

At low dimensions the maintenance time is small in comparison to the query time, but at high dimensions it is comparable to the average query time. This is the major bottleneck of the proposed AutoXtree. Optimizing the maintenance step could improve the overall performance of AutoXtree.

One important observation about AutoXtree is that it gets better as more queries are run. This is the self-evolving property of this index structure. Figure 4 shows the variation of average query time with the number of queries. After a certain point the curve appears to saturate, and beyond it hardly any new supernodes are formed.

Figure 4: Variation of average query time with number of queries

This result is what we were expecting. Since we are running queries of a similar type, the index structure adapts itself to them and gives very good performance afterwards.

One question that arises is how to set the value of the threshold χ. We can use the ratio of the time of a random I/O to that of a sequential I/O to decide what a good value of χ is. We experimented with different values of χ.

Figure 5: variation of query time with threshold

When the threshold is low, many nodes are merged and the supernodes become very large. Because of a catch with the Linux file system, which is explained later, large files are not stored sequentially, so there will in any case be random I/O when reading the full file every time. There is also a chance that we waste time on a large sequential I/O rather than saving the cost of one random I/O. The optimal value of χ should be calculated from the ratio of random I/O time to sequential I/O time. One more improvement (also suggested at the end) is an implementation that is independent of the file manager.
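One way to turn the I/O ratio into a threshold is a simple break-even model. The model is an assumption of this sketch and was not part of the experiments. Suppose reading a single node costs one random I/O of time t_r plus one sequential transfer of time t_s, and reading a merged supernode costs t_r + 2·t_s no matter which half is needed. Keeping the nodes separate then costs (n(A) + n(B))(t_r + t_s) in total, while the supernode costs (n(A) + n(B) − n(A,B))(t_r + 2·t_s). Writing p = n(A,B)/(n(A) + n(B)), merging pays off when p > t_s/(t_r + 2·t_s). The function name `break_even_threshold` is hypothetical.

```python
def break_even_threshold(random_io: float, sequential_io: float) -> float:
    """Value of n(A,B)/(n(A)+n(B)) above which merging two nodes saves time,
    under the simplified cost model in the lead-in (an assumption, not a measurement)."""
    return sequential_io / (random_io + 2.0 * sequential_io)


# Threshold for several random-to-sequential cost ratios (sequential cost fixed at 1).
for ratio in (1, 10, 100):
    print(ratio, round(break_even_threshold(ratio, 1.0), 4))
# 1 0.3333
# 10 0.0833
# 100 0.0098
```

The more expensive a random I/O is relative to a sequential transfer, the lower the threshold can be set. The model ignores maintenance time and any fragmentation of large supernodes on disk, both of which would push the useful threshold higher.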

Finally, we compared the performance of AutoXtree with R-tree and X-tree on data of different dimensions. The results are shown in Table 1 and Figure 6.

Dimension   R-tree   X-tree   AutoXtree
2           23.8     26       30.6
5           29.5     32.2     34.7
7           46.3     44.1     55.8
9           55.1     51.8     61.2
10          63.0     55.5     62.1
20          123.6    80.2     111.1
30          164.0    113.8    140.2
50          412.1    145.0    240.2

Table 1: Average Timings

Figure 6: Comparison of R-tree, X-tree and AutoXtree

The relative performance of R-tree and X-tree was as expected: R-tree fails to perform well at high dimensions, and X-tree performs better than R-tree there. AutoXtree's performance is close to that of R-tree and X-tree when the dimensionality is very low, which supports our claim that AutoXtree behaves almost like an R-tree at low dimensions, because very few supernodes are formed. When the dimensionality is high, however, there is a large tree maintenance overhead, which results in poor performance compared to the X-tree. It still performs better than the R-tree, because the R-tree incurs a high number of random I/O operations, whereas the supernodes we have formed keep the number of random I/Os low.

We also performed the experiments with other data distributions, namely Gaussian and exponential, and the results were very similar to these.
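To make the comparison in Table 1 easier to read, the timings can be expressed relative to the R-tree. The short script below does this arithmetic on the values copied from the table; the variable and function names are assumptions for illustration. A ratio below 1 means the structure was faster than the R-tree at that dimension.

```python
# Average timings copied from Table 1: dimension -> (R-tree, X-tree, AutoXtree)
timings = {
    2: (23.8, 26.0, 30.6), 5: (29.5, 32.2, 34.7), 7: (46.3, 44.1, 55.8),
    9: (55.1, 51.8, 61.2), 10: (63.0, 55.5, 62.1), 20: (123.6, 80.2, 111.1),
    30: (164.0, 113.8, 140.2), 50: (412.1, 145.0, 240.2),
}


def relative_to_rtree(table):
    """Return {dimension: (X-tree / R-tree, AutoXtree / R-tree)}, rounded to 2 places."""
    return {d: (round(x / r, 2), round(a / r, 2)) for d, (r, x, a) in table.items()}


ratios = relative_to_rtree(timings)
for d, (x_ratio, a_ratio) in ratios.items():
    print(f"d={d:>2}  X-tree/R-tree={x_ratio:.2f}  AutoXtree/R-tree={a_ratio:.2f}")
```

On these numbers the AutoXtree ratio is above 1 for dimensions 2 to 9 and drops below 1 from dimension 10 upward, reaching about 0.58 at dimension 50, while the X-tree ratio at dimension 50 is about 0.35.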

4 Conclusions

When we did not find any significant improvement from this index structure, we looked for the reasons and concluded the following:

1. The Linux file system may fragment long files into smaller chunks for storage, which can harm the performance of AutoXtree. Most database system implementations are independent of the file system. A similar implementation is expected to improve the performance.

2. The maintenance part of AutoXtree is a bottleneck, as the maintenance time gets much higher for high dimensional data. Improving this part should significantly improve the performance of AutoXtree.

3. We profiled our code and analyzed which parts were consuming the most time. We found some parts that can be optimized, which might also lead to better performance of AutoXtree.

4. The useful result we observed for this index structure is that its average query time improves as queries are run.

4.1 Future Work

We will try to make a file system independent implementation and carry out the necessary optimizations in the implementation. We will also try to optimize the maintenance of the index structure after every query, for example by batch processing the updates after a batch of queries, or by trying other possibilities.

References

Berchtold, S., Keim, D. A., and Kriegel, H. P. (2001). The X-tree: An index structure for high-dimensional data. Readings in multimedia computing and networking, 451.
