

Page 1: [IEEE 2009 29th IEEE International Conference on Distributed Computing Systems (ICDCS) - Montreal, Quebec, Canada (2009.06.22-2009.06.26)] 2009 29th IEEE International Conference on

Two-Tier Air Indexing for On-Demand XML Data Broadcast

Weiwei Sun#, Ping Yu#, Yongrui Qing#, Zhuoyao Zhang#, Baihua Zheng* #School of Computer Science, Fudan University, Shanghai, China

{wwsun, yuping, yrqin, zhangzhuoyao}@fudan.edu.cn *School of Information Systems, Singapore Management University, Singapore

[email protected]

Abstract

XML data broadcast is an efficient way to disseminate semi-structured information in wireless mobile environments. Air indexing is a common method for improving the access time and reducing the energy consumption in a broadcast system. In this paper, we propose a novel two-tier air indexing method that provides an overall picture of the document set on the server, which is necessary for XML data retrieval in on-demand mode. The efficiency of our indexing method stems from two distinct advantages. First, the proposed pruning technique and the two-tier structure significantly reduce the index size. Second, the two-tier structure enables an efficient access protocol at the client, which further reduces the tuning time during index lookup. Simulation experiments show the benefits of our indexing method.

1. Introduction

With the rapid development of wireless network technologies and wireless communications, the vision of pervasive computing is getting closer to a reality. Today, there are many wireless technologies (e.g., Bluetooth, WiFi, UMTS, Satellite, etc.) that could be integrated into a seamless, pervasive information access platform. Logically, information access via these wireless technologies can be classified into two basic ways: point-to-point access and broadcast.

Point-to-point access employs a pull-based approach where a mobile client initiates a query to the server, which in turn processes the query and returns the result to the client over a point-to-point channel. It is suitable for lightly loaded systems in which wireless channels and server processing capacity are not severely contended.

Differently, wireless broadcast requires the server to proactively push data to the clients over a dedicated broadcast channel. It allows an arbitrary number of clients to access data simultaneously and thus is particularly suitable for heavily loaded systems. Recently, there has been a push for such systems from the industry and various standard bodies. For example, born out of the International Telecommunication Union’s (ITU) International Mobile Telecommunications "IMT-2000" initiative, the Third Generation Partnership Project 2 is developing Broadcast and Multicast Service in CDMA2000 Wireless IP network.

On the other hand, more information has been available as semi-structured or unstructured data over the past few years. XML holds the potential to become the de facto standard for data integration. There has been a large body of research work focusing on XML in recent years, such as XML filtering, querying and indexing [1, 3, 4, 5].

In this paper, we focus on the access of XML documents in wireless on-demand broadcast systems. To be more specific, the system consists of three parts: 1) the communication mechanism; 2) the broadcast server; and 3) the mobile clients. Fig. 1 shows a high-level overview of the system. The main purpose is to deliver XML documents according to user requests efficiently. We assume that the users submit their queries in the form of XPath queries to the server via uplink channel, and then wait for the result documents via listening to the broadcast channel. The server, after receiving user queries, identifies the result documents and calls the scheduler to arrange the broadcast program based on different scheduling algorithms.

Figure 1. A general on-demand broadcast model

2009 29th IEEE International Conference on Distributed Computing Systems

1063-6927/09 $25.00 © 2009 IEEE

DOI 10.1109/ICDCS.2009.42

In the literature, there are some existing works on indexing XML documents to support wireless broadcast [2, 10]. The main strategy is to build an index for each XML document and then broadcast the index together with XML document to facilitate the retrieval of that XML document. However, this approach is not suitable for on-demand environment:

(1) The client has to continuously monitor the broadcast channel, because existing approaches do not provide the client with any knowledge of how many documents actually satisfy his current request.

(2) Existing approaches build the index inside each document, which incurs a large document size (see footnote 1), prolongs the broadcast cycle, and hence weakens the adaptability of on-demand mode. This is because if the queries' access pattern changes during the broadcasting of a cycle, the system cannot respond to this change.

Motivated by the inefficiency of existing approaches, we propose in this paper a novel two-tier index based on the combined DataGuides [5] to facilitate the dissemination of XML documents in wireless broadcast environments. The main contributions of this work are summarized as follows.

(1) We propose a two-tier indexing method to enable energy-efficient XML documents retrieval for on-demand broadcast mode. With the pruning techniques, it can significantly reduce the index size.

(2) We propose an improved access protocol at the client for the two-tier indexing method. Our simulations show that it shortens the tuning time of the index lookup process in on-demand broadcast mode.

The remainder of the paper is organized as follows. Section 2 presents the preliminaries of the problem. Section 3 explains our two-tier indexing method as well as the improved access protocol. Section 4 details the simulation model, together with the performance evaluation results. Finally, Section 5 concludes our work.

2. Preliminaries

2.1. On-demand data broadcast

We adopt on-demand broadcast in this paper to improve the usability of the wireless channel. The reasons include:

(1) On-demand mode can adapt to the change of client’s access pattern, since the organization of each broadcast cycle is dynamically determined by the server based on the current pending queries.

(2) On-demand mode can utilize the scarce bandwidth more efficiently because the users are usually only interested in a small portion of the large XML documents, and documents not requested by users will not be broadcast in this mode.

Footnote 1: In [2], the smallest index size is close to 100% of the total data size while our index size can be reduced to 0.1%~0.5%.

The server accumulates the user requests (XPath queries) in a local queue and answers part of the pending queries in each broadcast cycle based on the scheduling strategy. Without loss of generality, we assume that the result set of an XPath query contains one or multiple XML documents, and that an XML document might satisfy multiple XPath queries. We also assume that the result set of each request is not empty and that the user aims at retrieving all the result documents.

2.2. Performance metrics of data broadcast

As mobile clients are typically powered by batteries with limited capacity, the main concerns of wireless broadcast system include access efficiency and energy consumption of mobile clients. Correspondingly, the access time and tuning time are employed as the major performance metrics:

• Access Time: The time elapsed from the moment a query is issued to the moment it is answered.

• Tuning Time: The time a mobile client stays in active mode to receive the requested XML documents and index information.

2.3. Air indexing

Air indexing techniques are often used for conserving the energy of mobile clients [6, 9, 12]. The basic idea is that the broadcast server pre-computes indexing information (including searchable attributes and delivery time of data objects) and interleaves it with data objects (e.g., XML documents) on the broadcast channel. By examining the air index, mobile clients learn the arrival time of the desired data objects. Until the objects arrive, the clients can switch to doze mode, waking up to active mode only when the objects are about to be broadcast. Similarly, the delivery time of the next index is appended to each data object, which helps the clients schedule their sleep time for subsequent data access. Consequently, the search for data objects is facilitated.

For XML data broadcast in on-demand mode, an index is essential because without any auxiliary information about the query result, a client is forced to exhaustively listen to the wireless channel for the requested documents. For instance, to retrieve all the documents satisfying /a/b/c, a client has to download the whole document set and do the query processing locally. This clearly consumes a lot of energy since the client has to stay in active mode all the time.

With the help of air indexing, the client can access the data objects in three steps:

• Initial probe: the client tunes into the channel to find out the broadcast time of the next index;

• Index search: the client tunes into the channel to retrieve the index and locate the result documents (i.e., find out the broadcast time of the result documents);

• Document retrieval: the client tunes into the channel when the result documents are to be broadcast to download them.

3. The two-tier indexing method

As pointed out in Section 1, existing approaches cannot provide efficient access of XML documents in on-demand wireless broadcast systems. In order to tackle this issue, we propose a novel two-tier indexing method which is based on the combined DataGuides [5]. In this section, we first introduce the concept of combined DataGuides which captures the path information of all the documents. Then, we define the pruning techniques to significantly reduce the index size in an on-demand environment, and present a novel two-tier structure that can further reduce the index size and bring a better tuning time.

3.1. The basic index structure

A DataGuide [5] is a concise, accurate, and convenient summary of the structure of an XML document. It describes every unique label path of a document exactly once. A DataGuide can be considered a signature of the XML document. Unlike conventional signature indexes [12], a DataGuide is accurate: it can be used to represent all the structural information of the original XML data. Fig. 2 gives an example of the data in the server. It contains 5 documents, i.e., d1, d2, d3, d4, and d5, with their detailed document structure. We assume that six user queries are submitted, i.e., q1, q2, q3, q4, q5, and q6, as shown in Fig. 2(b).

Fig. 3(a) plots the corresponding DataGuides for the sample XML documents defined in Fig. 2(a). Take document d1 as an example: each unique path of the original XML document (paths with the same label sequence from the document root count as one) is recorded only once within its DataGuide, which is therefore concise and accurate for single path queries.

As each DataGuide is built for one single XML document, we adopt RoxSum [11] to integrate the DataGuides of different XML documents into one structure to build a Compact Index (CI). Fig. 3(b) shows the integrated version of all the DataGuides related to the sample XML documents. Compared with the approach using an individual index for each document, this compact version takes advantage of the path sharing among a set of XML documents, and thus significantly reduces the index size.

(a) Sample document set

Q             Matched documents ID list
q1: /a/b/a    d1, d2
q2: /a/c/a    d4, d5
q3: /a//c     d2, d3, d4, d5
q4: /a/b      d1, d2, d3, d5
q5: /a/c/*    d2, d4, d5
q6: /a/c/a    d4, d5

(b) Sample user query set

Figure 2. A running example

There are three kinds of nodes in CI, i.e., root nodes, internal nodes, and leaf nodes. The corresponding node structure is depicted in Fig. 3(c), which can be decomposed into three blocks. The first block contains a flag to indicate the type of the corresponding index node. For example, a leaf node has flag = 1, an internal node has flag = 0, and a root node has flag set to the real index value. The second block is <entry, pointer>, carrying the child node information. The last block is made up of <doc, pointer> tuples for real XML documents. A node might not necessarily have all three blocks. For example, a leaf node only has flag and <doc, pointer> tuples, while an internal node usually only has flag and <entry, pointer> tuples. Note that, an internal node of CI may also have <doc, pointer> tuples besides <entry, pointer> tuples, e.g., n3 in Fig.3(c).

Back to our previous example defined in Fig. 2. First, we locate the result documents for each query defined in Fig. 2(b), and build one DataGuide to serve all six queries, as shown in Fig. 3(b). The CI consists of nine nodes, denoted as n1…n9 in Fig. 3(b). Take query q1 as an example. It asks for documents that satisfy the XPath query /a/b/a. Suppose that the client knows the start of the index and tunes into the channel when n1 is to be broadcast. It first retrieves n1. According to the next entry information, it follows the pointers to access node n2 and then n4. As node n4 is a leaf node, the result documents are located, i.e., d1 and d2. The client can then switch to doze mode for energy saving and tune into the channel only when d1 or d2 is to be broadcast. The query is satisfied when all the result documents are downloaded.
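The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `CINode`, the functions `build_ci`, `collect_docs`, and `lookup`, and the toy data in `DOC_PATHS` are hypothetical and do not reproduce Fig. 2 or Fig. 3. The sketch also assumes that a document is attached to the node where one of its root-to-leaf label paths ends, omits the flag block and the broadcast pointers, and handles child-axis queries only.

```python
from dataclasses import dataclass, field


@dataclass
class CINode:
    label: str
    children: dict = field(default_factory=dict)  # label -> child node (<entry, pointer> block)
    docs: list = field(default_factory=list)      # document IDs (<doc, pointer> block)


def build_ci(doc_paths):
    """Merge the root-to-leaf label paths of every document into one shared trie."""
    root = CINode(label="")
    for doc_id in sorted(doc_paths):
        for path in sorted(doc_paths[doc_id]):
            node = root
            for label in path.strip("/").split("/"):
                node = node.children.setdefault(label, CINode(label))
            node.docs.append(doc_id)  # assumption: attach the document where its path ends
    return root


def collect_docs(node):
    """Gather the document IDs stored in a node and in all of its descendants."""
    found = set(node.docs)
    for child in node.children.values():
        found |= collect_docs(child)
    return found


def lookup(root, query):
    """Follow a child-axis-only path query and return the matching document IDs."""
    node = root
    for label in query.strip("/").split("/"):
        node = node.children.get(label)
        if node is None:
            return []
    return sorted(collect_docs(node))


# Hypothetical toy data: document ID -> root-to-leaf label paths.
DOC_PATHS = {
    "d1": {"/a/b/a"},
    "d2": {"/a/b/a", "/a/b/c", "/a/c/b"},
    "d4": {"/a/c/a"},
}

ci = build_ci(DOC_PATHS)
print(lookup(ci, "/a/b/a"))  # documents reachable through the leaf /a/b/a
print(lookup(ci, "/a/c"))    # an internal node: documents are gathered from its subtree
```

Because the three toy documents share the prefix /a, the trie stores that label once, which is the path sharing that makes the combined index smaller than per-document indexes.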

(a) DataGuides for sample documents

(b) CI for sample documents

(c) Node structure

Figure 3. An example of CI structure

Fig. 4 shows the index organization on air. In practice, however, the broadcast server partitions the data into fixed-size packets (e.g., 128 bytes per packet), and all data retrieval is in units of packets. As each client only needs to access the portion of the CI that matches his queries, we try to pack adjacent nodes together in order to reduce the number of index packets a client has to access (i.e., to minimize the tuning time). To this end, CI adopts a greedy packing algorithm. The nodes are sorted in depth-first order and maintained in a stack. The head node is always popped first and put into the current packet. When the current packet does not have enough free space to accommodate the next node, a new packet is initialized.

Back to our example. According to the depth-first order, the nine nodes will be accessed in the order n1, n2, n4, n5, n6, n3, n7, n8, and n9. Assume the size of each node in bytes is as listed in Fig. 5, with 2 bytes assigned to flag and 8 bytes assigned to each of entry, pointer, and doc. If each packet has 128 bytes, the index nodes will be packed into four packets, as shown in Fig. 5. Rather than downloading the entire index, clients only need to access packet P1 to answer q1, and only packets P1 and P2 to answer query q4.
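The greedy packing step can be illustrated with a minimal Python sketch. The function name `pack_nodes` is hypothetical, and the node sizes in `NODE_SIZES` are invented for illustration (2 bytes of flag plus 16 bytes per tuple, with made-up tuple counts); they are not the sizes listed in Fig. 5.

```python
def pack_nodes(node_sizes, packet_size=128):
    """Greedily pack nodes, given in depth-first order, into fixed-size packets.

    node_sizes is a list of (node_name, size_in_bytes) pairs. A new packet is
    opened whenever the next node does not fit into the current one.
    """
    packets = [[]]
    free = packet_size
    for name, size in node_sizes:
        if size > packet_size:
            raise ValueError(f"node {name} does not fit into a single packet")
        if size > free:
            packets.append([])  # current packet is too full: start a new one
            free = packet_size
        packets[-1].append(name)
        free -= size
    return packets


# Hypothetical sizes in bytes, listed in the depth-first order from the text.
NODE_SIZES = [
    ("n1", 34), ("n2", 50), ("n4", 34), ("n5", 50), ("n6", 34),
    ("n3", 66), ("n7", 34), ("n8", 50), ("n9", 34),
]

packets = pack_nodes(NODE_SIZES)
for number, packet in enumerate(packets, start=1):
    print(f"P{number}: {packet}")
```

Keeping the depth-first order means a node and its first descendants tend to share a packet, so a client following one path usually touches few packets.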

Figure 4. An example of one-tier CI

Figure 5. Partitioning of CI

3.2. Index pruning

It is noticed that original CI is built based on the entire document collection. However, in on-demand broadcast mode, the requested document set is much smaller than the original collection and many index nodes are actually not necessary. Motivated by this observation, we propose a pruning strategy to eliminate all the dead nodes that are not accessed by any query.

Instead of considering all the documents, we post-process the CI with the steps listed below. First, a DFA is built based on the set of queries Q pending at the server side. Second, each node in CI is checked to see whether it matches any query in Q. If a match is found, the node is marked. Finally, a new pruned CI, named PCI in this paper, is built using only the marked nodes.

For example, assume the query set Q pending at the server side is {/a/b, /a/b/c}. The original CI (shown in Fig. 3(b)) will be simplified, as shown in Fig. 6. It is noticed that only nodes n1, n2, and n5 match the queries, and the remaining nodes are discarded.
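A rough Python sketch of the marking step follows. It swaps the DFA for compiled regular expressions, which is a simplification and not the paper's implementation. The names `query_to_regex` and `prune`, and the node paths in `NODE_PATHS`, are hypothetical. The sketch assumes that a node is kept when its label path matches a pending query or is an ancestor of a matching node; how document pointers of discarded nodes are handled is outside its scope.

```python
import re


def query_to_regex(query):
    """Translate a simple XPath query (/, //, *) into a regex over label paths."""
    pattern = ""
    for axis, label in re.findall(r"(//|/)([^/]+)", query):
        step = "[^/]+" if label == "*" else re.escape(label)
        if axis == "//":
            pattern += "(?:/[^/]+)*/" + step  # descendant axis: skip any number of levels
        else:
            pattern += "/" + step             # child axis: exactly one level
    return re.compile(pattern)


def prune(node_paths, queries):
    """Keep the nodes matched by some query, plus their ancestors."""
    patterns = [query_to_regex(q) for q in queries]
    keep = set()
    for path in node_paths:
        if any(rx.fullmatch(path) for rx in patterns):
            labels = path.strip("/").split("/")
            for depth in range(1, len(labels) + 1):
                keep.add("/" + "/".join(labels[:depth]))
    return [path for path in node_paths if path in keep]


# Hypothetical label paths of the nodes of a small CI, in depth-first order.
NODE_PATHS = [
    "/a", "/a/b", "/a/b/a", "/a/b/c", "/a/b/d",
    "/a/c", "/a/c/a", "/a/c/b", "/a/c/d",
]

print(prune(NODE_PATHS, ["/a/b", "/a/b/c"]))
print(prune(NODE_PATHS, ["/a//c"]))
```

With the query set {/a/b, /a/b/c}, three of the nine hypothetical nodes survive, which mirrors the shape of the example in the text.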

Figure 6. An example of PCI

Similarly, the broadcast of the XML documents is triggered by the user requests. If a document is never requested, it will not be broadcast. Furthermore, pruning is transparent to clients, that is, there is no change in the client access protocol.

3.3. The two-tier index structure

As mentioned in Section 2.1, an XML document might satisfy multiple queries and hence be included in multiple result document sets. Take document d2 as an example. It matches the XPaths /a/b/a, /a/b/c, and /a/c/b, and hence its pointer appears three times in the CI index (see Fig. 3). Given a large number of queries, some documents might be requested multiple times, and hence their pointers will be duplicated multiple times. In order to reduce the duplication, we propose a two-tier index structure which removes the pointers to the result documents from the current index structure and packs them into a second-tier document pointer list.

In other words, the client needs to listen to both tiers of the index in order to locate the result documents. Take a client who issues q1 as an example. First, it retrieves the first-tier index to find the IDs of all the result documents, i.e., d1 and d2. Then it listens to the second tier to find out when the result documents will be broadcast, i.e., their pointers p1 and p2. Fig. 7 depicts the new placement of the PCI index of the running example, whose one-tier structure is shown in Fig. 4 and Fig. 5.

Figure 7. Two-tier structure

The two-tier structure can also help to reduce the index size. From the point of view of relational database normalization theory, the two-tier structure eliminates the redundant information that exists in the one-tier structure, and it is the normalized form of the one-tier structure. To be more specific, the one-tier structure corresponds to the schema S1(node, doc.id, doc.offset), and the two-tier structure corresponds to the schemas S2_1(node, doc.id) and S2_2(doc.id, doc.offset). S1 is in 1NF, while the other two are in BCNF, which eliminates the redundancy caused by functional dependency in 1NF [12].
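The effect of this decomposition on size can be checked with a small Python calculation. It is a sketch under assumptions: it counts only document IDs and pointers, uses the 2-byte ID and 4-byte pointer sizes from the experimental setup in Section 4.1, and the mapping `NODE_DOCS` and the function names are hypothetical.

```python
ID_BYTES = 2       # bytes per document ID (value taken from Section 4.1)
POINTER_BYTES = 4  # bytes per pointer (value taken from Section 4.1)


def one_tier_bytes(node_docs):
    """S1(node, doc.id, doc.offset): every <doc, pointer> tuple carries its own pointer."""
    tuples = sum(len(docs) for docs in node_docs.values())
    return tuples * (ID_BYTES + POINTER_BYTES)


def two_tier_bytes(node_docs):
    """S2_1(node, doc.id) keeps IDs only; S2_2(doc.id, doc.offset) stores each pointer once."""
    tuples = sum(len(docs) for docs in node_docs.values())
    distinct = {doc for docs in node_docs.values() for doc in docs}
    first_tier = tuples * ID_BYTES
    second_tier = len(distinct) * (ID_BYTES + POINTER_BYTES)
    return first_tier, second_tier


# Hypothetical index nodes and the documents attached to them.
NODE_DOCS = {
    "/a/b/a": ["d1", "d2"],
    "/a/b/c": ["d2", "d3", "d5"],
    "/a/c/b": ["d2", "d5"],
    "/a/c/a": ["d5"],
}

print(one_tier_bytes(NODE_DOCS))
print(two_tier_bytes(NODE_DOCS))
```

Under these byte sizes the two-tier layout is smaller only when a document is referenced more than 1.5 times on average (4 bytes saved per reference against 6 bytes spent per distinct document), so the saving grows with the amount of duplication.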

3.4. Access protocol at the client

Once the index is built, the next issue is how to interleave the index with the XML documents in the wireless broadcast program. A simple approach is to broadcast the two-tier index before delivering the XML documents. In our approach, a complete index (PCI) is broadcast as the first-tier index in the head of each broadcast cycle, followed by the second-tier index (offset list) including the offsets of the documents broadcast in the current cycle. It is noted that the first-tier index is built based on the entire XML document set which enables the clients to locate the result documents via lookups inside one index. Fig. 8 depicts a general broadcast program organization. Notations LI, LO, LD are described in Table 1.

Table 1. Notations for two-tier index

Notation    Description
LI          the size of the first-tier index
LO          the size of the second-tier index
LD          the size of the documents broadcast in a cycle

Since the first-tier index is built upon the entire document collection, the clients can record all the result document IDs once they access this index. This ID information, together with the offset information carried in the second-tier index, is sufficient for the client to locate the result documents. As a result, if the request cannot be satisfied within the current cycle (i.e., not all the result documents are included in the current broadcast cycle), the client only needs to download the second-tier index in the following cycles to find the arrival time of the remaining documents and download them, which significantly reduces the tuning time. Based on this observation, we design an improved access protocol for clients which can significantly reduce the tuning time during index lookup. The client retrieves the result documents in the following steps:

• Initial probe: the client sends requests to the server and tunes into the channel to find out the broadcast time of the next index;

• First-tier index search: the client tunes into the channel to retrieve the first-tier index and record the result document IDs;

• Second-tier index search: the client tunes into the channel to retrieve the second-tier index and locate the result documents;

• Document retrieval: the client tunes into the channel when the result documents are to be broadcast to download them.

It should be noted that the second-tier index search and document retrieval steps repeat when the request cannot be satisfied within one broadcast cycle.

Figure 8. Broadcast program organization

To be more specific, under this access protocol, each client only needs to listen to one first-tier index and n second-tier indexes, where n is the number of cycles needed to broadcast all the result XML documents of the query issued by the client. Consequently, the tuning time under this two-tier index distribution strategy, denoted as TTCI, is given by Equation (1).

$$TT_{CI} = L_I + n \cdot L_O + \text{(the required documents' download time)} \tag{1}$$
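The access protocol and Equation (1) can be traced with a small Python simulation. It is a sketch under assumptions: the function `simulate_client` and all numbers are hypothetical, every document is given the same size, and the client is assumed to tune in just before the first cycle starts, so that it reads the first-tier index exactly once.

```python
def simulate_client(result_ids, cycles, l_i, l_o, doc_size):
    """Count the bytes a client listens to under the two-tier access protocol.

    cycles is a list of sets, each holding the document IDs broadcast in one cycle.
    Returns (n, tuning_bytes, still_pending).
    """
    pending = set(result_ids)
    tuning = l_i               # first-tier index: read once to record the result IDs
    n = 0
    for broadcast in cycles:
        if not pending:
            break
        n += 1
        tuning += l_o          # second-tier index: read in every cycle until done
        hits = pending & broadcast
        tuning += doc_size * len(hits)  # wake up only for the result documents
        pending -= hits
    return n, tuning, pending


# Hypothetical scenario: two result documents spread over three cycles.
L_I, L_O, DOC_SIZE = 2000, 600, 10000
n, tuning, pending = simulate_client(
    result_ids={"d1", "d2"},
    cycles=[{"d2", "d7"}, {"d3"}, {"d1", "d4"}],
    l_i=L_I, l_o=L_O, doc_size=DOC_SIZE,
)
print(n, tuning, pending)
```

In this scenario the simulated total equals L_I + n * L_O plus the download time of the two result documents, which is the form of Equation (1).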

4. Experiments and evaluation

As described in Section 1, the existing approaches cannot be applied to on-demand XML data broadcast because they do not provide an overall picture of the document set available on the wireless channel; with them, mobile clients would have to continuously monitor the broadcast channel. Therefore, we only study the performance of our approaches in this section.

The evaluation consists of two parts. In the first part, we evaluate the effectiveness of our indexing method, including the pruning techniques and the two-tier structure. In the second part, we present the effectiveness of our access protocol for the two-tier structure.

4.1. Environment setup

Similar to existing works [1, 3], we use simple XPath queries in our experiments. A simple query can be expressed as a path expression of the format

P = /N | //N | PP, N = E | *

Here E is the element label, / denotes the child axis, // denotes the descendant axis, and * is the wildcard.

In our simulation, we use one synthetic data set, News Industry Text Format (NITF) DTD and XML Documents, generated using IBM's Generator tool [4]. We implement the modified version of the generator [3] to generate synthetic XPath queries without predicates. YFilter [3] is used here to filter the documents and generate the matched documents ID list for each query, although other XML stream filtering techniques are also applicable. In total, 1000 documents are generated to serve as the document set. In addition, we evaluate the performance of our approaches using another document set (NASA). As the findings are pretty much the same, we omit the results of that document set to save space.

We adopt the algorithm proposed in [8] as the underlying scheduling algorithm. The average length of a broadcast cycle is 1000KB. We allocate 2 bytes to represent the ID of an XML document and 4 bytes to represent a pointer. The size of the XML documents varies, with 10KB as the average size. Table 2 summarizes the system parameters.

Table 2. Experimental setup

Variable    Description                                                                                   Default value
NQ          The number of queries submitted to the server during the broadcasting period of each cycle    500
P           The probability of wildcard * and double slash // in queries                                  0.1
DQ          Maximum depth of queries                                                                      10

It is noted that for a given scheduling algorithm, the broadcast of XML documents is independent of the index structure. In other words, no matter which kind of index is adopted, the time used to retrieve the documents is constant. Consequently, in the following simulations, we only present the tuning time incurred during index lookup without considering the document retrieval. In addition, we assume the bandwidth is constant, and use the number of bytes retrieved/broadcast to represent the tuning time performance. Consequently, the unit of tuning time in the simulation results is the byte.

4.2. Evaluation of indexing method

We have proposed two techniques to improve the proposed index structure, i.e., index pruning that prunes away non-requested documents and two-tier structure that removes the duplicated pointers to the XML documents. In our first set of experiments, we evaluate the effectiveness of these two techniques.

1) Effectiveness of index pruning

Fig. 9(a) shows the index size with different NQ. Compared with the original XML document set, CI is only about 1.5% of the size. Besides, the pruning technique can reduce the size further, and the size of PCI on average is about 9% of that under CI. The size of CI remains constant in the experiment because CI is built on the document set, which is independent of the query number. On the other hand, the size of PCI increases slightly as NQ grows. That is because the effectiveness of the pruning process deteriorates as the number of queries increases. Specifically, the more pending queries remain in the server, the more nodes of the CI are matched, which leaves less room for pruning.

Figure 9. Effect of index pruning: index size of CI and PCI under different (a) NQ, (b) P, and (c) DQ

Fig. 9(b) presents the index size under different P. We find that the size of CI remains unchanged as P increases because it is mainly determined by the document set and hence is independent of the queries. On the other hand, the size of PCI grows with P because with a higher probability of * and // in the query set, more documents will be included in the result set, which reduces the efficiency of the index pruning.

Fig. 9(c) depicts the performance with different DQ. When DQ increases, the index sizes of both CI and PCI shrink. The reason is that a larger DQ implies a smaller query selectivity, which in turn results in a smaller result set for each query. On average, PCI can save at least 30% of CI’s size in most, if not all, of the cases.

2) Effectiveness of two-tier structure on index size

In this part, we evaluate the savings in terms of index size brought by the two-tier structure in Fig. 10. It is observed that the two-tier structure significantly reduces the index size in our simulation environments, which implies a better access time and tuning time.

A small index means a short broadcast cycle and a short access time. The above experimental results confirm the effectiveness of both the pruning technique and the two-tier structure. The final index size can be reduced to 0.1%~0.5% of the data size.

Figure 10. Comparison of index size (one-tier index vs. two-tier index; y-axis: size)

3) Effectiveness of two-tier structure on tuning time

In this part, we evaluate the savings in tuning time brought by the two-tier indexing with the improved access protocol at the client. Since the tuning time for document retrieval is independent of the indexing approach, we discuss only the tuning time spent on index look-up. We adopt the scheduling method proposed in [8] to generate the broadcast program. On average, each client has to listen to 11.8 broadcast cycles to complete one query. The tuning time of the one-tier and two-tier schemes under different NQ, P, and DQ is depicted in Figure 11.
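The index look-up cost compared in this part can be written as a small model. This is only a sketch under stated assumptions: a one-tier client is assumed to read the full index in every broadcast cycle it listens to, whereas a two-tier client reads the first-tier index once and only the second-tier index in each cycle. The function names and the index sizes below are hypothetical and are not measurements from the experiments; only the figure of 11.8 cycles comes from the text.

```python
def tuning_one_tier(cycles, index_size):
    """One-tier (assumed): the whole index is read in every cycle listened to."""
    return cycles * index_size


def tuning_two_tier(cycles, first_tier, second_tier):
    """Two-tier (assumed): first tier read once, second tier read in each cycle."""
    return first_tier + cycles * second_tier


# Hypothetical sizes in arbitrary units; 11.8 is the average number of
# broadcast cycles a client listens to, as reported in the text.
cycles = 11.8
print(tuning_one_tier(cycles, index_size=100))
print(tuning_two_tier(cycles, first_tier=100, second_tier=5))
```

In this model only the second-tier term grows with the number of cycles, so the gap between the two schemes widens the longer a client has to listen.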

Our first observation is that the two-tier scheme significantly outperforms the one-tier scheme. The main reason is that, with the two-tier scheme, clients need to access the first-tier index only once, which reduces the tuning time for index retrieval. The second observation is that the parameters NQ, P, and DQ have a less significant impact on the two-tier scheme, which is much more stable than the one-tier scheme. The reason is that clients download the first-tier index once under the two-tier scheme and then listen only to the second-tier index. As the broadcast cycle is fixed, the number of documents included in each cycle is fixed, and hence the second-tier index corresponding to each cycle is constant, which explains the relatively stable performance of the two-tier scheme.

Figure 11. Tuning time under one-tier vs. two-tier structures. Panels: (a) varying NQ, (b) varying P, (c) varying DQ; y-axis: tuning time; series: one-tier, two-tier.

5. Conclusions

In this paper, we present a wireless on-demand broadcast system that supports XPath queries and provides XML document retrieval. A two-tier indexing method is proposed. It is based on DataGuides, which are easy to implement. With the pruning technique, the index size can be reduced to a large extent. The two-tier index structure not only further reduces the index size but also achieves a significantly better tuning time. We also propose an improved access protocol at the client, which is particularly advantageous in an on-demand environment with the two-tier index structure. All our approaches demonstrate effectiveness and efficiency in our simulations. In the near future, we plan to study the impact of user query patterns on system performance and to develop a prototype that demonstrates the practical power of the proposed two-tier indexing method.

Acknowledgement

This research is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 60503035, and the National High-Tech Research and Development Plan of China under Grant 2006AA01Z234.

References

[1] K. Candan, W. Hsiung, S. Chen, J. Tatemura, D. Agrawal. AFilter: adaptable XML filtering with prefix-caching suffix-clustering. In VLDB 2006.

[2] Y. Chung, J. Lee. An indexing method for wireless broadcast XML data. Information Sciences. 177(9): 1931–1953 (2007).

[3] Y. Diao, M. Altinel, M. Franklin, H. Zhang, P. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. ACM Transactions on Database Systems. 28(4): 467–516 (2003).

[4] A. Diaz, D. Lovell. XML Generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.

[5] R. Goldman, J. Widom. DataGuides: enabling query formulation and optimization in semistructured databases. In VLDB 1997.

[6] T. Imielinski, S. Viswanathan, B. R. Badrinath. Data on Air: Organization and Access. IEEE Transactions on Knowledge and Data Engineering. 9(3): 353–372 (1997).

[8] G. Lee, S. Lo. Broadcast Data Allocation for Efficient Access of Multiple Data Items in Mobile Environments. ACM/Kluwer Journal of Mobile Networks and Applications. 8(4): 365–375 (2003).

[9] W.-C. Lee, B. Zheng. DSI: A Fully Distributed Spatial Index for Wireless Data Broadcast. In ICDE 2005.

[10] S. Park, J. Choi, S. Lee. An Effective, Efficient XML Data Broadcasting Method in a Mobile Wireless Network. In DEXA 2006.

[11] A. Silberschatz, H. F. Korth, S. Sudarshan. Database System Concepts (5th Edition), McGraw-Hill 2006.

[12] J. Xu, D. L. Lee, Q. Hu, W.-C. Lee. Data Broadcast. Handbook of Wireless Networks and Mobile Computing, Chapter 11, Ivan Stojmenovic, Ed., New York: John Wiley & Sons, ISBN 0-471-41902-8, Jan. 2002, pp. 243–265.

[13] Z. Vagena, M. Moro, V. Tsotras. RoXSum: Leveraging Data Aggregation and Batch Processing for XML Routing. In ICDE 2007.
