[ieee 2011 international conference on parallel processing (icpp) - taipei, taiwan...

P2P Object Tracking in the Internet of ThingsYanbo Wu, Quan Z. Sheng, Damith Ranasinghe

School of Computer ScienceThe University of AdelaideAdelaide, 5005, Australia

{yanbo, qsheng, damith}@cs.adelaide.edu.au

Abstract—With recent advances in technologies such as radio-frequency identification (RFID) and new standards such asthe electronic product code (EPC), large-scale traceability isemerging as a key differentiator in a wide range of enterpriseapplications (e.g., counterfeit prevention, product recalls, andpilferage reduction). Such traceability applications often needto access data collected by individual enterprises in a distributedenvironment. Traditional centralized approaches (e.g., data ware-housing) are not feasible for these applications due to their uniquecharacteristics such as large volume of data and sovereignty ofthe participants. In this paper, we describe an approach thatenables applications to share traceability data across independententerprises in a pure Peer-to-Peer (P2P) fashion. Data are storedin local repositories of participants and indexed in the networkbased on structured P2P overlays. In particular, we present ageneric approach for efficiently indexing and locating individualobjects in large, distributed traceable networks, most notably, inthe emerging environment of the Internet of Things. The resultsfrom extensive experiments show that our approach scales wellin both data volume and network size.

Keywords: Internet of Things, traceable networks, RFID, Peer-to-Peer systems, scalability.

I. INTRODUCTION

Recent advances in RFID (radio-frequency identification)technologies, wireless sensors, and Web services have ledto the emergence of the “Internet of Things”, a global net-work where everyday objects such as buildings, sidewalks,and commodities are identifiable, readable, addressable, andeven controllable via the Internet [10], [22], [32]. Such anetwork can provide real-time information on the movementsof objects. It is essentially a “data-on-network” system, whereRFID tags contain an unambiguous ID and other data per-taining to the objects can be stored and accessed over theInternet. This makes automatic tracking and tracing possiblein large-scale applications (e.g., nation-wide supply chainmanagement across companies including product recalls andanti-counterfeiting). Traceability applications analyze automat-ically recorded identification events to discover the currentlocation of an individual item. They can also retrieve historicalinformation, such as previous locations, transportation timebetween locations, and time spent in storage.

Central to realizing these traceability applications is themanagement of data, particularly on how to share the data thatare typically collected by individual enterprises. An obvioussolution is to publish all data collected within each organiza-tion to a central data warehouse. Unfortunately, this approachhas several severe drawbacks. Firstly, object movement and

related data are valuable information that companies may bevery reluctant to put their data in a shared central warehouse.Secondly, such an approach has very limited scalability andis not feasible for the environment of the Internet of Thingswhere the amount of data collected could be enormous (e.g.,from hundreds of thousands of RFID-tagged objects in asupply chain) [8], [23], [30].

An alternative would be a peer data management system(PDMS) that allows data to be stored in local repositoriesat each organization and makes those repositories accessiblein a Peer-to-Peer (P2P) fashion [12], [33]. P2P computingis gaining a significant momentum since it exploits the dis-tributed nature of the Internet. A recent representative effortin this direction is IBM’s Theseos [6] that enables traceabilityapplications to process complex queries across organizationsin a completely distributed setting. Theseos relies on a noveltraceability data model that eliminates any data dependenciesbetween organizations, which serves as a global schema thatallows the formulation of a query without knowledge on howthe data is stored, where it is located, and how a tracking queryis executed [1]. To improve the performance of traceabilityquery processing, each organization is required to maintainthe information on the movement path of an object. Withthis information, it is possible to minimize the number ofnodes to be visited without flooding queries to all nodes inthe network. Unfortunately, to obtain this information, Theseosrequires high synchronization with other enterprise data (e.g.,billing or accounting information). This is impractical formany applications where such enterprise data may not beconveniently available.

In this paper, we propose a generic approach for efficienttracing and tracking of objects in large-scale, distributedenvironments such as the Internet of Things. In particular, byanalyzing a wide range of traceability applications, we abstractthe concept of traceable networks and a new model for movingobjects in discrete spaces. Our approach is built on top of theDHT (Distributed Hash Table) based overlay network [2]. Weindex an object and its latest state at a deterministic node calledgateway node. Every time the object moves from a source nodeto a destination node, the gateway node updates object status atthe source and the destination nodes of its movement, therebyestablishing the information on movement path of the object.Since gateway nodes are randomly chosen in an anonymousway, the sovereignty of the participating organizations is notaffected. To reduce the indexing overhead from large volume

2011 International Conference on Parallel Processing

0190-3918/11 $26.00 © 2011 IEEE

DOI 10.1109/ICPP.2011.14

502

data in traceability applications, we further propose an en-hanced group-based indexing mechanism. Instead of indexingeach object individually, we group the objects that arrive at anode within a timeframe by the prefix of their hashed ids andindex these groups. We propose algorithms to determine theoptimal prefix length by considering both indexing cost andload balancing. Extensive experiments have been conducted toshow the viability and scalability of our proposed techniques.

The remainder of the paper is organized as follows. Sec-tion II briefly introduces traceable networks and a movingobject model. Section III describes a P2P approach for efficientprocessing of traceability queries in a DHT-based overlaynetwork. Section IV presents an enhanced indexing approachdealing with massive volumes of data in large-scale traceabil-ity applications. Section V reports the results of the evaluationof the proposed techniques. Finally, Section VI gives anoverview of the related work and Section VII provides someconcluding remarks.

II. PRELIMINARIES

In this section, we will briefly introduce traceable networksand a new moving object model.

A. Traceable Networks

Figure 1 gives a high level description of traceable networks.There are three main entities in this network: node, receptor,and object. Each node governs a number of receptors (e.g.,RFID readers) that are used to capture the objects. Receptorsare deployed at fixed locations (e.g., entry of a warehouse).Nodes are logical and abstracted to represent partners in thenetwork. For example, in a supply chain network, a node maybe a distribution center or a retail store, and the receptors areRFID readers deployed at the entrances and exits. Objects aregoods attached with RFID tags.

Physically, objects move along a certain trajectory (e.g., apallet moves in a supply chain). We call this the physicalobject flow. After the objects are captured, their movementsare digitalized and the captured information is called theinformation flow. Receptors are the medium to transform thephysical object flow to digital information flow. Nodes storethe information flow segments which occur inside their ownterritory. Nodes are connected (e.g., via the Internet) so thatthey can exchange information with other nodes in order toget the whole picture of an object’s trajectory.

It should be noted that we assume in this paper that thedata captured by receptors (e.g., RFID readings) is alreadycleansed. Sensor data cleansing is an important research topic,which has attracted lots of efforts from researchers in last fewyears. Interested readers are referred to [5], [11], [13].

B. A Model for Moving Objects in Discrete Space

In existing models for moving objects, both time andspace domains are continuous. This is necessary because thequery result domain is also continuous. However, in manytraceability applications (e.g., RFID-enabled supply chains),objects are only captured nodally at fixed locations. The

NODES

RECEPTORS

OBJECTS

Internet

Past Present Future

��

��

��

��

��

��

Fig. 1: A High Level Description of Traceable Networks

queries refer to a finite set of fixed locations instead of apoint or region in an infinite space. The traditional continuousmodels can be adopted in these applications. However, they areunnecessarily heavy because of the definition and maintenanceof various spatial elements. We therefore define a new modelthat can represent moving objects and their attributes in aneconomic way, namely MOODS (a Model for mOving Objectsin Discrete Space).

L(o, t) : O× T �→ N (1)

Equation 1 shows the signature of the model. Given anobject o in the object set O and a point of time t in the timedomain T, the locating function L derives the location wherethe object o was/is/will be at t. If o is not in the system, theresult of L is nil (indicating “nowhere”).

In our model, the time domain T is continuous becausereceptors are working continuously to capture moving ob-jects. The time-parameterized queries may take any momentas an input. However, as we mentioned above, the spacedomain is discrete. Instead of a continuous infinite domain,MOODS defines the space domain as a finite set of nodesN = {n1, n2, ..., nm}. This set is known to the system becausethey are the locations where the receptors are deployed. Theset is dynamic since new nodes can join and existing nodesmay leave the network. The object domain is the same as theexisting models, i.e. O = {o1, o2, ..., on}.

The function L essentially locates an object. Another im-portant function is to find the trace of an object. Similar to L,we define the trace function TR as:

TR(o, tstart, tend) : O× T× T �→ P (2)

P : {(nk1, nk2

, ..., nkp, . . . , nkl

)|nkp∈ N, 0 ≤ l ≤ m} (3)

Given an object o and a time range (tstart to tend), TR findsthe trace of o during that timeframe. P denotes the domain ofpath. A path is a sorted list of nodes (can be empty) in N. Thesorting is done by the order of the nodes visited by object o(i.e., by time).

503

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

Fig. 2: Example of IOP Acquisition

C. Information of Object Path

The work presented in this paper is part of PeerTrack(http://www.cs.adelaide.edu.au/peertrack/), a research projectthat aims at developing novel techniques enabling applicationsto share traceability data efficiently and effectively acrossindependent enterprises in large-scale traceable networks.

In PeerTrack, traceability applications can share data acrossindependent organizations in a P2P fashion. This is realizedby a novel data model that introduces properties to indicatethe departure and arriving information of objects. We call theproperties the information of object path (IOP) [24]. WithIOP, each node maintains segments of objects’ moving pathsand uses this information to expedite P2P queries in thenetwork. This paper focuses on developing generic approachesfor acquiring such information in traceable networks.

III. A P2P APPROACH FOR TRACEABLE NETWORKS

In this section, we will introduce a P2P approach that canprocess traceability queries in an effective and efficient way.

The main idea of our approach is to index an object andits latest location in the DHT network, while keeping the IOPinformation at the nodes where the object has been observed.Instead of a centralized index server, we store the index for anobject at the node which is the result of a DHT lookup for theobject’s id. IOP contains information about the source (i.e.,where the object comes from) and destination (i.e., where theobject leaves to) nodes, thus it can be used to quickly answertrace queries. The place where the index is stored is determinedsolely by the id of the object thanks to DHT’s determinism, sothat from anywhere in the network, the object can be locatedby its id1. The abstract data structure is essentially a doublelinked list distributed in the DHT network. The head of thelist is the node where the latest location is indexed, which wecall a gateway node. When the object moves to a new node,the gateway node is updated and the list is expanded. In ourwork, we adopt Chord [26] as the overlay for its adaptivenessas nodes join and leave. Figure 2 illustrates our idea.

1In order to take this advantage, we hash the object’s raw id using theSHA-1 function. Thus the ids of the objects and the nodes are in the samespace.

Symbol DescriptionNn Number of nodes in the traceable networkL Length of object/node ID in bitsLp Length of prefix in bitsLmin The minimum value of Lp

Tinterval Time interval of data capturingTmax The maximum value of TintervalNo Number of objects captured within TintervalNmax The maximum value of No

Fig. 3: Symbols

The node n7 is the gateway node for object o1. It stores thelatest location of o1. In this case, before o1 arrives at node n4,it was captured at node n3. After it arrives and is captured bya receptor governed by n4, n4 sends a message (M1) to n7 bythe P2P lookup interface, telling n7 that o1 has arrived at n4.Upon receiving this message, n7 updates its index and sendsmessages to n3 and n4. The message to n3 (M2) indicates thato1 arrives at n4, so n3 updates its IOP by setting o1.to=n4.The message to n4 (M3) indicates that o1 was from n3, so n4

updates its IOP by setting o1.from=n3. In this way, the IOPis established and the index is updated.

IOP is essentially a distributed double linked list sorted bytime. In this way the sequencing is guaranteed. To answerqueries about a specific object, the first step is to find out itslatest location by querying the corresponding gateway node.We can then trace back the list and ask each node along thelist for the desired information.

A problem of our IOP acquisition is the indexing cost.Whenever an object arrives at a node, the system has to issue 3messages (1 message from the node and 2 messages from theobject’s corresponding gateway node) in order to establish theIOP links. This is clearly an expensive approach for large-scaleapplications where order of magnitude is often thousand, andobjects often move in groups. In order to solve this problemwhile still having the IOP links established, we next introducea novel IOP acquisition algorithm.

IV. THE IOP ACQUISITION ALGORITHM

Before we introduce the algorithm, we first summarize thesymbols used in the algorithm in Figure 3.

A. Group Indexing Algorithm

In a large traceable network, the data volume might be veryhigh. Indexing each individual object will cause enormousnumber of messages to flood the network. An obvious solutionto solve this problem is to classify the objects arriving at anode within a small period of time into different groups. Inparticular, objects’ raw ids can be hashed using the securehash algorithm SHA-1 and grouped by the prefixes of theirhashed ids. For example, if 1024 objects arrived at a node n

and we choose a prefix length of 4, there are at most 24=16prefixes. As a consequence, instead of indexing all these 1024objects, we simply classify them into at most 16 groups byprefixes (with prefixes as the group ids), and index the groups.

504

Based on this idea, we introduce our enhanced IOP acquisitionalgorithm in the following.

1) Group Generation.: A group function is invoked pe-riodically at time intervals of Tinterval or triggered by otherconditions (to be discussed later) to divide the objects capturedduring this timeframe into groups. Two objects belong to thesame group when their ids have Lp prefix bits in common.

There are two significant parameters in this process. Thefirst one, Tinterval, is the width of windows of the objectstream. We take the objects in the same window for groupingand indexing at one cycle. This parameter is introduced torestrict the size of the indexing message. Unfortunately, afixed value of Tinterval will cause problems when the objectstream is unstable. For example, if an object stream suddenlyhas an extremely high volume (e.g., more products enter thewarehouse in one cycle), the number of objects in the samewindow will go very high thus the size of indexing messagebecomes larger. In our work, we exploit an adaptive schemeto determine the width of windows. In particular, we definea maximum time interval, Tmax, and the maximum numberof objects received at each cycle, Nmax. After Tmax haspassed, or when the number of objects received has exceededNmax, the current cycle ends and the grouping process isstarted on the received objects during this cycle. Meanwhile,a new cycle is started. In order to limit the message size sentfor indexing the objects, we believe an upper bound on thenumber of received objects in each cycle (Nmax) is necessary.With Tmax, it is possible to enable real time indexing sothat the objects will not be delayed too long for indexingwhen volumes become very low. The value of Tmax canbe determined by the system’s timeliness restriction and isconfigurable. Similarly, the value of Nmax can be configured.

The second parameter, Lp, is the key to determine the totalnumber of groups in the system and should be chosen wisely.If it is too big, the number of groups is high and in the worstcase, it is close to the number of objects. As a result, thenumber of messages will not be significantly decreased. Onthe other hand, if it is too small, only a small portion of thenodes in the network will be responsible for indexing, thusthe work load is not well balanced. Essentially, it is importantto find an optimal value of Lp that can guarantee that almostevery node in the network has the opportunity to index at leastone prefix (i.e., group). We denote the probability that a nodehas at least one group to index as δ. Because prefixes aredistributed uniformly at random over the nodes by the hashfunction, according to the probabilistic theory, we have:

δ = 1−

(Nn − 1

Nn

)m

(4)

Where m is the number of groups to distribute, and it is afunction of the length of prefix: m = 2Lp .

The optimal value of Lp should yield the value of δ as closeas possible to 1. In fact, when m← Nn log2Nn, we have:

limNn→∞

δ = limNn→∞

(1−

(Nn − 1

Nn

)Nn log2Nn

)= 1 (5)

And if m← Nn, we have :

limNn→∞

δ =e− 1

e< 1

Thus we can determine the Lp value:

Lp � log2 (Nn log2Nn) = log2Nn + log2 log2Nn (6)

According to Equation 6, we choose �log2Nn +log2 log2Nn� for Lp. As shown in our experiments (seeSection V), the value we have chosen is good enough to keepthe performance while ensuring good load balancing.

As new nodes join and existing nodes leave,Nn is dynamic.As a result, there is no precise way to calculate this value.However, there are some algorithms available to estimate thevalue of Nn. Interested readers are referred to [14] for details.According to Equation 6, Lp increases much slower than Nn.Therefore there is no need to recalculate Lp every time thenetwork changes. Instead, Lp can be adjusted at a relativelylong interval.

During the bootstrap of the network, however,Nn should bere-calculated more frequently to quickly detect a change of Lp

because Nn starts from a small value and changes frequently.According to Equation 6, Lp will also begin with small valuesand keep changing, thus at the beginning, the number ofgroups is small and objects are indexed nearly individually.To solve this problem, we introduce Lmin, a minimum valueof Lp. The value of Lmin is also configurable and applies toall nodes in the network.

2) Algorithms for Indexing and Index Persistence.: Afterthe groups are built, the node that captured objects will sendindexing messages to the gateway node for each group. Thegateway node is determined by the hash value of the groupid. For example, objects belonging to the group “00” will beindexed in the node hash(“00”). The indexing message has theformat of (group id, (objects), timestamp). The gateway nodestores the indexing information for each object individually.

The IOP update process is almost the same as that inSection III except instead of sending one message for eachobject, we send one message for each group of objects whichare from the same node. However, we cannot simply store andlookup the index at the gateway node. There are two issues weneed to address because of the introduction of the groupingscheme.

Firstly, changes of Lp cause grouping inconsistencies. Forexample, at node n1, Lp was 2 and we grouped objects 0000and 0001 into the same group 00. The index was storedat gateway node hash(“00”). Before these objects arrivedat node n2, new nodes joined that caused Lp to become3. After objects 0000 and 0001 arrived, they are groupedinto group 000 and the corresponding gateway node becomeshash(“000”). At this moment, the node hash(“000”) does nothave previous information about these two objects and it may

505

��

��

�

��

��

��

��

�� !��

Fig. 4: Example of Data Triangle

assume that node n2 is the first node for the trace of theseobjects, which in fact should be n1. A similar situation alsohappens when the size of the network decreases. This causesthe IOP information to be updated incorrectly.

Secondly, once Lp is fixed, gateway nodes are also deter-mined. If Lp does not change and nodes do not leave thenetwork, the gateway nodes remain the same. Consequentlynew nodes have no chance to index any group. The load isthen not well balanced.

To solve the above two problems caused by Lp, we in-troduce a distributed data structure called Data Triangle. Adata triangle consists of three nodes in the network. The rolesof them are not symmetric. Figure 4 (a) shows an exampleof the triangle. The gateway node is responsible for indexingobjects whose ids are prefixed by “000” (current prefix lengthis 3). The two child nodes are responsible for indexing “0000”and “0001”. But the child nodes do not index groups directly,instead they are secondary storage for their parent. The parentdelegates part of the indexing data according to the next bitin an object id after the prefix to the child nodes respectively,thereby keeping the workload between the three balanced.

Suppose to increase the current Lp by 1, we need ΔNn

new nodes to join. We have:

�(Nn +ΔNn)log2(Nn +ΔNn)� − �Nnlog2Nn� = 1 (7)

Generally, ΔN is less than Nn. In the worst case, thereare ΔNn nodes idle. For a traceable network whose globalprefix length is Lp, we use extra 2Lp+1 logical nodes (thechild nodes) for indexing. The 2Lp+1 + 2Lp = 3Nn log2Nn

logical nodes are distributed among at most ΔNn+Nn < 2Nn

physical nodes. By similar inference to Equation 6, there is ahigh probability that all physical nodes have at least one groupto index.

With the changes to the network (nodes join and leave), thechild nodes may have their children. Eventually, the indexingwill be distributed in a tree structure (Figure 4 (b)). However,the tree is not necessary. From the analysis above, we cansee that the data triangle is good enough to maintain awell balanced workload. Maintaining a tree introduces longerdelay for looking up. Though trees make the workload morebalanced, it costs more for indexing process. For all newobjects, unless we fully traverse the tree, it is impossible todetermine whether they are new or have historical informationstored somewhere in the tree. In this case, each object will be

looked up in each level of the tree, so for No objects and atree of height h, the total number of DHT lookup operationsis h ∗ No, which clearly is expensive.

To solve this problem, we will not maintain the wholetree all the time. Instead, once the network size changes andcauses the prefix length to change, we start a splitting-mergingprocess. If the prefix length increases, then the child nodesin the triangle become parent nodes in new triangles. Thedata stored in the old parent will all be delegated into thetwo new parent nodes which are its child nodes. Similarly,if the prefix length decreases, the parent node in the trianglebecomes a child node of another node that is indexing a shorterprefix, the parent node’s two child nodes migrate the data theyare indexing to the parent node. Thus eventually we alwaysmaintain only triangles, instead of trees. To look up an objectwhich does not exist locally, we only need to ask the parentand its two children, so for No objects, 3 ∗ No DHT lookupsare enough. Taking Figure 4 as an example, after the prefixlength increases from 3 to 4, the node indexing prefix “000”will split all its data into “0000” and “0001” according thefourth bit of their ids. Figure 5 shows our indexing algorithm.

The algorithm index first tries to update the index storedlocally (line 3) and gets a list of objects whose index is storedeither in ascents or descents by a set difference operation (line1). Then we refresh the local indexing records by searchingup and down in the tree (lines 5 and 6) for the objects in thedifference set. After the index of all objects are downloadedto the local storage, or the tree has been fully traversed, theindex is updated with the new information (line 7).

The two refresh procedures are used to retrieve the in-dexing by traversing the tree up and down, respectively.Before sending the fetching request to the respective nodes,the object list is filtered by the prefix (the filter func-tion in line 4 of refresh_from_ascent and lines 3, 9 ofrefresh_from_descent) for pruning. The address of theparent and children can be cached to save the cost of DHTlookup.

For the update procedure, the system first updates localindexes with the objects to be indexed (line 1). Then itchecks whether it is necessary to delegate some records to thetwo child nodes (line 2). There can be different strategies todetermine this. For example, whether the local storage for thisprefix exceeds a certain amount, or whether there have beena number of records older than a pre-configured time. If adelegation is required, we select the earliest α∗objects.count(0 < α ≤ 1) objects indexed at this gateway and delegate themto the two child nodes. The delegation is similar to the FIFOcache replacement policy. This is based on the observation thatthe latest records are more likely to be read and updated inthe near future. Here α is a global configuration to control thenumber of objects to be delegated. It should be noted that thesplitting-merging process is trivial and we have not includedthe details of the algorithm.

3) Lookup Algorithm.: The object lookup algorithm isstraightforward. We first try to find it in the gateway node forthe prefix of current length. If it cannot be found, we conduct

506

Algorithm : IndexInput: prefix p with length Lp and the set of

objects belong to the group hash(p)Output: None (the algorithm only updates indexing information)1: update local index with local.objects ∩ objects

2: objects′ ← local.objects− objects3: // get the set of objects which are not stored locally4: if objects′ is not empty5: refresh_from_ascent(objects′, p)6: refresh_from_descent(objects′, p)7: update_index(objects′, p)8: end if

refresh_from_ascentInput: prefix p with length Lp and the set of

objects belong to the group hash(p)Output: None (the algorithm refresh local index with ascents)1: i← 12: do3: p′ ← p.sub(1,Lp − i)4: objects′ ← filter(objects, p′)5: if objects′ = Φ break6: objectsparent ← fetch(objects′, p′)7: update local index with objectsparent

8: i← i+ 19: while there exists gateway node for prefix p′ and i ≤ Lp−Lmin

refresh_from_descentInput: prefix p with length Lp and the set of

objects belong to the group hash(p)Output: None (this algorithm refresh local index with descents)1: objects′ ← filter(objects, p+‘0’)2: if objects′ is not empty3: objectschild0 ← fetch(objects′, p + ‘0’)4: update local index with objectschild05: refresh_from_descent(objects, p+‘0’)6: end if7: objects′ ← filter(objects, p+‘1’)8: if objects′ is not empty9: objectschild1 ← fetch(objects′, p + ‘1’)10: update local index with objectschild111: refresh_from_descent(objects, p+‘1’)12: end if

update_indexInput: prefix p with length Lp and the set of

objects belong to the group hash(p)Output: None (the algorithm only updates indexing information)1: update local index with objects2: if delegation is required3: select α ∗ objects.count earliest local indexing records as

objects′

4: delegate objects′ to the two children according to the Lp bitof their ids

Fig. 5: Algorithm for Indexing a Group of Objects

a bidirectional linear search starting from the gateway node.

B. Query Processing

To perform the trace function TR (see Section II-B), weneed to find the gateway node of the given object. From thegateway node, we can discover the node where the object isseen for the last time and simply traverse back using the IOPinformation. In addition, as mentioned in Section III, if anynode along the route from the query requesting node to thegateway node has the information of the object, the trace query

can be processed from this node by traversing backward andforward using the IOP. In this way, we do not have to findthe gateway node of the object. With TR solved, it becomesstraightforward to process queries.

It is worth mentioning that our architecture is built on topof the existing DHT networks and we do not explicitly dealwith the dynamicity of the underlying networks. One reasonis that Chord adapts well with the dynamic networks. Whennew peer joins, only a small portion of nodes will migratetheir data. Similarly, when a peer leaves, it will migrate itsdata to another peer [26].

C. Algorithm Analysis

1) Indexing: The indexing process consists of three phases:grouping, routing and index-persisting. The grouping phasesimply scans the No new objects. Its complexity is Θ(No).

Chord routing takes O(log2Nn) hops with high probability.To route the objects in 2Lp groups will take O(2Lp log2Nn)hops in total. When objects are routed individually, it takesO(No log2Nn) hops. Since the number of received object No

can be very large, while 2Lp = Nn log2Nn is relatively small,our grouping index algorithm can significantly decrease theP2P routing cost.

However, the index-persisting phase is complicated becauseof the Data Triangle structure and the possible trees extendedfrom it. In the best scenario, all the indexing information fornew objects are stored in the gateway node, then the persistingwill only involve local lookup and storage. The complexity isO(No) in the measure of database operations.

In the worst scenario, all the new objects have their historyindexes stored in either ascent nodes or descent nodes, orall the objects are new to the whole network thus there isno historical information in the tree (then the tree has to befully traversed). Index persisting may involve both refreshing(i.e., refresh_from_ascent and refresh_from_descent)and delegation. Suppose that the tree is complete binary andits height2 is h, and all the historical indexing information isstored either in the deepest leaves or in the root. Therefore,it requires at most h DHT lookup to obtain the information.So the complexity of the worst case of the index persisting isO(h ∗No). However, as the tree will be split or merged whenprefix length changes, its height is well controlled. In mostcases h is 1 (only children) or 2 (parent and children). The twotrees in Figure 4 illustrates these two situations respectively.Then the index persisting phase can be done in constant time.

2) Query Processing: There are two cases for query pro-cessing according to where the L function (see Section II-B) isanswered, which will be discussed separately in the following.Gateway. If there is no relevant node along the routing pathfrom the node where the query is issued to the gateway node,then the locate is answered by the gateway node. The routingprocess takes O(log2Nn) hops. At the gateway node, there isa high probability that the lookup algorithm in Section IV-A3takes O(1) hops.

2The height of a tree is defined as the number of edges from the root tothe deepest leaf.

507

Intermediate Node. If during the routing, a node along therouting path has the information for the queried object, therouting will be terminated and the intermediate node will startto process the query. Assuming that the probability for a nodealong the routing path to answer the query is 1

Nn(uniform),

then the average number of hops will be log2Nn(log2

Nn−1)2Nn

.After the node n where the query can be answered is found,

the complexity of query processing is solely based on the typeof the query. For example, L takes O(l) (l is the length of theobject’s lifetime trajectory) DHT lookups in the worst case(when the object was at the end or start of the trace for thegiven time) and O(1) for the best case (when the object was atn for the given time). However, for a trace query that requiresthe lifetime trajectory to be found, it will always take O(l).

V. EXPERIMENTS

Our experiments focused on three aspects. Firstly, we evalu-ated the performance of the indexing algorithms and comparedthe scalability of the individual indexing approach and theenhanced group indexing approach. Secondly, we studiedthe performance of query processing using the techniquesproposed in this paper and compared it with that of centralizedarchitecture. Finally, we examined the effect of the prefixlength (Lp), in the proposed group indexing algorithm, on loadbalancing and indexing performance of traceable networks.

We used the open source P2P simulator OverSim [3] in theexperiments. In our tests, the maximum number of nodes is512 and the maximum number of objects at each node is 5,000,corresponding to the limit of our experimental environment.All experiments were conducted on an Intel Core 2 Quad2.4GHz system with 4GB of RAM.

A. Indexing

The indexing cost, measured by the total volume of mes-sages transferred over the network, is considered to studythe feasibility of the proposed group indexing algorithm andrelated data structures. We first conducted the test with anetwork of 512 nodes and generated a specific number ofobjects at each node. We ran the test 10 times and for the ith

(1 ≤ i ≤ 10) time, the number of objects generated at eachnode is 500 ∗ i. To simulate the movement of objects, 10% ofthe local objects at each node were moved along a trace of 10nodes. We measured both the group and individual indexingalgorithms. The result is depicted in Figure 6a.

From Figure 6a we can see that, when the data volumeis not high (e.g., 500), most of the groups contain only oneor two objects. The number of groups (2Lp) is close to thenumber of the objects (No). Thus the group indexing algorithmcosts almost the same as the individual indexing algorithm.However, with increasing data volume, the indexing cost ofthe group indexing algorithm increases much slower than theindividual indexing algorithm.

We also studied the performance of the indexing algorithmsunder different network sizes. In the experiment, the numberof objects generated at each node is 5,000 and the networksize varies from 64, 128, 256, to 512. The result in Figure 6b

0

100

200

300

400

500

500 1000 1500 2000 2500 3000 3500 4000 4500 5000

index

ing c

ost

data volume (number of objects at each node)

Scalability of Indexing on Data Volume (Dynamic Network)

IndividualGroup

(a)

0

100

200

300

400

500

600

50 100 150 200 250 300 350 400 450 500 550

index

ing c

ost

network size (number of nodes)

Scalability on network size

Individual IndexingGroup Indexing (movement in group)

Group Indexing (movement individually)

(b)

Fig. 6: Scalability of Indexing

shows that with the increase in network size, the indexing costfor the individual indexing algorithm increases linearly whilethe group indexing algorithm shows a sublinear pattern. Thereason is that when the network size increases, Lp increases aswell, which leads to an increase in the number of groups (2Lp)and the messages to be transferred. Generally, the performancedegrades when the ratio of data volume to network sizebecomes small. As we can see, when the size of networkincreases, the indexing cost for the group indexing algorithmbecomes closer to that for the individual indexing algorithm.However in reality, particularly large-scale applications, thereare much more objects than nodes. Clearly, these applicationscan take advantage of the benefits brought by our proposedgroup indexing approach. We also compared the performanceof the group indexing algorithm when the objects move ingroups or individually. Figure 6b shows that the indexing costsless when the objects move in groups, because in this case,objects are more likely to fall into the same capturing window,so that they can be grouped by the grouping algorithm.

B. Query Processing

In this experiment, we studied the performance of ouroverall approach for query processing under different networksizes and different data volumes. We used a typical query“Where has object oi been?” for tracing individual objectsin the study. We also compared the performance of the P2Parchitecture and a centralized approach. For the P2P approach,we added 5ms (typical network latency of T1) as the network

508

0

20

40

60

80

100

120

140

50 100 150 200 250 300 350 400 450 500 550

pro

cess

ing t

ime

(ms)


Query Processing Time on Network SizeP2P

Centralized

(a)

0

20

40

60

80

100

120

140

500 1000 1500 2000 2500 3000 3500 4000 4500 5000

pro

cess

ing t

ime

(ms)

data volume (number of objects at each node)

Query Processing Time on Data VolumeP2P

Centralized

(b)

Fig. 7: Query Processing

latency for each network query. For the latter, we used themodel proposed in [31] to build the same data in a centralizedMySQL database.

Figure 7a shows the query processing time for tracingan object in networks of different sizes. We performed 100queries for different objects and counted the average timeused because different objects may have different length oftraces. From the figure we can see that the query processingtime is almost constant for the P2P approach, but increasesultralinearly for the centralized one. The query time for P2Pnetwork is not relevant to the size of the network, insteadit is a function of the length of trace. For the centralizedapproach, because the data is stored in the same database, thetime for querying is relevant to the size of the database, whichis proportional to the size of the network when the numberof objects generated at each node is fixed. We can also seethat when there are less nodes (i.e. less objects), centralizedapproach outperformes P2P one. However, when the numberof nodes and records in the network reaches certain amount,which is typical in large scale RFID networks, it will run moreslowly than the P2P approach.

Figure 7b shows the query processing time for different datavolumes (the manner in which we generate different volumesof data is identical to that in Section V-A). Similar resultwas obtained (i.e., the time used is almost constant for P2Papproach but ultralinear for the centralized one). The highscalability shown in the experimental results is due to the IOPinformation in our design.

(a)

0

5

10

15

20

50 100 150 200 250 300 350 400 450 500 550

ind

exin

g c

ost

(b

ase-

2 l

og

of

the

nu

mb

er o

f m

essa

ges

tra

nsf

erre

d)


Load Balance for Different Prefix Length Schemes

Scheme 1Scheme 2Scheme 3

(b)

Fig. 8: The Effect of Prefix Length

C. The Effect of Lp

As discussed in Section IV-A2, the choice of Lp signif-icantly affects the performance and load balance of trace-able networks. In this experiment, we studied the differentschemes of Lp and tested their corresponding load balanc-ing capabilities. Figure 8a shows the result for the threedifferent schemes (i.e., Scheme 1: Lp=log2Nn, Scheme 2:Lp=log2Nn + log2 log2Nn, and Scheme 3: Lp=2 log2Nn).Scheme 2 is the one that we have chosen for our system(which is also the scheme used in all other experiments). Westudied Scheme 1 and Scheme 3 in this experiment as they areasymptotically smaller/bigger than Scheme 2, respectively.

We illustrate the load balance by showing the load per-centage (i.e., the number of objects handled by a given setof nodes divided by the total number of objects) for a givennode percentage (i.e., the number of nodes in the set dividedby the total number of nodes). A well balanced scheme shouldyield a linear relationship between the load percentage and thenode percentage (where y = x, i.e., diagonal), meaning thateach node receives the same number of objects to index. Thefarther the curve is away from the diagonal, the worse it is.

As we can see from Figure 8a, when Scheme 1 was chosenas the length of prefix Lp, the load is not well balanced.The curve is far away from the diagonal and shows somesaltations. Scheme 3 performs best among the three becauseit is very close to the diagonal, which implies that the loadis well balanced. However, Scheme 3 makes Lp too long andthe number of groups becomes too big, leading to less objects

509

in each group. This significantly affects the indexing cost (seeFigure 8b). Overall, from our study, Scheme 2 provides a goodchoice for Lp, with which the work load is also well balanced.

We also tested the performance of the three schemes withdifferent network sizes at fixed data volume (5,000 objects ateach node). Figure 8b shows that among the three, Scheme 1 isthe most efficient one and Scheme 3 is the worst (its indexingcost is the square of the number of nodes). However, there isa tradeoff between indexing performance and load balancing.To improve the former, Lp should be smaller, which leadsto poor load balancing. To improve the latter, Lp should bebigger to ensure that all nodes have groups to be responsiblefor, which leads to a higher indexing cost. Scheme 2 showsacceptable results on both indexing performance and loadbalancing. In reality, we can choose different schemes fordifferent scenarios. For example, if the performance of anapplication is very important, Scheme 1 will be a good choice.

VI. RELATED WORK

Successfully tracing objects in a distributed environment isnot a single layer problem. In this section, we overview therelevant techniques to our work.

The first step of modeling moving objects is to abstractthe basic elements such as time, region and velocity. [25]introduces a data model called MOST (Moving ObjectsSpatio-Temporal) for databases with dynamic attributes, i.e.attributes that change continuously as a function of time.Later works mostly use the same or similar idea. With theelements modeled, it is possible to answer basic queries suchas location of an object at a certain time. In order to answermore complicated queries, such as an aggregate query “howmany objects are in region R now?” and a future statequery “where will the object O be after one hour?”, variouscomplex and domain-specific models have been developed. Forexample, in [27], adaptive multi-dimension histogram (AMH)is used in answering aggregate queries about past, present andfuture. Since this kind of work focuses on aggregate queries,it does not address the single-instance queries. In [7], themoving objects are associated with four variables (startingtime, starting location, destination, initial velocity) in orderto predict their future locations. There are some other worksfocusing on various problems of modeling and querying mov-ing objects by employing different techniques. However, theyshare some common characteristics. Firstly, space and time aremodeled as continuous attributes. Secondly, most of the worksdefine region or similar concepts that cover a finite area in acontinuous infinite domain in order to answer range queries.

Index is often used to quickly answer range queries for aspecific attribute. For example, if the space is divided intocells and objects are indexed by cell, it is easy to answerqueries such as “how many or which objects are in cell c?”.However, dividing the space into fixed-size cells does not workwell in dynamic environments because the boundary of thespace must be decided in advance. Early works, includingR-Tree, R∗-tree, TR-tree, TB-tree, TPR-tree, TPR∗ R-treeand some other similar trees [20], dynamically decide which

point or region to index and optimize the indexing processin different ways. [15] considers streaming data that requirefrequent updates and proposes an efficient B+-Tree basedindexing method which represents the moving-object locationsas timestamped vectors. [21] indexes the predicted trajectoriesto quickly predict future locations of moving objects. Manyrecent works such as [4], [9], [16], [29], [34] focus on otherspecific problems and provide corresponding solutions. Theseindex methods all operate in centralized environments.

In recent years, some generic peer-based database man-agement systems [12] have been proposed to support largevolume data and complicated queries. Although these genericsolutions lays the foundation for peer-based spatial databases,they are not dedicated to efficiently manage spatial data andanswer spatial queries. [19] presents an analytical model topredict the cost of query algorithms based on query location,query size and the moving objects’ distribution so that thefinal scheme chosen to perform the query is optimal. [18]proposes a distributed hash technique to answer range queries,which scales well under certain assumption about the querydistributions. [28] extends Quadtree index into a P2P networkto answer range queries efficiently while keeping the loadon the nodes in the network well balanced. [17] provides alinear yet distributed structure that facilitates multiple searchpaths to be mixed together by sharing links. These distributedindex methods are all based on spatial elements such as point,link or region thus they work well for range queries but notsingle-instance queries. In [35], the authors proposed a genericmodel to deal with the event matching problem of content-based publish/subscribe systems over structured P2P overlays.This model is useful in a distributed RFID system if it isevent-driven instead of query-driven.

VII. CONCLUSION

Recent advances in technologies such as radio-frequencyidentification (RFID) make automatic tracking and tracingpossible in a wide range of applications. Unfortunately, re-alizing traceability applications in large-scale, distributed en-vironments such as the emerging Internet of Things presentssignificant challenges due to their unique characteristics suchas large volume of data and sovereignty of the participants.In this paper, we have presented a generic approach thatenables applications to share traceability data across indepen-dent enterprises in a P2P fashion. We built our approach ontop of a DHT-based overlay network. Objects are indexed atdeterministic gateway nodes that are responsible for updatingobjects’ status at the source and destination nodes for theirmovements. In this way, the IOP (information of object path)properties of objects are established, which are critical tosupport efficient processing of traceability queries. To reducethe indexing overhead from massive volumes of data in large-scale traceability applications, we further proposed an en-hanced group-based indexing approach. Extensive experimentsshowed the viability and scalability of our approach.

We view our work presented in this paper as a first steptowards efficient tracking and tracing of objects in the Internet

510

of Things. Our ongoing work includes further performancestudy of the proposed techniques. Another important directionfor future work is to add capabilities for predicting future sta-tus of objects. This typically involves overcoming uncertaintyissues introduced by traceability applications (e.g., incompleteand noisy data) using statistical and probabilistic techniques.

REFERENCES

[1] R. Agrawal, A. Cheung, K. Kailing, and S. Schönauer. TowardsTraceability Across Sovereign, Distributed RFID Databases. In Proc.of IDEAS’06, India, 2006.

[2] H. Balakrishnan, M. Kaashoek, D. Karger, R. Morris, and I. Stoica.Looking up Data in P2P Systems. Communications of the ACM,46(2):43–48, 2003.

[3] I. Baumgart, B. Heep, and S. Krause. OverSim: A Flexible OverlayNetwork Simulation Framework. In Proc. of GI’07, USA, 2007.

[4] V. Botea, D. Mallett, M. A. Nascimento, and J. Sander. PIST: An Ef-ficient and Practical Indexing Technique for Historical Spatio-TemporalPoint Data. Geoinformatica, 12(2), 2008.

[5] H. Chen, W. Ku, H. Wang, and M. Sun. Leveraging Spatio-TemporalRedundancy for RFID Data Cleansing. In Proc. of SIGMOD’10, USA,2010.

[6] A. Cheung, K. Kailing, and S. Schönauer. Theseos: A Query Engine forTraceability Across Sovereign, Distributed RFID Databases. In Proc. ofICDE’07, Turkey, 2007.

[7] H. Chon, D. Agrawal, and A. Abbadi. Storage and Retrieval of MovingObjects. In Proc. of MDM’01, UK, 2001.

[8] M. J. Franklin et. al. Design Considerations for High Fan-in Systems:The HiFi Approach. In Proc. of CIDR’05, USA, 2005.

[9] Y. Gao, G. Chen, Q. Li, B. Zheng, and C. Li. Processing Mutual NearestNeighbor Queries for Moving Object Trajectories. In Proc. of MDM’08,USA, 2008.

[10] N. Gershenfeld, R. Krikorian, and D. Cohen. The Internet of Things.Scientific American, 291(4):76–81, 2004.

[11] H. Gonzalez, J. Han, and X. Shen. Cost-Conscious Cleaning of MassiveRFID Data Sets. In Pro. of the 23rd Intl. Conf. on Data Engineering(ICDE’07), Istanbul, Turkey, April 2007.

[12] K. Hose, A. Roth, A. Zeitz, K. Sattler, and F. Naumann. A ResearchAgenda for Query Processing in Large-Scale Peer Data ManagementSystems. Information Systems, 33(7-8):597–610, 2008.

[13] Shawn Jeffery, Minos Garofalakis, and Michael Franklin. AdaptiveCleaning for RFID Data Streams. In Proc. of VLDB’06, Seoul, Korea,September 2006.

[14] M. Jelasity and A. Montresor. Epidemic-Style Proactive Aggregation inLarge Overlay Networks. In Proc. of ICDCS’04, USA, 2004.

[15] C. S. Jensen, D. Lin, and B. Ooi. Query and Update Efficient B+-treeBased Indexing of Moving Objects. In Proc. of VLDB’04, Canada, 2004.

[16] R. Lange, F. Dürr, and K. Rothermel. Scalable Processing of Trajectory-based Queries in Space-partitioned Moving Objects Databases. In Proc.of GIS’08, USA, 2008.

[17] W. Lee and B. Zheng. DSI: A Fully Distributed Spatial Index forLocation-Based Wireless Broadcast Services. In Proc. of ICDCS’05,USA, 2005.

[18] X. Li, Y. Kim, R. Govindan, and W. Hong. Multi-dimensional RangeQueries in Sensor Networks. In Proc. of SenSys’03, USA, 2003.

[19] A. Meka and A. Singh. DIST: A Distributed Spatio-Temporal IndexStructure for Sensor Networks. In Proc. of CIKM’05, Germany, 2005.

[20] M. F. Mokbel, T. M. Ghanem, and W. G. Aref. Spatio-Temporal AccessMethods. IEEE Data Engineering Bulletin, 26:40–49, 2003.

[21] J. M. Patel, Y. Chen, and V. Chakka. STRIPES: An Efficient Index forPredicted Trajectories. In Proc. of SIGMOD’04, France, 2004.

[22] D. Ranasinghe, Q. Z. Sheng, and S. Zeadally. Unique Radio Innovationfor the 21st Century: Building Scalable and Global RFID Networks.Springer, 2010.

[23] Q. Z. Sheng, X. Li, and S. Zeadally. Enabling Next-Generation RFIDApplications: Solutions and Challenges. IEEE Computer, 41(9):21–28,September 2008.

[24] Q. Z. Sheng, Y. Wu, and D. Ranasinghe. Enabling Scalable RFIDTraceability Networks. In Proc. of AINA’10, Australia, 2010.

[25] A. Sistla, O. Wolfson, S. Chamberlain, and S. Dao. Modeling andQuerying Moving Objects. In Proc. of ICDE’97, UK, 1997.

[26] I. Stoica et. al. Chord: A Scalable Peer-to-Peer Lookup Service forInternet Applications. In Proc. of SIGCOMM’01, USA, 2001.

[27] J. Sun, D. Papadias, Y. Tao, and B. Liu. Querying about the Past,the Present, and the Future in Spatio-Temporal Databases. In Proc. ofICDE’04, USA, 2004.

[28] E. Tanin, A. Harwood, and H. Samet. Using a Distributed QuadtreeIndex in Peer-to-Peer Networks. The VLDB Journal, 16(2):165–178,2007.

[29] K. Tzoumas, M. Yiu, and C. S. Jensen. Workload-aware Indexing ofContinuously Moving Objects. PVLDB, 2(1):1186–1197, 2009.

[30] F. Wang, S. Liu, and P. Liu. Complex RFID Event Processing. TheVLDB Journal, 18(4):913–931, 2009.

[31] Fusheng Wang and Peiya Liu. Temporal Management of RFID Data.In Proc. of the 31st Intl. Conf. on Very Large Data Bases (VLDB’05),Trondheim, Norway, September 2005.

[32] E. Welbourne, L. Battle, G. Cole, K. Gould, K. Rector, S. Raymer,M. Balazinska, and G. Borriello. Building the Internet of Things UsingRFID: The RFID Ecosystem Experience. IEEE Internet Computing,13(3):48–55, May/June 2009.

[33] Sai Wu, Shouxu Jiang, Beng Chin Ooi, and Kian-Lee Tan. DistributedOnline Aggregations. PVLDB, 2(1):443–454, 2009.

[34] M. Zhang, S. Chen, C. S. Jensen, B. Ooi, and Z. Zhang. EffectivelyIndexing Uncertain Moving Objects for Predictive Queries. PVLDB,2(1):1198–1209, 2009.

[35] S. Zhang, J. Wang, R. Shen, and J. Xu. Towards building efficientcontent-based publish/subscribe systems over structured p2p overlays.In In Proc. of ICPP’10, 2010.

511

[ieee 2011 international conference on parallel processing (icpp) - taipei, taiwan...

Documents