

Cluster Comput (2010) 13: 141–151
DOI 10.1007/s10586-009-0111-1

Design a cloud storage platform for pervasive computing environments

Weimin Zheng · Pengzhi Xu · Xiaomeng Huang · Nuo Wu

Received: 15 May 2009 / Accepted: 5 November 2009 / Published online: 15 November 2009
© Springer Science+Business Media, LLC 2009

Abstract An increasing number of personal handheld electronic devices (e.g., smartphones, netbooks and MIDs), which make up personal pervasive computing environments, are playing an important role in our daily lives. Data storage and sharing are difficult for these devices because of data inflation and the natural limitations of mobile devices, such as limited storage space and limited computing capability. Since emerging cloud storage solutions can provide reliable and unlimited storage, they satisfy the requirements of pervasive computing very well. We therefore designed a new cloud storage platform, called “SmartBox”, which includes a series of shadow storage services to address these new data management challenges in pervasive computing environments. In SmartBox, each device is associated with its own shadow storage through a unique account, and the shadow storage acts as a backup center as well as a personal repository when the device is connected. To facilitate file navigation, all datasets in shadow storage are organized by file attributes, which lets users find files through semantic queries. We implemented a prototype of SmartBox that focuses on pervasive environments made up of Internet-accessible devices. Experimental results from the deployments confirm the efficacy of the shadow storage services in SmartBox.

W. Zheng · P. Xu · X. Huang (✉) · N. Wu
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
e-mail: [email protected]

W. Zheng
e-mail: [email protected]

P. Xu
e-mail: [email protected]

N. Wu
e-mail: [email protected]

Keywords Cloud storage · Pervasive computing · Data management

1 Introduction

Personal electronic devices such as smartphones, digital media players, netbooks, MIDs and other handheld devices have become widely used in our daily lives. We use them to process and display multimedia data such as music, images and videos, as well as data files such as emails and documents. These devices generally have built-in wireless communication capabilities, which allow them to connect to intranets and the Internet and even to one another. Together, these devices make up pervasive computing environments, which are going to play an important role in the future. However, data storage and sharing are difficult for these devices because of data inflation and the natural limitations of mobile devices, such as limited storage space, limited computing capability and intermittent connectivity. Since the problem of data management in these environments has become unavoidable and troubling, many commercial tools, such as iTunes and iPhoto, are emerging to facilitate this work. However, most of these tools are customized for specific devices and specific data types, and their functions are limited to data synchronization and backup between devices and backend servers. The increasing deployment of wireless communication technologies has made it possible to use online storage for ordinary data storage in pervasive computing environments. At the same time, emerging cloud storage solutions [1], which can significantly reduce managerial costs and offer flexible, reliable storage, have shown a strong ability to meet the growing storage requirements of pervasive computing environments.

This paper presents the design and implementation of SmartBox, a cloud storage platform that addresses the challenges of data storage, sharing and exchange introduced by emerging pervasive computing devices. SmartBox acts as an online backend store and provides a series of shadow storage services to these devices. When a device is introduced to SmartBox, an individual shadow storage is allocated and associated with it. In addition, public storage spaces shared among different devices are provided to meet the need for data sharing at home and at the office. We want to serve the typical user population, most of whom just want to use the system and are unwilling to deal with complex operations. Therefore, SmartBox is designed to be self-managing and requires no oversight or input from the user during normal operation.

A recent study [2] shows that users are accustomed to organizing and accessing their data via data attributes instead of traditional hierarchical naming. Users are likely to navigate files via attributes such as publisher-provided metadata, extracted keywords and time stamps, as in popular commercial data access tools such as iTunes and iPhoto. However, the underlying file systems supporting these semantic navigation interfaces have remained tightly tied to hierarchical namespaces. This inconsistency between data access and data storage forces users to translate semantic views into hierarchical views when operating on files.

SmartBox is designed to be an attribute-based cloud storage platform. In SmartBox, each device is associated with an individual shadow storage. The shadow storage acts as a backup center as well as a personal repository when the device is connected. Data exchange between devices becomes simple and feasible by means of the corresponding shadow storage. In addition, SmartBox provides public storage space to groups of devices for data sharing. Furthermore, all datasets in shadow storage are organized by file attributes to facilitate file navigation. This approach allows users to specify files by semantic queries. To evaluate the performance and feasibility of SmartBox, we have implemented a prototype of the cloud storage platform, which consists of a distributed file system as the backend and a set of client utilities.

Our main contribution is threefold. First, we design and implement a prototype of an attribute-based cloud storage system for pervasive computing, SmartBox, which simplifies data management for end users by organizing datasets based on attributes. Second, we present access utilities built on SmartBox that facilitate data storage and sharing in pervasive computing environments by using our cloud storage as the shadow storage for devices, which makes it easy to share data between devices. Finally, we evaluate the whole system thoroughly using a set of micro-benchmarks to demonstrate the usability and feasibility of SmartBox.

This paper is organized as follows. Section 2 introduces prior work on data management and data storage in pervasive computing environments. Section 3 gives a design overview of SmartBox followed by an in-depth description of our design. In Sect. 4, we present the implementation of the prototype and study the feasibility of SmartBox using experiments focusing on performance and overhead; the experimental results are also provided there. Section 5 concludes with a discussion of the implications of our work and describes directions for future work.

2 Related works

Cloud storage has become popular since the success of Amazon's Simple Storage Service (S3) [3]. The seemingly infinite storage capacity of the cloud appears to meet the storage needs of mobile devices exactly.

Some commercial cloud storage platforms provide reliable storage services that can be accessed conveniently through the Internet. Most of them, such as Amazon S3, Microsoft Live Mesh [4], Mozy [5] and Symantec’s Protection Network (SPN) [6], focus on providing storage interfaces similar to those of a traditional file system. Specifically, Amazon S3 is designed to be a general-purpose storage utility with simple, open-standard operations, and it has been widely used to build new applications. Live Mesh targets data synchronization between online storage and local devices. Mozy and SPN focus on data backup, which protects critical files against disasters. Similarly, Cumulus [7], Brackup [8] and Jungle Disk [9] are designed for data backup.

In general, the goals of the above systems differ from those of SmartBox. They are not designed with special optimizations for mobile devices, whose users prefer attribute-based data navigation. SmartBox gives users the choice of a traditional hierarchical namespace for data access and an attribute-based namespace for semantic queries.

We address a problem similar to that of Perspective [2], namely facilitating data navigation in digital homes for home users. Both systems are storage systems designed with mobile devices in mind, and both provide a semantic file system constructed to simplify data management. Perspective studies decentralized data management, which allows a collection of devices to share storage without requiring a central server. Likewise, EnsemBlue [10] addresses the challenges of integrating consumer electronics devices into existing distributed systems. FEW [11] presents a distributed file system that manages files stored on computing devices and portable storage devices. These works have investigated data synchronization, replication strategies and availability among devices, which have been widely studied in previous work such as Bayou [12], Ficus [13] and Coda [14]. In addition, they emphasize the exploitation of portable storage in mobile environments. In our work, we focus on the emerging cloud storage infrastructure to facilitate the data management of mobile devices, and we tackle the problem of providing semantic queries for cloud storage systems.

The idea of using semantic queries to facilitate data management tasks is not novel, and considerable research has been conducted in the field. The Semantic File System (SFS) [15] was the first file system to propose the use of attribute queries to locate data, and subsequent systems showed how these techniques could be extended to meet specific requirements. The later Hierarchy And Content (HAC) file system [16] is designed to benefit from both hierarchical and semantic file systems. More recently, many systems, such as desktop search engines (Google Desktop [17], Beagle [18] and WinFS [19]), have borrowed ideas from SFS by adding semantic information to file systems with traditional hierarchical naming. Naturally, SmartBox draws on the semantic, attribute-based storage and search techniques developed in this rich history of previous work. It is our attempt to build a semantic cloud storage system for mobile devices as well as personal computers. We must not only focus on the functions of semantic queries, but also take into consideration the cloud storage infrastructure, which is critical for supporting the increasing number of users.

SmartBox is such a cloud storage system, addressing the data storage and sharing issues of mobile devices. SmartBox shares similar assumptions with other cloud storage systems such as the Google File System (GFS) [20] and the Hadoop Distributed File System (HDFS) [21]: it is built on commodity computers, aims to provide mass storage services and is a highly scalable distributed system. Unlike GFS and HDFS, SmartBox is designed to be an attribute-based storage system to facilitate the use of mobile devices. In addition, it is a cloud storage platform for end users, in the form of an integrated online storage solution.

3 System design

In this section we present the design of SmartBox, which aims to make data management convenient in pervasive computing environments. The SmartBox system is designed for the storage of personal data such as songs, documents, personal videos and favorite movies, which differ in character from the large datasets of scientific experiments. Typically, these files are written once when created, and only read and list operations are allowed after that. In this sense, the type of file supported by SmartBox is Write-Once-Read-Many.

The goal of SmartBox is to provide storage services for portable devices in pervasive computing environments. SmartBox is intended not only for home environments, but also for broader environments such as offices, enterprises and campuses. To achieve these aims, we have designed an underlying attribute-based distributed file system and a set of access utilities to serve the growing number of users and the urgent demands for data storage in pervasive computing environments.

3.1 Motivation

When we began this work, our objective was to develop a cloud storage system for data access by mobile devices. Mobile devices have limited storage capacity, while the cloud can provide practically unlimited storage. The storage demands of mobile devices therefore seem to mesh well with the offerings of cloud storage. Furthermore, we have found that many mobile devices share files such as photos, movies and e-books. Therefore, we set out to build a cloud storage system for data storage and sharing in pervasive computing environments.

At the beginning we adapted Carrier [22], a distributed file system developed by our research group, to accomplish this task. Our choice of Carrier was driven by several factors. Carrier is built from many inexpensive commodity computers and provides good scalability and high reliability. Carrier is also designed to meet the data storage and sharing needs of users on our campus, which involve problems similar to those in the data management of mobile devices. Finally, Carrier has already been in wide use on the campus for months. If Carrier could be used to support mobile devices, then it would be possible to share existing files in Carrier between mobile devices and personal computers. By mid-March there were more than 5,000 registered users, nearly twenty percent of the students on our campus. Currently, the system is used more than 2,500 person-times every day. The storage capacity is up to 70 terabytes and the total throughput is about 1.3 terabytes per day.

We have evaluated Carrier for the data management of mobile devices, and several aspects are encouraging. We have found that group storage for data sharing between devices is useful. Users belonging to the same group share the same storage. If one of the users adds a file to the group storage, the others can access it anytime and anywhere, regardless of the location of the file owner. The practically unlimited storage provided by the cloud makes it easy to manage files for mobile devices. Users do not need to trade off limited storage space against ever-expanding data; they just put all their files into Carrier. Accessing the cloud storage amounts to carrying all the files, so that users can get their favorite data at any time and place. Further, with the help of Carrier, the data management of mobile devices becomes unified and easy. Users do not need considerable discipline and foresight to ensure data consistency and availability and to avoid data loss. All the data are stored in the cloud, which is always the right place for data access.

Fig. 1 The architecture of the SmartBox cloud storage platform

Unfortunately, our evaluation of Carrier also revealed several problems. The main problem is the way large numbers of files, such as music, movies and videos, are organized and navigated in Carrier: they are stored and organized in hierarchical namespaces. Though most users are familiar with the hierarchical data organization of their PCs, users of mobile devices prefer attribute-based data navigation, since they are likely to navigate files through attributes such as publisher-provided metadata, extracted keywords and time stamps. For example, some mobile devices are dedicated to a special purpose, such as digital cameras and MP3 players. When using these devices, users expect to retrieve data with similar properties, such as the songs belonging to the same album or the videos taken during the same period. However, hierarchical data organization is still indispensable to general-purpose devices such as laptops.

To address this problem, we built a new cloud storage system based on our experience with Carrier and enhanced it to support both attribute-based data navigation and hierarchical data organization.

3.2 Architecture

SmartBox is an online cloud storage platform designed not only for personal computers, but also for mobile devices. It has a layered architecture, as shown in Fig. 1. SmartBox consists of storage resources, the SmartBox File System (SmartBoxFS) and the SmartBox utilities. At the highest level, various end users interact with the system via the provided utilities. At the bottom level, the storage resource layer contains heterogeneous distributed resources, including commodity computers, enterprise storage servers and other high-performance computers. At the heart of the system is SmartBoxFS, which is built on the underlying heterogeneous resources. SmartBoxFS is critical to the system since it is responsible for data storage and management for users. It is developed from Carrier, which follows the basic ideas and concepts of the Google File System, and therefore offers considerable system performance and reliability. In SmartBoxFS, we have also improved namespace management to provide flexible namespaces. In this paper, we concentrate on the organization of the namespace, which is important for users navigating a large number of files. We describe namespace management in detail later in this section. Further up, we provide three different kinds of utilities for users to access and manage the data in SmartBox.

3.3 SmartBoxFS

SmartBoxFS is a distributed file system made up of metadata servers and chunk servers. Files in the system are divided into fixed-size chunks. These chunks are stored on the local disks of chunk servers as ordinary files. Users access them directly on the chunk servers, which alleviates the burden on the metadata servers. Chunk servers are therefore in charge of data storage as well as data transfer. The metadata of files and chunks is collected and maintained by the metadata servers, which are responsible for data management, namespace management, data mapping and system monitoring. Logically, SmartBoxFS is made up of several functional components: data transport and data storage, which are implemented by the chunk servers, and the system monitor, data mapping keeper, data management and namespace management, which are implemented by the metadata servers.

A typical data access session involves the following steps:

(1) A user connects to one of the metadata servers and requests a namespace by submitting an attribute-based query.

(2) The metadata server dynamically generates the matching namespace from the given attributes.

(3) The user translates a named entity in the namespace into chunks and sends requests to the metadata server for the chunk locations, using the chunk IDs.

(4) The metadata server replies with the chunk locations from the data mapping, which maintains the mappings from chunk IDs to the locations of the corresponding chunk servers.

(5) The user connects directly to one or more of the chunk servers indicated by the returned locations to perform the data transport task.
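The session steps above can be illustrated with a small in-memory sketch. It is not the SmartBox implementation: the class names `MetadataServer` and `ChunkServer`, their methods and the sample data are hypothetical, and for brevity the file-to-chunk mapping is read from the metadata object rather than from a client-side cache.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkServer:
    """Hypothetical chunk server: stores chunk payloads keyed by chunk ID."""
    chunks: dict = field(default_factory=dict)

    def read(self, chunk_id):
        return self.chunks[chunk_id]


@dataclass
class MetadataServer:
    """Hypothetical metadata server holding the mappings named in the text."""
    attributes: dict   # file name -> {attribute: value}
    file_chunks: dict  # file name -> ordered list of chunk IDs
    locations: dict    # chunk ID -> list of chunk server names

    def query(self, **wanted):
        # Steps 1-2: generate the namespace dynamically from (attribute, value) pairs.
        return sorted(
            name for name, attrs in self.attributes.items()
            if all(attrs.get(key) == value for key, value in wanted.items())
        )

    def locate(self, chunk_id):
        # Step 4: answer a chunk ID with the servers holding a replica.
        return self.locations[chunk_id]


def read_file(meta, servers, name):
    data = b""
    for chunk_id in meta.file_chunks[name]:      # step 3: named entity -> chunk IDs
        replica = meta.locate(chunk_id)[0]       # step 4: chunk ID -> location
        data += servers[replica].read(chunk_id)  # step 5: direct transfer
    return data


servers = {"cs1": ChunkServer({"c1": b"hello "}), "cs2": ChunkServer({"c2": b"world"})}
meta = MetadataServer(
    attributes={"song.mp3": {"datatype": "music", "artist": "FIR"}},
    file_chunks={"song.mp3": ["c1", "c2"]},
    locations={"c1": ["cs1"], "c2": ["cs2"]},
)
namespace = meta.query(datatype="music", artist="FIR")
payload = read_file(meta, servers, namespace[0])
```

The point of the split is visible in `read_file`: the metadata object is consulted only for small lookups, while the payload bytes come straight from the chunk servers.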

Since files are stored as chunks on chunk servers, which may be commodity computers in our system, chunks are replicated on multiple chunk servers for reliability. In addition, to keep track of chunk servers, we use bootstrap and periodic heartbeat mechanisms to monitor the state of the servers. When a chunk server is added to our system, the metadata server adds a record for it. The metadata server keeps an expiry timer for each active chunk server. If the timer exceeds three heartbeat intervals, the corresponding chunk server is marked as inactive. An inactive server is actually removed only after it has remained inactive for a sufficiently long period (e.g., one or two days), or when it is removed manually. Once a chunk server becomes inactive, all the mappings that indicate the locations of chunks stored on the inactive server are cleaned up. If a chunk server goes down while users are performing read operations, the exceptions are caught by the SmartBox utilities, and the tasks are restarted automatically using the other replicas. If instead users are performing write operations, the consistency mechanism, which commits the metadata update only after the data transport has completed, ensures that the breakdown cannot leave corrupted metadata in our system.
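The heartbeat rule can be sketched as follows. This is a minimal illustration under stated assumptions: the class `ChunkServerMonitor`, its method names and the use of explicit timestamps are hypothetical, and only the three-interval threshold and the cleanup of chunk locations come from the description above.

```python
class ChunkServerMonitor:
    """Hypothetical monitor kept by a metadata server."""

    MISSED_INTERVALS = 3  # threshold stated in the text

    def __init__(self, heartbeat_interval):
        self.heartbeat_interval = heartbeat_interval
        self.last_seen = {}        # server name -> time of last heartbeat
        self.chunk_locations = {}  # chunk ID -> set of server names

    def heartbeat(self, server, now, chunk_ids=()):
        # Record the heartbeat and any chunks the server reports holding.
        self.last_seen[server] = now
        for chunk_id in chunk_ids:
            self.chunk_locations.setdefault(chunk_id, set()).add(server)

    def sweep(self, now):
        # A server is inactive once its timer exceeds three heartbeat intervals.
        limit = self.MISSED_INTERVALS * self.heartbeat_interval
        inactive = {s for s, seen in self.last_seen.items() if now - seen > limit}
        # Clean up every mapping that points at an inactive server.
        for holders in self.chunk_locations.values():
            holders -= inactive
        return inactive


monitor = ChunkServerMonitor(heartbeat_interval=10)
monitor.heartbeat("cs1", now=0, chunk_ids=["c1"])
monitor.heartbeat("cs2", now=0, chunk_ids=["c1"])
monitor.heartbeat("cs2", now=25)
inactive = monitor.sweep(now=35)  # cs1 has been silent for 35 > 30 time units
```

The record in `last_seen` is deliberately kept for an inactive server, mirroring the text's distinction between marking a server inactive and removing it later.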

In SmartBox, files are divided into fixed-size chunks, which are the atomic unit of data storage. The chunk size is chosen to be 2 MB, in consideration of the large number of multimedia files stored in our system. However, users also have to operate on many small files, such as documents, low-resolution photos and voice recordings, which are stored and transferred as individual chunks. Transporting these small chunks incurs relatively large protocol overhead and may be slower over high-latency connections. We therefore split chunks into slices, which are the basic unit of data transport. A slice is a self-describing entity that contains a fixed-length segment of data. It can be sent and received independently and may arrive out of order. Therefore, by sending and receiving batches of slices in parallel, we can reuse connections to reduce the overhead of the TCP three-way handshake and the impact of TCP's slow start and congestion avoidance mechanisms.
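A short sketch may help to show why self-describing slices tolerate out-of-order arrival. The 2 MB chunk size is taken from the text; the slice length, the `Slice` fields and the function names are assumptions made for illustration, since the paper does not specify the slice format.

```python
import random
from dataclasses import dataclass

CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB, as stated in the text
SLICE_SIZE = 64 * 1024        # assumed value; the paper gives no slice length


@dataclass(frozen=True)
class Slice:
    """Hypothetical self-describing slice: it carries its own position."""
    chunk_index: int
    offset: int  # byte offset inside the chunk
    data: bytes


def to_slices(payload, chunk_size=CHUNK_SIZE, slice_size=SLICE_SIZE):
    slices = []
    for chunk_index, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        for offset in range(0, len(chunk), slice_size):
            slices.append(Slice(chunk_index, offset, chunk[offset:offset + slice_size]))
    return slices


def reassemble(slices):
    # Position information in each slice makes arrival order irrelevant.
    ordered = sorted(slices, key=lambda s: (s.chunk_index, s.offset))
    return b"".join(s.data for s in ordered)


payload = bytes(range(256)) * 40  # 10,240 bytes of sample data
slices = to_slices(payload, chunk_size=4096, slice_size=1000)
random.Random(0).shuffle(slices)  # simulate out-of-order arrival
restored = reassemble(slices)
```

Because every slice names its chunk and offset, a receiver can accept slices from several connections at once and still rebuild the original bytes.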

3.4 Attribute based namespace

In SmartBox, we provide both traditional hierarchical and attribute-based namespaces to meet the various needs of different users. The metadata maintained by the metadata server includes the mappings from files to chunks, the mappings from chunks to chunk servers and the attribute-based namespaces, which present a user-friendly structure over a set of files. Unlike a hierarchical namespace, which provides users with a static tree view of files, attribute-based namespaces are generated dynamically from the users' semantic queries. Semantic queries are described in terms of (attribute, value) pairs. To support such queries, additional mappings from attributes to files are needed. Through the attribute mappings, files are associated with collections of attributes that can be used for the dynamic generation of namespaces. For example, if a file is associated with the attributes (datatype, music) and (artist, FIR), users can obtain a namespace containing the file by submitting a query such as “datatype = music and artist = FIR” to the metadata server.
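One plausible way to realize the mappings from attributes to files is an inverted index keyed by (attribute, value) pairs, with a conjunctive query answered by set intersection. The sketch below assumes that structure; the class `AttributeIndex` and its methods are hypothetical and are not taken from SmartBox.

```python
from collections import defaultdict


class AttributeIndex:
    """Hypothetical per-user mapping from (attribute, value) pairs to file names."""

    def __init__(self):
        self._files = defaultdict(set)

    def tag(self, filename, **attributes):
        # Associate the file with each of its (attribute, value) pairs.
        for pair in attributes.items():
            self._files[pair].add(filename)

    def query(self, **conditions):
        # "a = x and b = y" becomes the intersection of two file sets.
        matches = [self._files.get(pair, set()) for pair in conditions.items()]
        return set.intersection(*matches) if matches else set()


index = AttributeIndex()
index.tag("song1.mp3", datatype="music", artist="FIR")
index.tag("song2.mp3", datatype="music", artist="Mozart")
index.tag("trip.jpg", datatype="photo")
result = index.query(datatype="music", artist="FIR")
```

Here `result` plays the role of the dynamically generated namespace for the query “datatype = music and artist = FIR”.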

In addition to attribute-based namespaces, SmartBox also supports traditional hierarchical namespaces. To make attribute-based namespaces compatible with hierarchical ones, we build hierarchical namespaces automatically from the attribute mappings. As shown in Fig. 2, in a hierarchical namespace the attributes of files are translated into directories, which make up the path identifying each file. For example, the music file with the attributes (datatype, music), (musictype, classical), (artist, Mozart) and (filename, Sinfonia Concertante K364) is translated into the path “/music/classical/Mozart/Sinfonia Concertante K364”. As we can see, the path is a combination of the attribute values. In fact, we choose only a subset of attributes, sufficient to identify files in SmartBox, for the translation. Different orderings of the attributes generate different paths. Therefore, we define a weight for each attribute to fix the order of the attributes. In the example, the attribute datatype has the highest weight, while filename has the lowest. Though we can translate attribute-based namespaces into hierarchical ones, the translation does not work in reverse. Once the hierarchical namespace is generated, traditional users may continue to work in it, and the deletion and creation of files and directories complicate translation. Even worse, a total reorganization may take place if a traditional user takes over the management of the storage from a semantic user. Therefore, the paths of files in hierarchical namespaces cannot be assumed to represent the corresponding attributes.

Fig. 2 The hierarchical namespace generated from file attributes
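The weight-based translation from attributes to a path can be sketched in a few lines. The weight values below are hypothetical (the text only states that datatype ranks highest and filename lowest), as are the function name `to_path` and the extra `rating` attribute used to show that unweighted attributes are left out of the path.

```python
# Hypothetical weights: a higher weight places the attribute closer to the root.
WEIGHTS = {"datatype": 4, "musictype": 3, "artist": 2, "filename": 1}


def to_path(attributes, weights=WEIGHTS):
    # Only the weighted subset of attributes takes part in the translation.
    chosen = [name for name in attributes if name in weights]
    # Sorting by weight fixes one ordering, so equal attributes give equal paths.
    chosen.sort(key=lambda name: weights[name], reverse=True)
    return "/" + "/".join(attributes[name] for name in chosen)


path = to_path({
    "filename": "Sinfonia Concertante K364",
    "artist": "Mozart",
    "rating": "5",  # not in WEIGHTS, so it does not appear in the path
    "musictype": "classical",
    "datatype": "music",
})
```

The input dictionary is intentionally out of order: the weights, not the order in which attributes were attached, determine the resulting path.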

Unrestricted operations in the hierarchical and attribute-based namespaces can cause filename conflicts. For instance, suppose a user creates a file with the path “/music/classical/Mozart/Sinfonia Concertante K364” in the hierarchical namespace without adding any attributes, and then switches to the attribute-based namespace. There the file “Sinfonia Concertante K364” cannot be found, because it lacks attribute descriptions. If the user then adds the same file again with detailed attributes, a filename conflict occurs when the user switches back to the hierarchical namespace. Conflicts can arise in attribute-based namespaces in the same way. To avoid these situations, conflict detection is indispensable for add, move and copy operations. The detection is carried out across namespaces and places additional overhead on the metadata server. However, the metadata cache in clients can reduce the burden on the metadata server.
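The scenario above can be reproduced with a small sketch of cross-namespace conflict detection for the add operation. Everything here is an assumption made for illustration: the class `Namespaces`, the exception `ConflictError` and the fixed attribute order in `path_of` are hypothetical, and move and copy would need the same check.

```python
class ConflictError(Exception):
    """Raised when an add would collide with an existing hierarchical path."""


def path_of(attributes):
    # Hypothetical fixed ordering of the identifying attributes.
    order = ("datatype", "musictype", "artist", "filename")
    return "/" + "/".join(attributes[name] for name in order)


class Namespaces:
    def __init__(self):
        self.paths = set()    # every path visible in the hierarchical view
        self.attributes = {}  # path -> attributes, only for files that have them

    def add_by_path(self, path):
        if path in self.paths:
            raise ConflictError(path)
        self.paths.add(path)

    def add_by_attributes(self, attributes):
        path = path_of(attributes)
        if path in self.paths:  # check across namespaces before committing
            raise ConflictError(path)
        self.paths.add(path)
        self.attributes[path] = attributes
        return path


store = Namespaces()
store.add_by_path("/music/classical/Mozart/Sinfonia Concertante K364")
try:
    store.add_by_attributes({
        "datatype": "music", "musictype": "classical",
        "artist": "Mozart", "filename": "Sinfonia Concertante K364",
    })
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```

Without the check in `add_by_attributes`, the second add would succeed in the attribute view and collide only later, when the hierarchical view is regenerated.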

SmartBox provides shadow storage services for personal data storage and sharing. This means that different users have individual namespaces based on their personal storage space; the attribute mappings are managed separately for each user. In SmartBox, a semantic query only requires searching the attribute mappings that belong to the requesting user. Therefore, given the modest total number of files held by a personal user, the generation of namespaces is unlikely to incur performance problems.

3.5 Cache and consistency

In SmartBox, the metadata server bears heavy responsibilities: generating namespaces and translating files to chunks and chunks to locations. In addition, the maintenance of the various mappings, the monitoring of node states and routine data management such as periodic garbage collection all fall to the metadata server. Clearly, we need a data cache in the clients to share this work with the servers, so that the metadata server does not become a bottleneck.

Clients cache personal metadata such as attribute mappings, file mappings and chunk mappings. Namespaces can therefore be generated locally, and interactions with the servers are reduced. However, caching introduces consistency problems in some cases.

SmartBox provides a very simple consistency model to address these issues. Since no write operation is allowed after a file is committed, there is no coherence problem for cached file data. However, SmartBox does not provide explicit synchronization of caches, so three issues remain to be addressed:

(1) Operations that modify metadata must be committed to the metadata server rather than to the local cache. Although write operations are not allowed after a file is committed, delete, move and attribute edit operations are. These operations must be delivered directly to the metadata server to make sure that the metadata on the server is always up to date. As a result, the latest metadata cannot be lost even if a client suddenly leaves the system.

(2) To simplify the design, no explicit operation locks are provided in SmartBox. Since the access pattern conforms to the WORM (Write Once, Read Many) model, a file does not become accessible until its initial write operation completes, and afterwards further write operations are not permitted. A write-read lock is therefore unnecessary in our system. However, several issues need further discussion, including concurrent writes and conflicts among delete, move and attribute edit operations. Although files are write-protected once the first write completes, concurrent writes before that point could cause trouble if there were no locks. An implicit write lock is therefore used to avoid the race condition in this case. One surprising consequence is that a user may be unable to write a file even though the file cannot be found in SmartBox, because the file is being written by another client but has not been committed yet. The lock is released automatically when the write session closes, whether the close is caused by the completion of the write or by an unexpected crash of the client. Conflicts among delete, move and attribute edit operations are left unresolved in SmartBox. This means that client A can delete a file that has been opened by client B. After the deletion, B can still operate on the file until it closes the file, because our garbage collection mechanism retains the physical data for a sufficiently long time before reclaiming it. Such inconsistency is acceptable in our data sharing and storage environment.

(3) SmartBox provides no guarantee about the synchronization of caches. As a result, clients can occasionally access deleted files and generate out-of-date namespaces based on the local cache. However, the files and the namespaces remain intact and accessible. Moreover, manual synchronization can be performed ahead of critical operations. In addition, the client utilities carry out synchronization automatically before delete, move and attribute edit operations.
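The three rules above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the SmartBox implementation (which is written in Erlang): the class names MetadataServer and Client, their methods, and the use of plain dictionaries are all hypothetical.

```python
class MetadataServer:
    """Hypothetical stand-in for the metadata server (not the real API)."""

    def __init__(self):
        self.committed = {}       # file name -> attributes of committed files
        self.write_locks = set()  # names with an open, uncommitted write session

    def begin_write(self, name):
        # Implicit write lock: refuse a second concurrent writer, and refuse
        # any rewrite of a committed file (write once, read many).
        if name in self.write_locks or name in self.committed:
            return False
        self.write_locks.add(name)
        return True

    def close_write(self, name, attrs=None, completed=True):
        # Closing the session always releases the lock. The file becomes
        # visible only if the write completed (not after a client crash).
        self.write_locks.discard(name)
        if completed:
            self.committed[name] = dict(attrs or {})

    def delete(self, name):
        self.committed.pop(name, None)

    def snapshot(self):
        return {n: dict(a) for n, a in self.committed.items()}


class Client:
    """Hypothetical client that answers reads from a local metadata cache."""

    def __init__(self, server):
        self.server = server
        self.cache = server.snapshot()

    def sync(self):
        self.cache = self.server.snapshot()

    def list_files(self):
        # Served from the cache only, so the result may be out of date.
        return sorted(self.cache)

    def delete(self, name):
        # Synchronize first, then send the change straight to the server.
        self.sync()
        if name not in self.cache:
            return False
        self.server.delete(name)
        self.cache.pop(name)
        return True
```

In this model a second begin_write on the same name fails while the first session is open, even though the file is not yet listed anywhere, and a client that has not synchronized still lists a file that another client has already deleted.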

3.6 User utilities

To facilitate operations in the system, we provide a file explorer tool and a user-space virtual file system. Furthermore, a web portal is being developed for the management of users and groups.

Above all, the most important thing for users is locating and accessing data. There are two ways to navigate and access the semantic data: direct search queries and data navigation. Direct search queries are supported by both the explorer and the file system, while data navigation, which presents visual guidance for the users, is implemented by the explorer.

The search queries conform to a set of simple customized syntax rules. (1) A query is made up of a sequence of triples of the form (attribute, operator, value). (2) The attribute must be chosen from the given attribute set. (3) The operator must be one of “eq”, “ls”, “neq”, “gt”, “ngt” and “nls”, which represent “equal”, “less”, “not equal”, “greater”, “not greater” and “not less” respectively. Each triple thus expresses one basic query condition. The query result, which is returned as a specific attribute based namespace, must meet all the conditions. In other words, only the logical “AND” operation is currently supported. However, in most cases this is enough to describe the target files.
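A minimal sketch of how such queries could be evaluated is given below. The concrete surface syntax is an assumption: we take a query to be a whitespace-separated string of triples, read “neq” as “not equal” and “nls” as “not less” as the abbreviations suggest, and store file attributes in plain dictionaries. The function names parse_query, matches and search are hypothetical.

```python
import operator

# Operator keywords from the text, mapped to Python comparisons.
OPERATORS = {
    "eq": operator.eq,    # equal
    "neq": operator.ne,   # not equal
    "ls": operator.lt,    # less
    "nls": operator.ge,   # not less
    "gt": operator.gt,    # greater
    "ngt": operator.le,   # not greater
}


def parse_query(text):
    """Split 'attr op value attr op value ...' into (attr, op, value) triples."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("a query must be a sequence of complete triples")
    triples = []
    for i in range(0, len(tokens), 3):
        attr, op, value = tokens[i:i + 3]
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op}")
        triples.append((attr, op, value))
    return triples


def matches(attrs, triples):
    """Return True only if every triple holds (implicit AND)."""
    for attr, op, value in triples:
        if attr not in attrs:
            return False
        actual = attrs[attr]
        # Compare numerically when the stored attribute is a number.
        if isinstance(actual, (int, float)):
            value = type(actual)(value)
        if not OPERATORS[op](actual, value):
            return False
    return True


def search(files, text):
    """Return the sorted names of all files that satisfy the query."""
    triples = parse_query(text)
    return sorted(name for name, attrs in files.items() if matches(attrs, triples))
```

For example, with files described as {"b.mp3": {"datatype": "music", "year": 2009}}, the query "datatype eq music year nls 2009" selects b.mp3, because both conditions hold.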

Data navigation is based on the idea of a hierarchical namespace. In fact, data navigation is the process of traveling over a particular hierarchical namespace, which is always generated according to the attributes, regardless of the actual hierarchical file structure.
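The idea of deriving a hierarchy from attributes can be sketched as follows. This is an illustration under assumptions rather than the explorer's actual algorithm: the attribute order is supplied by the caller, files that lack an attribute are grouped under a placeholder directory named "unknown", and the names build_namespace and list_dir are hypothetical.

```python
FILES_KEY = "__files__"  # reserved key holding the files of a leaf directory


def build_namespace(files, attribute_order):
    """Group files into nested dicts, one directory level per attribute."""
    root = {}
    for name, attrs in files.items():
        node = root
        for attr in attribute_order:
            # The attribute value becomes the directory name at this level.
            key = str(attrs.get(attr, "unknown"))
            node = node.setdefault(key, {})
        node.setdefault(FILES_KEY, []).append(name)
    return root


def list_dir(namespace, path):
    """Return (subdirectories, files) found at the given path."""
    node = namespace
    for part in path:
        node = node[part]
    dirs = sorted(k for k in node if k != FILES_KEY)
    return dirs, sorted(node.get(FILES_KEY, []))
```

Because the tree is computed from attribute mappings alone, the same set of files can be presented under different hierarchies simply by changing attribute_order.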

Currently, all attributes of a file except the filename must be edited manually using the explorer. Once a file is added into SmartBox, its filename and datatype are defined automatically: the datatype is set to “other”, and the filename is set to the actual filename. Therefore, a file that has been created without editing its attributes can be found in the “other” directory of the hierarchical namespace. In the current implementation, editing is still tedious: users have to perform steps similar to data navigation to complete the setting. For existing files, deleting and modifying current attributes and inserting additional attributes are supported in the explorer.

4 Implementation and evaluation

In this section, we report an extensive performance study conducted to evaluate SmartBox. We study SmartBox in two aspects: first its performance on data operations, then its performance on metadata operations.

4.1 Implementation

We implemented SmartBox with the features discussed in the previous section. The server is implemented in Erlang, a functional programming language, and is deployed as a set of modules that make up an application. Many features of Erlang, such as dynamic reloading of modules and the built-in fault tolerance mechanism, make our design easier to implement, easier to experiment with and easier to upgrade.

As mentioned in Sect. 3.3, our design uses chunk replicas to improve the reliability of the system. In the implementation, we set the number of replicas to two. Strict synchronization of replicas brings extra upload overhead. Therefore we use asynchronous replication for performance reasons, which means that the upload of the primary replica alone determines the waiting time of the users.

The reliability of SmartBox is based on the built-in fault tolerance mechanism of Erlang. Processes run under the supervision of Erlang supervisors, so restart strategies can be applied immediately if a process fails. The successor will run on another node if the original node fails. On the other hand, all chunks are replicated on two chunk servers. However, the data is replicated asynchronously in order to improve the response time of user operations. In fact, there is a trade-off between performance and reliability. The reliability can be improved by increasing the number of replicas and using a hybrid replication scheme. For example, the number of replicas could be changed to five, with three replicas chain-replicated synchronously and the other two replicated asynchronously. However, in the prototype the data is replicated asynchronously for simplicity of implementation and for better performance.
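The effect of asynchronous replication on the user's waiting time can be sketched with a background worker. The prototype itself is written in Erlang, so the following is only an analogy under assumptions: the class ChunkStore, its methods and the in-memory dictionaries standing in for chunk servers are all hypothetical.

```python
import queue
import threading


class ChunkStore:
    """Toy model: one primary replica plus asynchronously updated copies."""

    def __init__(self, replica_count=2):
        # Each dict stands in for one chunk server; index 0 is the primary.
        self.replicas = [dict() for _ in range(replica_count)]
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._replicate, daemon=True)
        worker.start()

    def upload(self, chunk_id, data):
        # Only the primary write is on the caller's critical path.
        self.replicas[0][chunk_id] = data
        self._pending.put((chunk_id, data))

    def _replicate(self):
        # Background loop that copies queued chunks to the other replicas.
        while True:
            chunk_id, data = self._pending.get()
            for replica in self.replicas[1:]:
                replica[chunk_id] = data
            self._pending.task_done()

    def wait_for_replication(self):
        # Block until every queued chunk has reached all replicas.
        self._pending.join()
```

Here upload returns as soon as the primary copy is stored, while the second copy appears later. A synchronous or hybrid scheme would instead make upload wait for some or all of the secondary writes before returning.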

In our system, there are two clients available besides the web portal. One of them is implemented based on fuse (footnote 1) and fuserl (footnote 2), which provide access to the Linux file system API from user space and expose corresponding Erlang bindings. Therefore, we can expose SmartBox to users as a local file system. The other client provides a portable GUI for users and is implemented based on Java RCP.

The available tools and packages facilitate the implementation of our prototype. The obvious disadvantage is a performance penalty compared to a native C/C++ implementation, but as we will show, the prototype meets the requirements and provides acceptable performance for the users.

4.2 Experiment setup

In order to judge the feasibility of our design, we would need mobile devices such as smart phones, digital media players and gaming devices to build the intended deployment environment for SmartBox. Unfortunately, few such devices are at hand. In addition, special client tools would need to be implemented to support the various mobile devices. In this paper, we focus on the feasibility of the system design based on a simplified prototype system. Therefore, we carry out the evaluations in a simulated environment made up of desktop computers and a laptop.

To measure the performance of the prototype, we used seventeen dedicated desktop PCs running Ubuntu 8.04, each with an Intel Xeon 5110 1.60 GHz processor, 2 GB of RAM, a 7200 RPM 150 GB SATA hard disk and a 1000 Mbps Ethernet card. Eight of these computers are assigned as chunk servers and one as the metadata server. The other eight machines act as clients. Both the clients and the servers are connected to the same 1000 Mbps switch. A Compaq N620c laptop with a 1.4 GHz Pentium M processor, 768 MB of RAM and a Netgear WG511 54 Mbps wireless PC card is also employed to act as the netbook. The netbook is connected to the servers via the campus wireless network.

To speed up the experiments, hundreds of files representing songs and photos, and tens of films, together with the corresponding attributes, are imported into SmartBox automatically by a customized tool. In fact, the files are generated randomly. However, this does not affect the results of our evaluations, because we are interested not in the content of the files but in their attributes. Therefore, based on the imported data collections, attribute based data navigation can be carried out in a more realistic scenario.

1 File system in user space. http://fuse.sourceforge.net/.
2 Erlang bindings for fuse. http://code.google.com/p/fuserl/.

Table 1 Performance of upload and download

Operation   Duration     Total size   Throughput
Upload      131,639 ms   256 MByte    15.56 Mbps
Download    156,335 ms   256 MByte    13.10 Mbps

We registered nine personal accounts in SmartBox for testing. Each account is allowed 50 GB of storage space. Initially, twenty percent of the storage space is filled with the data collections described above. The data stored in a personal account is invisible to the other accounts.

The performance of SmartBox is evaluated in a number of different situations. The data throughput of large file and small file operations, as well as the overhead of metadata operations, are examined.

4.3 Data operations

We first evaluate the performance of data operations in SmartBox with a micro-benchmark representing a workload common to mobile devices: a digital camera uploading a video clip and a digital media player downloading it from the system. In this scenario, the digital media player shares an account with the digital camera, so the data generated by the camera can be stored in the system and retrieved by the player. In this micro-benchmark, the laptop carries out the uploading and downloading tasks. First, the laptop uploads a file of 256 MByte into the system. Then it downloads the file from the system. In order to avoid the influence of the caches of the underlying systems, the client downloads a set of training files before the download test starts, so that the caches are flushed.

Fig. 3 Performance of upload and download as file size varies in SmartBox

Table 1 shows the average time to complete the uploading and downloading tasks over five rounds. In order to obtain a baseline, we use iperf to measure the network performance; the baseline network bandwidth is also averaged over five rounds. The end-to-end bandwidth between the server and the client is 15.6 Mbps. In the experiments, we found a gap of about 2.5 Mbps between the download throughput and the baseline. The inefficient implementation of file assembling accounts for it. In the uploading process, files are split into chunks, which are uploaded to the chunk servers. In the download process, files are assembled from chunks downloaded from the chunk servers. However, our current implementation of file assembling has not been optimized yet. Extra data copying and inefficient data assembling bring significant overhead on top of the data transfer. Nevertheless, the performance of data operations on large files looks acceptable in our system.
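As a consistency check, the throughput values in Table 1 and the reported gap to the iperf baseline can be re-derived from the durations. The sketch below assumes decimal units (1 MByte = 8 Mbit); the function name throughput_mbps is hypothetical.

```python
def throughput_mbps(size_mbyte, duration_ms):
    """Throughput in Mbit/s, taking 1 MByte as 8 Mbit (decimal units assumed)."""
    return size_mbyte * 8 / (duration_ms / 1000)


upload = throughput_mbps(256, 131_639)    # upload row of Table 1
download = throughput_mbps(256, 156_335)  # download row of Table 1
baseline = 15.6                           # iperf end-to-end bandwidth in Mbps
gap = baseline - download                 # shortfall of download vs. baseline

print(f"upload   {upload:.2f} Mbps")
print(f"download {download:.2f} Mbps")
print(f"gap      {gap:.2f} Mbps")
```

Under this assumption the computed values agree with the table entries of 15.56 Mbps and 13.10 Mbps, and the difference between the baseline and the download throughput is about 2.5 Mbps, as stated in the text.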

In order to study the performance of data operations in SmartBox more thoroughly, we conduct another group of experiments on files of various sizes. In these experiments, the file size varies from 16 KByte to 64 MByte. In a single round, a set of files of the same size, summing up to 64 MByte, is first uploaded by the laptop and later downloaded. Figure 3 shows the results of the experiment. It plots both the total time of the operations and the data throughput for downloading and uploading. These results clearly demonstrate the overhead of the metadata operations associated with the data operations.

When the file size decreases below 256 Kbyte, the achieved throughput shrinks drastically, from about 12 Mbyte to 2 Mbyte. Extra requests and replies for locating and allocating chunks are needed to accomplish data operations, so small files bring more overhead than large ones. We can observe that upload performance changes more rapidly than download performance, because writes suffer from remote metadata operations. The client's metadata cache speeds up metadata operations for reads, whereas for writes the metadata must be synchronized with the metadata server.

Moreover, it can be observed that the upload operations always outperform the download ones. Part of the reason is that we have not adopted any data cache mechanism in the chunk servers, so data is read directly from disk in the download tests. On the other hand, write operations in the upload tests benefit from the cache of the local file system on the chunk server. Furthermore, the situation is aggravated by shortcomings of our prototype implementation, which we will improve in future work. The exceptional performance drop at 64 MByte is the result of thrashing on the public campus wireless network.

Table 2 Performance of metadata operations

Operation          Duration    Total number
list (hierarchy)   140 ms      10,000
list (attribute)   282 ms      10,000
remove             8472 ms     10,000
make dir           39307 ms    10,000

4.4 Metadata operations

We have studied the metadata operations associated with data operations in the previous experiments. However, there are other essential metadata operations that still need to be studied. The operations of interest include list, remove and make dir. Currently, our system supports listing both hierarchical namespaces and attribute based ones. In order to study the performance of metadata operations, we import 10,000 songs whose attributes name the same artist. The attribute based list operation lists the songs of that artist. In addition, we import 10,000 text files without explicit attributes into the same directory in the hierarchical namespace. The hierarchical list operation lists the files in that directory. In the experiments, the durations of the operations are measured excluding the display delay. As shown in Table 2, the hierarchical list operation is faster than the attribute based one. Since both list operations are carried out locally based on metadata caches, the performance is always acceptable. The remove and make dir operations are much slower than list because they need to communicate with the metadata server. Furthermore, the overhead of cache synchronization aggravates the performance degradation. As mentioned in Sect. 3.5, cache synchronization between the local cache and the metadata server is carried out automatically before remove operations. In the experiment, we remove the files by removing the directory in which they reside, so the single remove operation itself completes quickly. However, once the overhead of the automatic cache synchronization before the remove operation is taken into account, the measured remove time is much longer than expected. Compared with the make dir operation, remove performs better. Garbage collection, which actually removes the unused files asynchronously, accounts for the faster remove.
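To make the comparison in Table 2 more concrete, the durations can be normalized per item. This sketch assumes that each Duration entry is the total time for all 10,000 items of that row, which the table does not state explicitly; the variable names are hypothetical.

```python
# Durations from Table 2, in milliseconds.
durations_ms = {
    "list (hierarchy)": 140,
    "list (attribute)": 282,
    "remove": 8472,
    "make dir": 39307,
}
ITEMS = 10_000  # "Total number" column, identical for every row

# Average cost per item, assuming Duration covers all ITEMS items.
per_item_ms = {op: ms / ITEMS for op, ms in durations_ms.items()}

# Relative costs discussed in the text.
list_ratio = durations_ms["list (attribute)"] / durations_ms["list (hierarchy)"]
mkdir_vs_remove = durations_ms["make dir"] / durations_ms["remove"]

for op, ms in per_item_ms.items():
    print(f"{op:18s} {ms:.4f} ms per item")
```

Under this reading, the attribute based list costs roughly twice as much as the hierarchical one, and make dir costs several times as much as remove, which matches the ordering described above.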

5 Conclusion

It is clear that cloud storage and mobile devices are being widely used. However, how to let capability-limited mobile devices benefit from the growing cloud storage is still a big challenge. In this paper, we have described our attempt to bring cloud storage into pervasive computing environments to facilitate data storage and sharing for mobile devices. We present the design of our prototype SmartBox, a cloud storage platform for mobile computing environments. To ease data management for users, files in SmartBox can be navigated in attribute based namespaces as well as in traditional hierarchical namespaces. In an attribute based namespace, semantic queries can be used to help locate the desired files. In addition, for the purpose of data sharing, we provide group storage besides personal storage. Our experimental studies show that SmartBox is a promising system for data storage and sharing.

Our future work will focus on improving the attribute based namespace so that we can support complex queries and self-describing attributes. We also plan to optimize the performance of SmartBox and to develop customized client tools for various mobile devices in order to carry out practical experiments. In addition, ad hoc data storage and sharing among devices should be supported in our system.

Acknowledgements This work is co-sponsored by the Natural Science Foundation of China (60673152, 60773145, 60803121), the National High-Tech R&D (863) Program of China (2006AA01A101, 2006AA01A106, 2006AA01A108, 2006AA01A111, 2006AA01A117), the National Basic Research (973) Program of China (2004CB318000), and the Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation.

References

1. Foster, I., Zhao, Y., Raicu, I., Lu, S.: Cloud computing and grid computing 360-degree compared. In: Proceedings of Grid Computing Environments Workshop, 2008, pp. 1–10. Austin, Texas (2008)

2. Salmon, B., Schlosser, S.W., Cranor, L.F., Ganger, G.R.: Perspective: Semantic data management for the home. In: Proceedings of 7th USENIX Conference on File and Storage Technologies (FAST). San Francisco, CA (2009)

3. Amazon Simple Storage Service (S3): http://www.amazon.com/s3/

4. Windows Live Mesh: http://www.mesh.com/

5. Mozy homepage: http://mozy.com

6. Symantec’s Protection Network: http://www.spn.com/

7. Michael, V., Stefan, S., Geoffrey, M.V.: Cumulus: Filesystem backup to the cloud. In: Proceedings of 7th USENIX Conference on File and Storage Technologies (2009)

8. Fitzpatrick, B.: Brackup. http://code.google.com/p/brackup/

9. Jungle disk: http://www.jungledisk.com/

10. Daniel, P., Jason, F.: EnsemBlue: Integrating distributed storage and consumer electronics. In: Proceedings of 7th Symposium on Operating Systems Design and Implementation (OSDI). Seattle, WA (2006)

11. Preguiça, N., Baquero, C., Martins, J.L., Shapiro, M., Paulo, S., Almeida, Domingos, H., Fonte, V., Duarte, S.: Few: File management for portable devices. In: Proceedings of 1st International Workshop on Software Support for Portable Storage (IWSSPS). San Francisco, CA (2005)

12. Terry, D.B., Theimer, M.M., Petersen, K., Demers, A.J., Spreitzer, M.J., Hauser, C.H.: Managing update conflicts in Bayou, a weakly connected replicated storage system. In: Proceedings of 15th ACM Symposium on Operating Systems Principles (SOSP). Copper Mountain, CO (1995)

13. Popek, G.J., Guy, R.G., Page, T.W. Jr., Heidemann, J.S.: Replication in Ficus distributed file systems. In: Proceedings of 1st Workshop on the Management of Replicated Data (WMRD), pp. 20–25. Houston, TX (1990)

14. Kistler, J.J., Satyanarayanan, M.: Disconnected operation in the Coda file system. ACM Trans. Comput. Syst. 10(1), 3–25 (1992)

15. David, K.G., Pierre, J., Mark, A.S., James, W.J.: Semantic file systems. In: Proceedings of 13th ACM Symposium on Operating System Principles (SOSP). Pacific Grove, CA (1991)

16. Burra, G., Udi, M.: Integrating content-based access mechanisms with hierarchical file systems. In: Proceedings of 3rd USENIX Symposium on Operating Systems Design and Implementation (OSDI). New Orleans, LA (1999)

17. Google desktop web page: http://desktop.google.com

18. Beagle web page: http://beagle-project.org

19. Dahlia, M., Doug, T.: Concise version vectors in WinFS. In: Proceedings of 19th International Symposium on Distributed Computing (DISC). Cracow, Poland (2005)

20. Sanjay, G., Howard, G., Leung, S.: The Google file system. In: Proceedings of 19th ACM Symposium on Operating Systems Principles, pp. 29–43. Lake George, New York (2003)

21. Borthakur, D.: The Hadoop distributed file system: Architecture and design. http://hadoop.apache.org/core/docs/r0.18.2/hdfs_design.pdf (2007)

22. Corsair Project in Tsinghua University: http://corsair.thuhpc.org/

Weimin Zheng graduated from the Department of Automation Control at Tsinghua University in 1970 and has been teaching at Tsinghua ever since. He received an M.S. degree from the Department of Computer Science and Technology in 1982. He is now president of the Institute of High Performance Computing and managing director of the Chinese Computer Society. He is also a professor and doctoral supervisor at the Institute of High Performance Computing. His research interests focus on the architecture of high performance computer systems, parallel/distributed processing, development environments for parallel programs, etc.


Pengzhi Xu is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He received a B.A. degree in 2003 from Tianjin University, China and an M.S. degree in 2006 from Tsinghua University. He is especially interested in data grids, distributed systems, etc.

Xiaomeng Huang received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, China in 2007. He received a B.A. degree in 2000 from Wuhan University, Wuhan, China and an M.S. degree in 2003 from Huazhong University of Science and Technology, Wuhan, China. He is especially interested in distributed systems, computer networks, etc.

Nuo Wu is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He received a B.A. degree in 2004 from Harbin Institute of Technology, China. He is especially interested in data grids, distributed file systems, etc.