trusteddb a trusted hardware-based database with privacy

Upload: thebhas1954

Post on 01-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    1/14

    TrustedDB: A Trusted Hardware-BasedDatabase with Privacy and Data Confidentiality

    Sumeet Bajaj and Radu Sion

    Abstract—Traditionally, as soon as confidentiality becomes a concern, data are encrypted before outsourcing to a service provider.

    Any software-based cryptographic constructs then deployed, for server-side query processing on the encrypted data, inherently limit

    query expressiveness. Here, we introduce TrustedDB, an outsourced database prototype that allows clients to execute SQL queries

    with privacy and under regulatory compliance constraints by leveraging server-hosted, tamper-proof trusted hardware in critical query

    processing stages, thereby removing any limitations on the type of supported queries. Despite the cost overhead and performance

    limitations of trusted hardware, we show that the costs per query are orders of magnitude lower than any (existing or) potential future

    software-only mechanisms. TrustedDB is built and runs on actual hardware, and its performance and costs are evaluated here.

    Index Terms—Database architectures, security, privacy, special-purpose hardware

    Ç

    1 INTRODUCTION

    ALTHOUGH   the benefits of outsourcing and clouds arewell known [41], significant challenges yet lie in thepath of large-scale adoption since such services oftenrequire their customers to inherently trust the providerwith full access to the outsourced data sets. Numerousinstances of illicit insider behavior or data leaks have leftclients reluctant to place sensitive data under the control of a remote, third-party provider, without practical assurancesof  privacy  and confidentiality, especially in business, health-care, and government frameworks. Moreover, today’sprivacy guarantees for such services are at best declarative

    and subject customers to unreasonable fine-print clauses.For example, allowing the server operator to use customer behavior and content for commercial profiling or govern-mental surveillance purposes.

    Existing research addresses several such security as-pects, including access privacy and searches on encrypteddata. In most of these efforts, data are encrypted beforeoutsourcing. Once encrypted however, inherent limitationsin the types of primitive operations that can be performedon encrypted data lead to fundamental expressiveness andpracticality constraints.

    Recent theoretical cryptography results provide hope byproving the existence of universal homomorphisms, i.e.,

    encryption mechanisms that allow computation of arbitraryfunctions without decrypting the inputs [43]. Unfortunately,actual instances of such mechanisms seem to be decadesaway from being practical [17].

    Ideas have also been proposed to leverage tamper-proof hardware to privately process data server-side, rangingfrom smart-card deployment [25] in healthcare to moregeneral database operations [23], [32], [26].

    Yet, common wisdom so far has been that trustedhardware is generally impractical due to its performancelimitations and higher acquisition costs. As a result, withvery few exceptions [25], these efforts have stoppedshort of proposing or building full-fledged databaseprocessing engines.

    However, recent insights [9] into the cost-performancetradeoff seem to suggest that things stand somewhatdifferently. Specifically, at scale, in outsourced contexts,computation inside secure processors is orders of magnitudecheaper than any equivalent cryptographic operation per-

    formed on the provider’s unsecuredserver hardware, despitethe overall greater acquisition cost of secure hardware.

    This is so because the overheads for cryptography thatallows some processing by the server on encrypted data areextremely high even for simple operations. This fact isrooted not in cipher implementation inefficiencies butrather in fundamental cryptographic hardness assumptionsand constructs, such as trapdoor functions. Moreover, thisis unlikely to change anytime soon as none of the currentprimitives have, in the past half-century. New mathematicalhardness problems will need to be discovered to allow hopeof more efficient cryptography.

    As a result, we posit that a full-fledged privacy enablingsecure database leveraging server-side trusted hardwarecan be built and run at a fraction of the cost of any (existingor future) cryptography-enabled private data processing oncommon server hardware. We validate this by designingand building TrustedDB, an SQL database processingengine that makes use of tamper-proof cryptographiccoprocessors such as the IBM 4764 [3] in close proximityto the outsourced data.

    Tamper resistant designs, however, are significantlyconstrained in both computational ability and memorycapacity which makes implementing fully featured data- base solutions using secure coprocessors (SCPUs) very

    challenging. TrustedDB achieves this by utilizingcommon unsecured server resources to the maximumextent possible. For example, TrustedDB enables the SCPUto transparently access external storage while preserving

    752 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

    .   The authors are with the Computer Science Department, Stony BrookUniversity, Stony Brook, NY. E-mail: {sbajaj, sion}@cs.stonybrook.edu.

     Manuscript received 13 Jan. 2012; revised 4 July 2012; accepted 2 Feb. 2012;

     published online 20 Feb. 2013.Recommended for acceptance by K.-L. Tan.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2012-01-0029.Digital Object Identifier no. 10.1109/TKDE.2013.38.

    1041-4347/14/$31.00    2014 IEEE Published by the IEEE Computer Society

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    2/14

    data confidentiality with on-the-fly encryption. This elim-inates the limitations on the size of databases that can besupported. Moreover, client queries are preprocessed toidentify sensitive components to be run inside the SCPU.Nonsensitive operations are off-loaded to the untrustedhost server. This greatly improves performance and reducesthe cost of transactions.

    Overall, despite the overheads and performance limita-

    tions of trusted hardware, the costs of running TrustedDBare orders of magnitude lower than any (existing or)potential future cryptography-only mechanisms. Moreover,it does not limit query expressiveness.

    The contributions of this paper are threefold: 1) theintroduction of new cost models and insights that explainand quantify the advantages of deploying trusted hardwarefor data processing; 2) the design, development, andevaluation of TrustedDB, a trusted hardware based rela-tional database with full data confidentiality; and 3) detailedquery optimization techniques in a trusted hardware-basedquery execution model.

    2 THE REAL COSTS OF  SECURITY

    As soon as confidentiality becomes a concern, data need to be encrypted before outsourcing. Once encrypted, solutionscan be envisioned that: (A) straightforwardly transfer data back to the client where it can be decrypted and queried,(B) deploy cryptographic constructs server-side to processencrypted data, and (C) process encrypted data server-sideinside tamper-proof enclosures of trusted hardware.

    In this section, we will compare the per-transaction costsof each of these cases. This is possible in view of novelresults of Chen and Sion [9] that allow such quantification.We will show that, at scale, in outsourced contexts,(C) computation inside secure hardware processors isorders of magnitude cheaper than any equivalent crypto-graphic operation performed on the provider’s unsecuredcommon server hardware (B). Moreover, due to theextremely high cost of networking as compared withcomputation, the overhead of transferring even a smallsubset of the data back to the client for decryption andprocessing in (A) is overall significantly more expensivethan (C).

    The main intuition behind this has to do with theamortized cost of CPU cycles in both trusted and common

    hardware, as well as the cost of data transfer. Due toeconomies of scale, provider-hosted CPU cycles are 1-2orders of magnitude cheaper than that of clients and of trusted hardware. The cost of a CPU cycle in trustedhardware (56+ picocents,1 discussed below) becomes thusof the same order as the cost of a traditional client CPUcycle at (e.g., 14-27 picocents for small businesses) includingacquisition and operating costs.

    Additionally, when data are hosted far from theiraccessing clients, the extremely expensive network trafficoften dominates. For example, transferring a single bit of data over a network costs upwards of 3,500 picocents [9].

    Finally, cryptography that would allow processing on

    encrypted data demands extremely large numbers of cycleseven for very simple operations such as addition. This

    limitation is rooted in fundamental cryptographic hardnessassumptions and constructs, such as cryptographic trap-doors, the cheapest we have so far being at least as

    expensive as modular multiplication [31], which comes ata price tag of upwards of tens of thousands of picocents peroperation [9].

    The above insights lead to (C) being a significantly morecost-efficient solution than (A) and (B). We now detail.

    2.1 Cost of Primitives

    Compute cycles and networks. In [9], Chen and Sion derived thecost of compute cycles for a set of environments rangingfrom individual homes with a few PCs (H) to largeenterprises and compute clouds (L) (M, L   ¼   medium-,large-sized business). These costs include a number of 

    factors, such as hardware (server, networking), building(floor space leasing), energy (electricity), service (personnel,maintenance), and so on. Their main thesis is that, dueto economies of scale and favorable operating parameters,per-cycle costs decrease dramatically when run in largecompute providers’ infrastructures.

    The resulting CPU cycle costs (see Fig. 1) range from27 picocents for a small business environment to less thanhalf of a picocent for large cloud providers. Network servicecosts range from a few hundred picocents per bit fornondedicated service to thousands of picocents in the case of medium-sized businesses. Detailed numbers are available

    in [39], [9], [40].Also, the work in [39], [9] derives the cost of x86-

    equivalent CPU cycles inside cloud-hosted SCPUs such asthe IBM 4764 to be 56 picocents. We note that while this isindeed much higher than the 

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    3/14

    data confidentiality. The lower bound cost of queryexecution in this case is as follows2:

    Costunencrypted  ¼ 2 D C bit transmit

    þ  N 

    D 1

    C cycle server  addition

    ð1Þ

    where   N   is the size of the entire database in bits,   D ¼ 32(32 bit integers),  C cycle server   is the cost of one CPU cycle onserver hardware, and   addition ¼ 1  is the average number of CPU cycles required for an addition operation [22].C bit transmit   is the cost of transmitting 1 bit of data fromthe service provider to the client.

    (A) Transferring encrypted data to client.  The first baselinesolution for data confidentiality works by transferring theentire database to the client. The client then decrypts and

    aggregates the data. The cost of this alternative becomes

    Costtransfer  ¼ N  ðC bit transmit þ C bit decryptionÞ

    þ  N 

    D 1

    C cycle client  addition;

    ð2Þ

    where   C bit decryption ¼ 8  picocents is the normalized cost of decrypting 1 bit with AES-128 and C cycle client ¼ 2 picocents isthe cost of a single-client CPU cycle respectively in medium-sized (M) enterprises. Naturally, we observe that here thecost of transferring the database to the client dominates.

    (B) Cryptography.   Traditional additive homomorphisms[33], [28], [29] have been used in existing work [19], [42], toallow servers to run aggregation queries over encrypteddata. These allow the computation of the encryption of thesum of a set of encrypted values without requiring theirdecryption in the process.

    Existing homomorphisms require the equivalent work of at least a modular multiplication in performing theircorresponding operation, such as addition. Moreover, forsecurity, this modular multiplication needs to be performedin fields with a large modulus. For efficiency, [42] goes onestep further and proposes to perform aggregation in parallel by simultaneously adding multiple 32-bit integer values.They achieve this by adding two 1,024-bit chunks of encrypted data at a time. Due to the properties of the

    Paillier cryptosystem, each such addition involves one

    2,048-bit modular multiplication.3

    The server then computes the encrypted sum of all

    such large integers, which is equivalent to a single

    modular multiplication of the encrypted values modulo

    2,048, and returns the result to the client. The client

    decrypts the 2,048 bit result into a 1,024 bit plaintext, splits

    this into 32 integers of 32 bits each, and computes theirsum. The cost of this scheme is given by

    Costhomomorphic  ¼B hD 

      N 

    B h 1

    C modular mul

    þ 2 B h C bit transmit þ C homomorphic dec

    þ  B h

    D  1

    C cycle client  addition

    ;

    ð3Þ

    where B h  ¼ 1,024 is the plaintext block size and  C modular mulis the cost of performing a single modular multiplication

    modulo 2,048 on the server.   C homomorphic dec   is the cost of 

    performing the single decryption on client and involvesmodular multiplication and exponentiation.

    (C) SCPUs.  A possible use of an SCPU is to perform the

    aggregation fully within it. The result can then be

    reencrypted and transmitted back to the client.In addition to the core CPU processing costs (which can

     be computed directly from the cost of SCPU cycles), data

    transfer overheads are incurred, i.e., to bring encrypted data

    into the SCPU and then transfer the encrypted results back

    to the host server. The total cost of the solution becomes

    Costscpu ¼  N 

    B s ð srv C cycle srv þ  scpu C cycle scpuÞþ N  C bit decryption scpu

    þ  N 

    D 1

    C cycle scpu  addition scpu

    þ B c C bit encryption scpu þ B c C bit transmit

    þ B c C bit decyption client;

    ð4Þ

    where  srv  and  scpu are the server and SCPU cycles used to

    setup data transfer and include the cost of setting up and

    handling DMA interrupts.  C cycle scpu   is the cost of an SCPU

    cycle.   B s  ¼ 64 KB   is the block size of data transmitted

     between the server and the SCPU in one round and B c is thecipher block size (128 bits for AES).    addition scpu ¼ 2   is the

    number of cycles per addition operation in the SCPU for

    64 bit addition (on a 32 bit architecture).Fig. 2 shows the cost relationship between the solutions.

    It can be seen that for any data set sizes,   Costscpu <

    Costhomomorphic  and  Costscpu < Costtransfer. We also note that

    for data sets of size  

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    4/14

    2.3 Generalized Argument

    Recall that current cryptographic constructs are based ontrapdoor functions [18]. Currently, viable trapdoors are based on modular exponentiation in large fields (e.g.,2,048 bit modular operations) and viable homomorphismsinvolve a trapdoor for computing the ciphertexts. Addition-ally, the homomorphic operation itself involves processing

    these encrypted values at the server in large fields, whilerespecting the underlying encryption trapdoor, incurring atleast the cost of a modular multiplication [33], [28], [29]. Thisfundamental cryptography has not improved in efficiency indecades and would require the invention of new mathema-tical tools before such improvements are possible.

    Thus, overall, for large-scale, efficient deployments(e.g., clouds) where CPU cycles are extremely cheap(e.g., 0.45 picocents/cycle), performing the cheapest, leastsecure homomorphic operations (modular multiplication)comes at a price tag of at least 30,000 picocents [9] even forvalues as small as 32-bit (e.g., salaries and ZIP codes).

    Thus, even if we assume that in future developmentshomomorphisms will be invented that can allow full TuringMachine languages to be run under the encryptionenvelope, unless new trapdoor math is discovered, eachoperation will yet cost at least 30,000 picocents when run onefficient servers. By comparison, SCPUs process data at acost of 56 picocents/cycle. This is a difference of severalorders of magnitude in cost. We also note that, while ECCsignatures (e.g., even the weak ECC-192) may be faster,ECC-based trapdoors would be even more expensive, asthey would require two point multiplications, coming at aprice tag of least 780,000 cycles (see [11, p. 402]).

    Yet, this is not entirely accurate, as we also need to account

    for the fact that SCPUs need to read data in before processing.The SCPUs considered here feature a decryption throughputof about 10-14 MB/second for AES decryption [3], confirmedalso by our benchmarks. This limits the ability to processdata. For example, comparing two 32-bit integers as in a JOIN operation becomes dominated not by the single-cycleconditional JUMP CPU operation but by the cost of decryption. At 166-200 megacycles/second, this results inthe SCPU having to idly wait anywhere between 47 and80 cycles for decryption to happen in the crypto enginemodule before it can process the data. This in effect results inan amortized SCPU cost of between 2,632 and 4,480 picocents

    (3,556 picocents on average) for each operation whichreduces the above three orders of magnitude difference toonly one order of magnitude, still in favor of SCPUs.4

    The above holds even for the case when the SCPU hasonly enough memory for the two compared values. Further,in the presence of significantly higher, realistic amounts of SCPU memory (e.g.,   M  ¼ 32   MB for 4764-001), optimiza-tions can be achieved for certain types of queries such asrelational JOINs. The SCPU can read in and decrypt entiredata pages instead of single data items and run the JOINquery over as many of the decrypted data pages as wouldfit in memory at one time. This results in significant savings.

    To illustrate, consider a page size of  P   32-bit words and a

    simple JOIN algorithm for two tables of size   N  32-bitintegers each (we are just concerned with the join attribute).Then, the SCPU will perform a number of  ðN=P Þ2 þ ðN=P Þ

    page fetches each involving also a page data decryption at acost of   P   3,556   picocents. Thus, we get a total cost of ðN 

    2

    P   þ N Þ 3;556 þ N 2 56. For reasonable sizes, e.g.,   P  ¼

    M=2=4 ¼ 4   million, this cost becomes   3þ   orders of magnitude lower than the N 2 30;000 picocent cost incurredin the cryptography-based case.

    Cost versus Performance.   Given these   3þ   orders of magnitude cost advantages of the SCPU over cryptogra-phy-based mechanisms, we expect that for the abovediscussed aggregation query mechanism [42], the SCPU’soverall  performance  will also be at least comparable if not better despite the CPU speed handicap. We experimentally

    evaluated this hypothesis and achieved a throughput of about 1.07 million tuples/second for the SCPU. Bycontrast, in [42], best case scenario throughputs range between 0.58 and 0.92 million tuples/second and at muchhigher overall cost.

    Conclusion Summary. Fig. 3 compares SCPU-based queryprocessing with the most ideal cryptography-based me-chanisms employing a single modular multiplication. Notethat such idealistic crypto mechanisms have not beeninvented yet, but even if they were as Fig. 3 illustrates, forlinear processing queries (e.g., SELECT), the SCPU isroughly 1þ  order of magnitude cheaper. For JOIN queries,the SCPU costs drop even further even when assuming no

    available memory. Finally, in the presence of realisticamounts of memory, due to increased overhead amortiza-tion, the SCPU is multiple orders of magnitude cheaper thansoftware-only cryptographic solutions on legacy hardware.

    We note that the above conclusion   may not   apply totargeted niche scenarios. For example, it is entirely possiblethat by maintaining client precomputed data server-side,processing only a predefined set of queries or by supportingminimal query classes (such as only range queries)specifically designed niche solutions may turn out to becheaper than general-purpose full-fledged SCPU-backeddatabases like TrustedDB.

    3 ARCHITECTURE

    Fig. 4a depicts the TrustedDB architecture. In the following,we discuss some of the key elements.

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 755

    4. The cost can be reduced further by up to 50 percent if instead of AES, acipher is built using a faster cryptographic hash as a pseudorandomfunction [6].

    Fig. 3. SCPU is one to three orders of magnitude cheaper thancryptography.

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    5/14

    Overview.   To overcome SCPU storage limitations, the

    outsourced data are stored at the host provider’s site. Query

    processing engines are run on both the server and in the

    SCPU. Attributes in the database are classified as being

    either public or private. Private attributes are encrypted and

    can only be decrypted by the client or by the SCPU.Since the entire database resides outside the SCPU, its

    size is not bound by SCPU memory limitations. Pages that

    need to be accessed by the SCPU-side query processing are

    pulled in on demand by the Paging Module.

    Query execution entails a set of stages:

    1.   In the first stage, a client defines a database schemaand partially populates it. Sensitive attributes aremarked using the SENSITIVE keyword which theclient layer transparently processes by encryptingthe corresponding attributes:

    CREATE TABLE customer(ID integer

    primary key,

    Name char(72) SENSITIVE, Address

    char(120) SENSITIVE).

    2.   Later, a client sends a query request to the hostserver through a standard SQL interface. The queryis transparently encrypted at the client site using thepublic key of the SCPU. The host server thus cannotdecrypt the query.

    3.   The host server forwards the encrypted query to theRequest Handler inside the SCPU.

    4.   The Request Handler decrypts the query andforwards it to the Query Parser. The query is parsedgenerating a set of plans. Each plan is constructed byrewriting the original client query into a set of subqueries, and, according to their target data set

    classification, each subquery in the plan is identifiedas being either public or private.

    5.   The Query Optimizer then estimates the executioncosts of each of the plans and selects the best plan

    (one with least cost) for execution forwarding it tothe dispatcher.

    6.   The Query Dispatcher forwards the public queries tothe host server and the private queries to the SCPUdatabase engine while handling dependencies. Thenet result is that the maximum possible work is runon the host server’s cheap cycles.

    7.   The final query result is assembled, encrypted,digitally signed by the SCPU Query Dispatcher,and sent to the client.

    Query parsing and execution. Sensitive attributes can occur

    anywhere within a query, e.g., in SELECT, WHERE, orGROUP-BY clauses, in aggregation operators, or withinsubqueries. The Query Parser’s job is then:

    1.   To ensure that any processing involving privateattributes is done within the SCPU. All privateattributes are encrypted using a shared data encryp-tion keys between the client and the SCPU; hence,the host server cannot decipher these attributes (seeSection 5).

    2.   To optimize the rewrite of the client query such thatmost of the work is performed on the host server.

    To exemplify how public and private queries aregenerated from the original client query, we use examplesfrom the TPC-H benchmark [2]. TPC-H does not specify anyclassification of attributes based on security. Therefore, wedefine a attribute set classification into private (encrypted)and public (nonencrypted). In brief, all attributes thatconvey identifying information about customers, suppliers,and parts are considered private. The resulting query plans,including rewrites into main CPU and SCPU componentsfor TPC-H queries Q3 and Q6 are illustrated in Fig. 4.

    For queries that have WHERE clause conditions onpublic attributes, the server can first SELECT all the tuplesthat meet the criteria. The private attributes’ queries are

    then performed inside the SCPU on these intermediateresults, to yield the final result. For example, query Q6 of the TPC-H benchmark is processed as shown in Fig. 4b. Thehost server first executes a public query that filters all tuples

    756 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

    Fig. 4. TrustedDB architecture and query plans. Green and Red indicate public and private attributes, respectively.

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    6/14

    which fall within the desired  shipdate  and  quantity  range, both of these being public attributes. The result from thispublic query is then used by the SCPU to perform theaggregation on the private attributes   extendedprice   anddiscount. While performing the aggregation the privateattributes are decrypted inside the SCPU. Since theaggregation operation results in a new attribute composingof private attributes it is reencrypted within the SCPU before sending to the client.

    Note that the execution of private queries depends onthe results from the execution of public queries and vice-a-versa even though they execute in separate databaseengines. This is made possible by the TrustedDB QueryDispatcher in conjunction with the Paging Module.

    Data manipulation queries (INSERT, UPDATE) alsoundergo a rewrite. Moreover, any new generated attributevalues are reencrypted within the SCPU before updatingthe database. For illustration, consider the query   UPDATEEMPLOYEES SET SALARY  ¼   SALARY  þ  2000 WHERE ZIP¼   98239. If   SALARY  is a private attribute, then the query

    works by first decrypting the  SALARY attribute, performingthe addition and then reencrypting the updated values andis executed within the SCPU. On the other hand, if  SALARYis a public attribute, then the query will be executed entirely by the host server.

    We refer the reader to [39] for more detailed explanationof query processing including group by and nested queries.In this work, we focus on query optimization techniques ina trusted hardware model.

    3.1 Query Optimization

    3.1.1 Model 

    As per Section 3, due to the unavailability of storage withinthe SCPU, the entire database is stored at the server.However, the attribute classification into public (none-ncrypted) and private (encrypted) introduces a strictvertical partitioning (although logical) of the database between the server and the SCPU. The requirement to beadhered to is that any processing on private attributes must be done within the confinements of the SCPU. Thispartitioning of data resembles a federated database ratherthan a stand-alone DBMS. Since this partitioning is dictated by the security requirements of the application and all datareside on the server, existing techniques [35], [14] are ruledout. The following sections describe query optimization inTrustedDB within the scope of this logical partitioning.

    3.1.2 Overview 

    At a high-level query optimization in a database systemworks as follows:

    1.   The  Query Plan Generator   constructs possibly multi-ple plans for the client query.

    2.   For each constructed plan the  Query Cost Estimatorcomputes an estimate of the execution cost of that plan.

    3.   The best plan, i.e., one with the least cost, is then

    selected and passed on to the  Query Plan Interpretorfor execution.

    The query optimization process in TrustedDB workssimilarly with key differences in the  Query Cost Estimator

    due to the logical partitioning of data mentioned above.However, we note that all optimizations possible in atraditional DBMS with no private attributes are stillapplicable to public subqueries executed on the server.We refer the reader to [24] for details of these existingoptimizations.

    In the following sections, we only discuss cases thatare unique to TrustedDB and trusted hardware-baseddesigns alike.

     Metric.  The key in query optimization is estimating thecosts of various logical query plans. A common metricutilized in comparing query plan costs has been disk I/O[38], [36] which is justified since disk access is the mostexpensive operation and should be minimized. In thetrusted hardware model of TrustedDB, an additionalsignificant I/O cost is introduced, i.e., the server$SCPUdata transfer. Moreover, disk access on the server and theServer$SCPU communication have different costs. Inaddition, we also need to consider the disparity betweenthe computational abilities of the server and the SCPU. Tocombine all these factors, we use execution time as themetric for cost estimation. Note that from this pointonwards, any reference to the cost of a query plan refersto its execution time.

    Also, the goal of the   Query Optimizer   is not to measurequery execution times with high accuracy but only tocorrectly compare query plans based on an estimation. Toclarify, assume that a query Q has two valid execution plansP A  and  P B . Then, if the real execution times of  P A  and  P B are such that ET  realðP AÞ > ET  realðP B Þ, then it suffices for theQuery Optimizer to estimate ET  estðP AÞ > ET  estðP B Þ althoughthe values for  ET  realðP iÞ and  ET  estðP iÞ  may not be close.

     Approach.   To illustrate the query optimization in

    TrustedDB, we take the following approach for each case:

    1.   First, we present two alternative plans for the case.2.   Next, we estimate the execution times (ET  ) for each

    of the two plans.3.   We analyze the estimations of step 2 for selection of 

    the best plan.

    Then, in Section 4, we experimentally verify queryoptimization techniques.

    Note that in the running system multiple (>2) planscould be considered. We limit ourselves to only two here,for brevity. Also, steps 1-3 from above are performed by the

    TrustedDB Query Optimizer  (see Fig. 4a).3.1.3 System Catalog 

    Any query plan is composed of multiple individualexecution steps. To estimate the cost of the entire plan, itis essential to estimate the cost of individual steps andaggregate them. To estimate these costs, the   Query CostEstimator   needs access to some key information. Forexample, the availability of an index or the knowledge of possible distinct values of an attribute. These sets of information are collected and stored in the  System Catalog.Most available DBMS today have some form of periodicallyupdated System Catalog. Figs. 5 and 6 and Tables 1, 2, and 3

    give a partial view of the System Catalog maintained byTrustedDB. Later, in Section 3.1.5, we will see how thisinformation is used in estimating plan execution times.System Catalog content is categorized as follows:

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 757

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    7/14

    1.   System configuration (see Fig. 5).   These are theavailable compute capacities of the system hard-ware. This information is unlikely to change fre-quently and is configured during setup.

    2.   Benchmarked parameters (see Table 1). As part of queryexecution, many basic operations are performedwhich add to the overall execution time. Benchmarksare employed to determine the average executiontimes for these operations to aid in query costestimation. Unless changes occur in the system

    configuration, this information need not be updated.3.   Database configuration (see Fig. 6).   These parameters

    are directly set on the database to improve perfor-mance. The server DBMS is an off-the-shelf indus-trial quality database system (see Section 4) whichprovides large range of configuration parameters.With the TrustedDB design, this host DBMS can beconfigured independently.

    4.   Data Statistics (see Table 2). A data scan is employedto collect statistics about the actual data. This scan isconfigured to run periodically. The statistics onpublic attributes are collected server-side. However,

    private attributes are scanned via the SCPU, de-crypted, and then analyzed. Note that since thecollection process involves scan of the database, it isa time-consuming task and needs to be scheduledaccordingly (e.g., nightly or weekends).

    3.1.4 Analysis of Basic Query Operations 

    The cost of a plan is the aggregate of the cost of the stepsthat comprise it. In this section, we present how executiontimes for a certain set of basic query plan steps areestimated. These steps are reused in multiple plans, andhence, we group their analysis here.

    Index-based lookup.   Consider the selection query   Q ¼

     l shipdate¼10=01=1998. As per the System Catalog (see Table 2),there is a   B þ-Tree index available on the attributel shipdate. Hence, the expected execution time to locatethe first leaf page containing the result tuples will be

    ETðQÞ ¼ log ’l s:   ð5Þ

    Here,  log ’l  is number of index pages read, while s  is the

    time to read a single page from disk on server.Selection.   Estimating the execution time of selection

    queries requires estimation of the number of tuples thatwould comprise the query result. For this, we first definethe following from [24].

    Definition 1 (Values).   The   values   of an attribute   A,

    V aluesðAÞ  or  A, is the number of distinct values of  A.

    Definition 2 (Reduction factor).   The   reduction factor   of a

    condition C ,  ReductionðCÞ  or  C , is the reduction in size of the relation caused by the execution of a selection clause with

    that condition.

    For example,

    Reductionðl shipdate ¼ 10=01=1998Þ ¼  1

    V aluesðl shipdateÞ

    OR

    l shipdate¼10=01=1998  ¼  1

    lsd :

    ð6ÞFurther,

    l shipdate>10=01=1998  ¼#lsd  

    0 10=01=19980

    lsd :   ð7Þ

    Note that the above definitions assume a uniform distribu-

    tion of attribute values, i.e., each distinct attribute value is

    equally likely to occur in a relation.Now, consider the selection query   l shipdate>10=01=1998. In

    its execution, the index on l shipdate is used to locate the first

    leaf page containing the result. Then, subsequent leaf pages

    are scanned to gather all tuples comprising the result. Hence,the estimated execution time is  log ’l s þlsd ’l

    !l s. Here,

    the term lsd ’l

    !lestimates the number of leaf pages containing

    all tuples that satisfy the query.

    758 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

    Fig. 5. System Configuration parameters.

    Fig. 6. Database Configuration.

    TABLE 1Benchmarked Parameters

    TABLE 2Collected Data Statistics

    TABLE 3Collected Relation-Level Statistics

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    8/14

    Server$SCPU data transfer. The intermediate results from

    query plan execution are often transferred between the

    server and the SCPU. This data transfer occurs in fixed-sized

    pages in a synchronous fashion. If the data to be transferred

    are B  bytes, then the total transfer time is d   B 1024e . Here, 

    is the page size, and hence, d   B 1024e gives the number of pages

    needed to transfer B  bytes. is the time required to transfer a

    single page of size     (see Table 1 and Fig. 6 for values).

    Suppose that we need to transfer the results of the querysumðl quantityÞð l shipdate>10=01=1998 and l linestatus¼‘O0Þ. Then, we

    can estimate the intermediate query result size by multi-

    plying the number of tuples in the query result with the total

    size of the projection operation. The size of the projection is

    simply the sum of the sizes of the individual attributes

    l linestatus   and   l quantity. Hence, the total data transfer

    time for this query is estimated as ðllsþlqt Þlsd ’l

    1024  .

    External sorting.   Since database relations can require

    large amount of storage space external sorting is employed

    whenever the relation(s) to be sorted cannot fit in memory.

    External sorting has been studied extensively [24] and theI/O c ost for a n ext er na l mer ge sor t i s g iv en a s

    2 F   logM 1F , where   F   is the total number of relation

    pages and  M   is the number of pages that can be stored in

    memory (M F ). Using this, we can estimate the executiontime for sorting a relation  r  on the server as

    2   ’r  r1;024

      logs 1;024   1

    ’r  r1;024

    s   ð8Þ

    and the time for the same sort from within the SCPU as

    2   ’r  r1;024

      logt 1;024   1

    ’r  r1;024

    ðs þ Þ:   ð9Þ

    Here, ’r  is the total number of tuples in  r  and   r  is the sizeof an individual tuple.

    3.1.5 Plan Evaluations 

    Case I: Group-By on public attribute.   Fig. 7a shows twoalternative plans for a Group-By operation on the publicattribute  l shipmode. The difference between the two plans(A and B) is whether the grouping is performed by theserver or the SCPU. If it is performed by the server, then thecheap server cycles are utilized. However, if done withinthe SCPU, the SCPU!Server data transfer is reduced, thereduction depending upon the number of distinct values of 

    l shipmode. Table 4 shows the computation of the executiontimes of both Plans A and B and the actual estimation. It isobserved that under the parameters and System catalogdata from Figs. 5 and 6 and Tables 1, 2, and 3, it is more

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 759

    Fig. 7. Query optimization plans for Cases I-IV of Section 3.1.5. Green and Red indicate public and private attributes, respectively.

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    9/14

    efficient to perform the Group By on the server. This is because the selection on   l shipdate has high selectivity andminimizes the data transfer cost. If this selection operationhad low selectivity, then aggregation within the SCPU

    would be less expensive.Case II: Order-By on public and private attributes.   If an

    Order-By clause has a public attribute followed by a privateattribute, the server can first order the intermediate resultson the public attribute leaving the private ordering tothe SCPU (see Fig. 7b, Plan A). Or, the SCPU can process theentire Order-By clause (see Fig. 7b, Plan B). Under thespecific data statistics, Plan A is preferred. The reason forthis is that in Plan A at one time, the SCPU has to ordertuples having the same value for the attribute  l shipmode.The size of this intermediate result being small the sortoperation is more efficient. Note that the SCPU employsexternal sorting. Hence, any reduction in the size of 

    intermediate results directly lowers the Server$SCPUtransfer cost.

    Case III: Distinct clause on public and private attributes.  Adistinct clause can be processed either by using a hash tableor by first sorting the input [24]. Here, we analyze the laterapproach. Similar to the Order-By case the first option (seeFig. 7c, Plan A), here is for the server to sort the intermediateresults on the public attribute   l shipmode   and then havethe SCPU process the distinct clause. The second option(see Fig. 7c, Plan B) is for the SCPU to sort and process thedistinct entirely. As seen from Table 5, the optimizer prefersPlan A. The number of distinct values of   l shipmode play a

    critical role in plan selection here. In the data set,  l shipmodehas a high number of distinct values which means that afterthe sort on server, the SCPU only operates on a smallportion of data at a time. A small number of unique

    l shipmode   values instead would favor Plan B since theadvantage of sorting on the server will be reduced.

    Case IV: Projections.   When the number of projectedattributes in a query is high, the server need not transfer

    all these attributes to the SCPU for evaluation of privateselection operations (see Fig. 7d, Plan B). The alternative(see Fig. 7d, Plan A) is to pass all attributes to the SCPU andthen back to the server after the selection operation withinthe SCPU is performed. However, in Plan B, the serverneeds to perform additional lookups to locate the corre-sponding projected attributes for each tuple in the result set.To optimize the final lookup join, the server first sorts theintermediate results received from the SCPU on the tupleprimary key. This greatly reduces the disk operationsthereby making Plan B more efficient.

    4 EXPERIMENTS

    Setup. The SCPU of choice is the IBM 4764-001 PCI-X withthe 3.30.05 release toolkit featuring 32MB of RAM and aPowerPC 405GPr at 233 MHz. The SCPU sits on the PCI-X bus of an Intel Xeon 3.4 GHz, 4GB RAM Linux box (kernel2.6.18). The server DBMS is a standard MySQL 14.12 Distrib5.0.45 engine. The SCPU DBMS is a heavily modified SQLitecustom port to the PowerPC. The entire TrustedDB stack(see Fig. 4a) is written in C.

    TPC-H query load. To evaluate the runtime of generalizedqueries, we chose several queries from the TPC-H set [2] of 

    varying degrees of difficulty and privacy. The TPC-H scalefactor is 1, i.e., the database size is  1 GB.

    Fig. 8a shows the execution times compared to asimple unencrypted MySQL setup. Fig. 8b also depicts the

    760 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

    TABLE 4Cost Computations for Query Plans Shown in Figs. 7a and 7b Corresponding to Cases I and II

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    10/14

     breakdown of times spent in execution of the public andprivate subqueries. The execution times of private queriesinclude the time required for encryption and decryptionoperations inside the SCPU. The public queries executed onthe host server also include the processing times to interfacethe TrustedDB stack with the server database engine and

    output the final results.As can be seen, when compared with the completelyunsecured baseline scenario, security does not come cheapwith execution times being higher by factors between1.03 and 10. These factors benefit from TrustedDB’sleveraging of the untrusted server’s CPU for nonsensitivequery portions. However, recall from Section 2 that theactual costs are orders of magnitude lower than anysolution based on software-only cryptography on legacyserver hardware.

    Updates.   Fig. 8c shows the latencies for insert andupdate statements. The reported times are for a random

    insert/update of a single tuple in the lineitems relationaveraged over 10 runs.

    Query optimization. In Section 3.1, we presented differentquery plans, analyzed their execution, and showed how theoptimizer computed their execution times. Tables 4 and 5summarized the theoretical costs and estimated execution

    times. To verify whether the plan selected in each of thecases is indeed the best plan, we executed each of the planson the TPC-H data set and measured their execution times.The results are in Fig. 9a. We find that in each of theCases I-IV, the following holds. If  ET  estðP AÞ > ET  estðP B Þ  inTable 4 or 5, then  ET  realðP AÞ > ET  realðP B Þ  in Fig. 9a. For amore detailed evaluation, we compare the estimated andmeasured times for varying selectivity of the publicattribute   l shipdate   in Case I. Note that this selectivitydirectly influences the amount of Server$SCPU datatransfer and thus the overall processing costs. As seen inFig. 9b, the optimizer correctly estimates which plan would

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 761

    Fig. 8. TPC-H query execution times, time profiles, and latencies for DML operations.  Qi  ¼ ith  query from TPC-H [2].

    TABLE 5Cost Computations for Query Plans Shown in Figs. 7c and 7d Corresponding to Cases III and IV

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    11/14

    have lower execution time for most of the cases. Fig. 9cshows the results for very low selectivity of   l shipdate. Atlow selectivity the accuracy of estimation lowers. Thereare two reasons for this: 1) The measured times vary by3:5 ms between runs. Thus, when the estimated times for

    two plans differ by 

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    12/14

    Limitations.   The TrustedDB query parser does not yetsupport parsing of multilevel nested subqueries and userdefined views.

    6 RELATED WORK

    Queries on encrypted data.   Hacigumus et al. [20] proposedivision of data into secret partitions and rewriting of range

    queries over the original data in terms of the resultingpartition identifiers. This balances a tradeoff between clientand server-side processing, as a function of the datasegment size. In [21], the authors explore optimal bucketsizes for range queries.

    Damiani et al. [12] propose using tuple-level encryptionand indexes on the encrypted tuples to support equalitypredicates. The main contribution here is the analysis of attribute exposure caused by query processing leading totwo insights. 1) the attribute exposure increases with thenumber of attributes used in an index, and 2) the exposuredecreases with the increase in database size. Range queries

    are processed by encrypting individual   B þ Tree   nodesand having the client, in each query processing step,retrieve a desired encrypted   B þ Tree   node from theserver, decrypt and process it. However, this leads tominimal utilization of server resources thereby under-mining the benefits of outsourcing. Moreover, transfer of entire   B þ Tree   nodes to the client results in significantnetwork costs.

    Wang and Lakshmanan [44] employs  Order Preservingencryption for querying encrypted XML databases. Inaddition, a technique referred to as splitting and scaling isused to differ the frequency distribution of encrypted data

    from that of the plaintext data. Here, each plaintext value isencrypted using multiple distinct keys. Then, correspond-ing values are replicated to ensure that all encrypted valuesoccur with the same frequency thereby thwarting anyfrequency-based attacks.

    Wang et al. [45] use a salted version of IDA scheme tosplit encrypted tuple data among multiple servers. Inaddition, a secure   B þ Tree   is built on the key attribute.The client utilizes the   B þ Tree   index to determine theIDA matrix columns that need to be accessed for dataretrieval. To speed up client-side processing and reducenetwork overheads, it is suggested to cache parts of theB þ Tree  index client-side.

    Vertical partitioning of relations amongst multiple un-trusted servers is employed in [15]. Here, the privacy goal isto prevent access of a subset of attributes by any singleserver. For example, {Name, Address} can be a privacysensitive access-pair and query processing needs to ensurethat they are not jointly visible to any single server. The clientquery is split into multiple queries wherein each subqueryfetches the relevant data from a server and the clientcombines results from multiple servers. Aggarwal et al. [4]also use vertical partitioning in a similar manner and for thesame privacy goal, but differs in partitioning and optimiza-tion algorithms. TrustedDB is equivalent to both [15], [4]

    when the size of the privacy subset is one and hence a singleserver suffices. In this case, each attribute column needsencryption to ensure privacy [10]. Hence, Ganapathy et al.[15] and Aggarwal et al. [4] can utilize TrustedDB to

    optimize for querying encrypted columns since otherwisethey rely on client-side decryption and processing.

    Ciriani et al. [10] introduce the concept of logicalfragments to achieve the same partitioning effect as in[15], [4] on a single server. A fragment here is simply arelation wherein attributes not desired to be visible in thatfragment are encrypted. TrustedDB (and other solutions)are in effect concrete mechanisms to efficiently query anyindividual fragment from [10]. The work in [10], on theother hand, can be used to determine the set of attributesthat should be encrypted in TrustedDB.

    Ge and Zdonik [16] propose an encryption scheme in atrusted-server model to ensure privacy of data residing ondisk. The FCE scheme designed here is equivalently secureas a block cipher, however, with increased efficiency. Popa[30], like [16], only ensures privacy of data residing on disk.In order to increase query functionality, a layered encryp-tion scheme is used and then dynamically adjusted (byrevealing key to the server) according to client queries.TrustedDB, on the other hand, operates in an untrustedserver model, where sensitive data are protected, both ondisk and during processing.

    Data that are encrypted on disk but processed in clear (inserver memory) as in [16], [30] compromise privacy duringthe processing interval. In [8], the disclosure risks in suchsolutions are analyzed. Canim et al. [8] also propose a newquery optimizer that takes into account both performanceand disclosure risk for sensitive data. Individual data pagesare encrypted by secret keys that are managed by a trustedhardware module. The decryption of the data pages andsubsequent processing is done in server memory. Hence,the goal is to minimize the lifetime of sensitive data and

    keys in server memory after decryption. In TrustedDB,there is no such disclosure risk since decryptions areperformed only within the SCPU.

    Aggregation queries over relational databases is providedin [19] by making use of homomorphic encryption based onPrivacy Homomorphism [33]. The authors in [13] havesuggested that this scheme is vulnerable to a ciphertext onlyattack. Instead, Mykletun and Tsudik [13] propose analternative scheme to perform aggregation queries basedon bucketization [20]. Here, the data owner precomputesaggregate values such as SUM and COUNT for partitionsand stores them encrypted at the server. Although this makes

    processing of certain queries faster, it does not significantlyreduce client side processing.

    Ge and Zdonik [42] discuss executing aggregationqueries with confidentiality on an untrusted server. Dueto the use of extremely expensive homomorphisms [28],[29], this scheme leads to impractically large costs bycomparison, for any reasonable security parameter choices.This is discussed in more detail in Section 2.

    Above solutions are specialized for certain types of queryoperations on encrypted data: [12] for equality predicates,[20], [45], [46] for range predicates, and [19], [42] foraggregation. In TrustedDB, all decryptions are performed

    within the secure confinements of the SCPU, therebyprocessing is done on plaintext data. This removes anylimitation on the nature of predicates that can now beemployed on encrypted attributes including arbitrary user

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 763

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    13/14

    defined functions. We note that certain solutions designedfor a very specific set of predicates can be more efficientalbeit at the loss of functionality.

    Trusted hardware. In [5], SCPUs are used to retrieve X509certificates from a database. However, this only supportskey based lookup. Each record has a unique key and a clientcan query for a record by specifying the key. Smith andSafford [34] uses multiple SCPUs to provide key-basedsearch. The entire database is scanned by the SCPUs toreturn matching records.

    Agrawal et al. [32] implement arbitrary joins by readingthe entire database through the SCPU. Such as approach isclearly not practical for real implementations since it islower bounded by the Server$SCPU bandwidth (10 Mbpsin our setup).

    Chip-Secured Data Access [25] uses a smart card forquery processing and for enforcing access rights. The clientquery is split such that the server performs majority of thecomputation. The solution is limited by the fact that theclient query executing within the smart card cannot

    generate any intermediate results since there is no storageavailable on the card. In follow-up work, GhostDB [27]proposes to embed a database inside a USB key equippedwith a CPU. It allows linking of private data carried on theUSB Key and public data available on a server. GhostDBensures that the only information revealed to a potential spyis the query issued and the public data accessed.

    Both [25] and [27] are subject to the storage limitations of trusted hardware which in turn limits the size of thedatabase and the queries that can processed. In contrast,TrustedDB uses external storage to store the entire databaseand reads information into the trusted hardware as needed

    which enables it to be used with large databases. Moreover,database pages can be swapped out of the trusted hardwareto external storage during query processing.

    In [7], a database engine is proposed inside an SCPU fordata sharing and mining. The SCPU fetches data fromexternal sources using secure jdbc connections. The entiredata are treated as private with queries completelyexecuted inside the coprocessor. We find that using theIBM 4764 for processing queries entirely within the trustedhardware module, without utilizing server CPU cycles, isup to  40   slower than traditional server query processing.This is so even when the trusted hardware has access tothe local server file system using our  Paging Module   (seeSection 3). Hence, using jdbc connections as in [7] can onlyhave higher processing overheads.

    7 CONCLUSIONS

    This paper’s contributions are threefold: 1) the introductionof new cost models and insights that explain and quantifythe advantages of deploying trusted hardware for dataprocessing, 2) the design and development of TrustedDB, atrusted hardware based relational database with full dataconfidentiality and no limitations on query expressiveness,and 3) detailed query optimization techniques in a trusted

    hardware-based query execution model.This work’s inherent thesis is that, at scale, in outsourced

    contexts, computation inside secure hardware processors isorders of magnitude cheaper than equivalent cryptography

    performed on provider’s unsecured server hardware,despite the overall greater acquisition cost of securehardware. We thus propose to make trusted hardware afirst-class citizen in the secure data management arena.Moreover, we hope that cost-centric insights and architec-tural paradigms will fundamentally change the waysystems and algorithms are designed.

    REFERENCES[1]   FIPS PUB 140-2, Security Requirements for Cryptographic Modules,

    http://csrc.nist.gov/groups/STM/cmvp/standards.html#02,2013.

    [2]   TPC-H Benchmark, http://www.tpc.org/tpch/, 2013.[3]   IBM 4764 PCI-X Cryptographic Coprocessor,   http://www-03.

    ibm.com/security/cryptocards/pcixcc/overview.shtml, 2007.[4]   G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K.

    Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, and Y. Xu,“Two Can Keep a Secret: A Distributed Architecture for SecureDatabase Services,”   Proc. Conf. Innovative Data Systems Research(CIDR), pp. 186-199, 2005.

    [5]   A. Iliev and S.W. Smith, “Protecting Client Privacy with TrustedComputing at the Server,”  IEEE Security and Privacy,  vol. 3, no. 2,

    pp. 20-28, Mar./Apr. 2005.[6]   M. Bellare, “New Proofs for NMAC and HMAC: Security Without

    Collision-Resistance,”   Proc. 26th Ann. Int’l Conf. Advances inCryptology, pp. 602-619, 2006.

    [7]   B. Bhattacharjee, N. Abe, K. Goldman, B. Zadrozny, C. Apte, V.R.Chillakuru, and M. del Carpio, “Using Secure Coprocessors forPrivacy Preserving Collaborative Data Mining and Analysis,”Proc. Second Int’l Workshop Data Management on New Hardware(DaMoN ’06), 2006.

    [8]   M. Canim, M. Kantarcioglu, B. Hore, and S. Mehrotra, “BuildingDisclosure Risk Aware Query Optimizers for Relational Data-

     bases,”   Proc. VLDB Endowment,  vol. 3, nos. 1/2, pp. 13-24, Sept.2010.

    [9]   Y. Chen and R. Sion, “To cloud or Not to Cloud?: Musings onCosts and Viability,”   Proc. Second ACM Symp. Cloud Computing(SOCC ’11),  pp. 29:1-29:7, 2011.

    [10]   V. Ciriani, S.D.C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi,and P. Samarati, “Combining Fragmentation and Encryption toProtect Privacy in Data Storage,”   ACM Trans. Information andSystem Security,  vol. 13, no. 3, pp. 22:1-22:33, July 2010.

    [11]   T. Denis,  Cryptography for Developers,   Syngress, 2007.[12]   E. Damiani, C. Vimercati, S. Jajodia, S. Paraboschi, and P.

    Samarati, “Balancing Confidentiality and Efficiency in UntrustedRelational DBMSs,”  Proc. 10th ACM Conf. Computer and Commu-nications Security (CCS ’12),  2003.

    [13]   E. Mykletun and G. Tsudik, “Aggregation Queries in theDatabase-as-a-Service Model,”   Proc. 20th IFIP WG 11.3 WorkingConf. Data and Applications Security,  pp. 89-103, 2006.

    [14]   F.N. Afrati and V. Borkar, and M. Carey, and N. Polyzotis, and J.D.Ullman, “Map-Reduce Extensions and Recursive Queries,”  Proc.14th Int’l Conf. Extending Database Technology (EDBT), pp. 1-8, 2011.

    [15]   V. Ganapathy, D. Thomas, T. Feder, H. Garcia-Molina, and R.Motwani, “Distributing Data for Secure Database Services,”  Proc.Fourth Int’l Workshop Privacy and Anonymity in the Information Soc.(PAIS ’11), pp. 8:1-8:10, 2011.

    [16]  T. Ge and S. Zdonik, “Fast Secure Encryption for Indexing in aColumn-Oriented DBMS,”   Proc. IEEE 23rd Int’l Conf. Data Eng.(ICDE), 2007.

    [17]   R. Gennaro, C. Gentry, and B. Parno, “Non-Interactive VerifiableComputing: Outsourcing Computation to Untrusted Workers,”Proc. 30th Ann. Conf. Advances in Cryptology (CRYPTO ’10),pp. 465-482, 2010.

    [18]   O. Goldreich,   Foundations of Cryptography I.   Cambridge Univ.Press, 2001.

    [19]   B.I.H. Hacigumus and S. Mehrotra, “Efficient Execution of Aggregation Queries over Encrypted Relational Databases,”  Proc.Ninth Int’l Conf. Database Systems for Advanced Applications,

    vol. 2973, pp. 633-650, 2004.[20]  H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra, “Executing SQLover Encrypted Data in the Database-Service-Provider Model,”Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’02),pp. 216-227, 2002.

    764 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

  • 8/9/2019 TrustedDB a Trusted Hardware-Based Database With Privacy

    14/14

    [21]   B. Hore, S. Mehrotra, and G. Tsudik, “A Privacy-Preserving Indexfor Range Queries,”   Proc. 13th Int’l Conf. Very Large Data Bases(VLDB ’04),  2004.

    [22]   Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel,Santa Clara, CA, 2008.

    [23]   M. Kantarcioglu and C. Clifton, “Security Issues in QueryingEncrypted Data,” Proc. 19th Ann. IFIP WG 11.3 Working Conf. Dataand Applications Security (DBSec ’05),  pp. 325-337, 2005.

    [24]   P. Lewis, A. Bernstein, and M. Kifer,   Databases and TransactionProcessing. Addison-Wesley, 2002.

    [25]   L. Bouganim and P. Pucheral, “Chip-Secured Data Access:Confidential Data on Untrusted Server,”   Proc. 28th Int’l Conf.Very Large Data Bases (VLDB ’02),  pp. 131-141, 2002.

    [26]   E. Mykletun and G. Tsudik, “Incorporating a Secure Coprocessorin the Database-as-a-Service Model,”  Proc. Innovative Architectureon Future Generation High-Performance Processors and Systems(IWIA ’05),  pp. 38-44, 2005.

    [27]   N. Anciaux, M. Benzine, L. Bouganim, P. Pucheral, and D. Shasha,“GhostDB: Querying Visible and Hidden Data Without Leaks,”Proc. 26th Int’l ACM Conf. Management of Data (SIGMOD),  2007.

    [28]   P. Paillier, “Public-Key Cryptosystems Based on CompositeDegree Residuosity Classes,”   Proc. 17th Int’l Conf. Theory and

     Application of Cryptographic Techniques (EUROCRYPT ’99),  1999.[29]   P. Paillier, “A Trapdoor Permutation Equivalent to Factoring,”

    Proc. Second Int’l Workshop Practice and Theory in Public Key

    Cryptography (PKC ’99),  pp. 219-222, 1999.[30]   R.A. Popa, C. Redfield, and N. Zeldovich, “Cryptdb: ProtectingConfidentiality with Encrypted Query Processing,”   Proc. 23rd

     ACM Symp. Operating Systems Principles (SOSP ’11), 2011.[31]   M.O. Rabin, “Digitalized Signatures and Public-Key Functions as

    Intractable as Factorization,” technical report, Massachusetts Inst.of Technology, 1979.

    [32]  R. Agrawal, D. Asonov, M. Kantarcioglu, and Y. Li, “Sovereign Joins,” Proc. 22nd Int’l Conf. Data Eng.,  p. 26, 2006.

    [33]  R. Rivest, L. Adleman, and M. Dertouzos, “On Data Banks andPrivacy Homomorphisms,”   Foundations of Secure Computation,Academic, 1978.

    [34]   S.W. Smith and D. Safford, “Practical Server Privacy with SecureCoprocessors,” IBM Systems J.,  vol. 40, no. 3, pp. 683-695, 2001.

    [35]   S. Wu, F. Li, S. Mehrotra, and B.C. Ooi, “Query Optimization forMassively Parallel Data Processing,”   Proc. Second ACM Symp.

    Cloud Computing (CCS ’11),  Article 12, 2011.[36]   S. Agrawal, V. Narasayya, and B. Yang, “Integrating Vertical and

    Horizontal Partitioning into Automated Physical Database De-sign,”   Proc. ACM SIGMOD Int’l Conf. Management of Data(SIGMOD ’04),  pp. 359-370, 2004.

    [37]   S.W. Smith, “Outbound Authentication for Programmable SecureCoprocessors,” http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.4066, 2001.

    [38]   S. Ghandeharizadeh and D.J. DeWitt, “Hybrid-Range PartitioningStrategy: A New Declustering Strategy for MultiprocessorDatabase Machines,”   Proc. 16th Int’l Conf. Very Large Data Bases(VLDB), pp. 481-492, 1990.

    [39]   S. Bajaj and R. Sion, “TrustedDB: A Trusted Hardware BasedDatabase with Privacy and Data Confidentiality,”   Proc. ACMSIGMOD Int’l Conf. Management of Data (SIGMOD ’11),  pp. 205-216, 2011.

    [40]   S. Bajaj and R. Sion, “TrustedDB: A Trusted Hardware BasedOutsourced Database Engine,”   Proc. Int’l Conf. Very Large DataBases (VLDB),  2011.

    [41]   A. Thomson and D.J. Abadi, “The Case for Determinism inDatabase Systems,” Proc. VLDB Endowment, vol. 3, no. 1, pp. 70-80,2010.

    [42]   T. Ge and S. Zdonik, “Answering Aggregation Queries in a SecureSystem Model,” Proc. 33rd Int’l Conf. Very Large Data Bases (VLDB),pp. 519-530, 2007.

    [43]   M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, “FullyHomomorphic Encryption over the Integers,” Proc. 29th Ann. Int’lConf. Theory and Applications of Cryptographic Techniques (EURO-CRYPT ’10),  pp. 24-43, 2010.

    [44]   H. Wang and L.V.S. Lakshmanan, “Efficient Secure QueryEvaluation over Encrypted XML Databases,” Proc. 32nd Int’l Conf.

    Very Large Data Bases (VLDB),  2006.[45]   S. Wang, D. Agrawal, and A.E. Abbadi, “A ComprehensiveFramework for Secure Query Processing on Relational Data in theCloud,”   Proc. Eighth VLDB Int’l Conf. Secure Data Management,pp 52-69 2011

    Sumeet Bajaj   received the bachelor’s degreefrom Pune Institute of Computer Technology in2003 and the master’s degree from Stony BrookUniversity in 2006, where he is currently workingtoward the PhD degree. His industry experienceinvolves building cloud services, financial trad-ing, and ERP systems. His research interestsinclude network security, databases, and dis-tributed and concurrent systems.

    Radu Sion is currently an associate professor ofcomputer science at Stony Brook University. Hisresearch interests include cyber security andefficient computing. He builds systems mainly,but enjoys elegance and foundations, especiallyif of the very rare practical variety. His sponsorsand collaborators include the US NationalScience Foundation, US Army, Northrop Grum-man, IBM, Microsoft, Motorola, NOKIA, andXerox.

    .   For more information on this or any other computing topic,

    please visit our Digital Library at   www.computer.org/publications/dlib.

    BAJAJ AND SION: TRUSTEDDB: A TRUSTED HARDWARE-BASED DATABASE WITH PRIVACY AND DATA CONFIDENTIALITY 765