Trustworthy Keyword Search for Regulatory-Compliant Records Retention

Soumyadeb Mitra
Dept. of Computer Science
University of Illinois at Urbana-Champaign

Windsor W. Hsu
CS Storage Systems Dept.
IBM Almaden Research Center

Marianne Winslett
Dept. of Computer Science
University of Illinois at Urbana-Champaign
ABSTRACT
Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able to provide irrefutable proof and accurate details of past events. Specifically, some form of index is needed for timely access to the records, but unless the index is maintained securely, the records can in effect be hidden or altered, even if stored in WORM storage. In this paper, we systematically analyze the requirements for establishing a trustworthy inverted index to enable keyword-based search queries. We propose a novel scheme for efficient creation of such an index and demonstrate, through extensive simulations and experiments with an enterprise keyword search engine, that the scheme can achieve online update speeds while maintaining good query performance. In addition, we present a secure index structure for multi-keyword queries that supports insert, lookup and range queries in time logarithmic in the number of documents.
1. INTRODUCTION
Documents such as electronic mail, financial statements, meeting memos, drug development logs, and quality assurance documents are valuable assets. Key decisions in business operations and other critical activities are based on information in these documents, so they must be maintained in a trustworthy fashion: safe from improper destruction or modification, and readily accessible. Businesses increasingly store these documents electronically, making them relatively easy to delete and modify without leaving much of a trace.
This research was partially supported by an IBM internship.
This research was supported by NSF under grants IIS-0331707, CNS-0325951, and CNS-0524695.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '06, September 12-15, 2006, Seoul, Korea.
Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09.
Ensuring that records are readily accessible, accurate, credible, and irrefutable is particularly imperative given recent legal and regulatory trends. The US alone has over 10,000 regulations that mandate how records should be managed. Many of those focus on ensuring that records are trustworthy (e.g., Securities and Exchange Commission (SEC) Rule 17a-4 and the Sarbanes-Oxley Act).
This has led to a rush to introduce Write-Once-Read-Many (WORM) storage devices (e.g., [8, 18, 23]) to enable proper records retention. However, storing records in WORM storage is inadequate to ensure that they are trustworthy, i.e., able to provide irrefutable evidence of past events. The volume of records and stringent response time requirements dictate the use of a direct access mechanism such as an index to access the records. Furthermore, the records are likely to be accessed not only during litigation and audits, but also as an integral part of day-to-day business activity: companies prefer to maintain only a single copy of each record if possible, due to the cost of maintaining multiple copies and the need for trustworthy input to business decisions.
If the index through which a record is accessed can be suitably manipulated, the record can, for all practical purposes, be hidden or deleted, even if it is stored in WORM storage. For example, if the index entry pointing to the record is removed, or made to point to a different record, the original record becomes inaccessible. Hence the index itself must be maintained in a trustworthy fashion.
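The attack can be made concrete with a minimal sketch. Everything here is hypothetical and chosen only for illustration: `records` stands in for WORM storage, `index` for a conventional rewritable index, and `lookup` for the search path. The record is never touched, yet it stops being returned once its index entry is removed.

```python
# Hypothetical stand-ins: `records` plays the role of WORM storage,
# `index` the role of an ordinary (rewritable) index over it.
records = {"R1": "memo about earnings restatement"}
index = {"restatement": ["R1"]}

def lookup(index, records, term):
    """Return the records reachable through the index for one term."""
    return [records[record_id] for record_id in index.get(term, [])]

before = lookup(index, records, "restatement")

# The attacker edits only the index; the record itself stays intact.
index["restatement"] = []

after = lookup(index, records, "restatement")
```

After the edit, `after` is empty even though `"R1"` is still present in `records`, which is the sense in which the record is hidden for all practical purposes.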
To address this issue, researchers proposed the concept of a fossilized index, which is impervious to such manipulations. One such index is the Generalized Hash Tree (GHT), which supports exact-match lookups of records based on attribute values and hence is most suitable for use with structured data. However, most business records, such as email, memos, notes, meeting minutes, etc., are unstructured or semi-structured. The natural query interface for these documents is keyword search, where the user provides a list of keywords and receives a list of documents that contain some or all of the keywords. Keyword-based searches are typically handled by an inverted index.
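For readers unfamiliar with the structure, the following is a minimal sketch of an ordinary (not yet trustworthy) inverted index. The function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample documents are assumptions made for illustration, not part of the scheme proposed in this paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of document IDs in increasing order."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return dict(index)

def search(index, keywords, match_all=True):
    """Return IDs of documents containing all (or any) of the keywords."""
    postings = [set(index.get(k.lower(), [])) for k in keywords]
    if not postings:
        return []
    combine = set.intersection if match_all else set.union
    return sorted(combine(*postings))

docs = {
    1: "quarterly earnings memo",
    2: "earnings restatement email",
    3: "meeting minutes",
}
index = build_inverted_index(docs)
```

With these sample documents, `search(index, ["earnings", "memo"])` matches only document 1, while `search(index, ["memo", "minutes"], match_all=False)` matches documents 1 and 3, corresponding to the "all" and "some" cases described above.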
In this paper, we analyze the requirements for a trustworthy index for keyword-based search. We argue that trustworthy index entries must be durable: the index must be updated when new documents arrive, and not periodically deleted and rebuilt. To this end, we propose a scheme for efficiently updating an inverted index, based on judicious merging of the posting lists of terms. Through extensive simulations and experiments with an IBM intranet search engine, we demonstrate that the scheme achieves online update speed while maintaining good query performance. We also present and evaluate jump indexes, a novel trustworthy and efficient index for join operations on posting lists for multi-keyword queries.
The rest of this paper is organized as follows. In Section 2, we discuss the threat model, analyze related work, and derive the key requirements for a trustworthy inverted index. We also propose enhancements to WORM storage to facilitate such an index. In Section 3, we develop the idea of merging posting lists to enable online update of inverted indexes. We present a trustworthy indexing scheme for posting lists in Section 4. In Section 5, we discuss a rank-based attack and propose countermeasures. Section 6 concludes the paper.
2.1 Threat Model
We are concerned with a very specific threat model: a legitimate user Alice creates a document (record R) and commits R to WORM storage through an application. After R has been committed, a user Mala begins to regret its existence. Mala will do everything she can to prevent a future user Bob (e.g., a regulatory authority) from receiving R as the answer to one of his queries.
Coverups are often directed by high-level company insiders (e.g., CEOs and CFOs). To model these attacks, we assume that Mala can take on the identity of any legitimate user or superuser in the system, and perform any action that person can perform. For example, Mala can write any data to the WORM device as long as the write does not overwrite existing data, and she can read any data on the device. This means that we cannot rely on conventional file/storage system access control mechanisms [12, 22] to ensure that documents and indexes are only modifiable by legitimate applications. However, we assume that physical access to the WORM device is restricted or monitored so that Mala cannot steal or destroy it without raising red flags and triggering suspicion and a presumption of guilt. We also assume that Bob is sufficiently cautious that he will check to make sure he is running a certified version of the search engine and operating system, so Mala cannot alter Bob's search engine or redirect Bob's I/O requests at the file system level. Similarly, we trust the document insertion application Alice uses to commit R (i.e., R does reach WORM storage initially), and assume the WORM device operates properly (i.e., it never overwrites data).
No one regrets the existence of R until R is already permanently in WORM storage. Thus Mala's only hope is to keep R out of the index that Bob uses in his search. She can do this by preventing R from ever getting into the index, or by ensuring that R is not in the index Bob uses. This suggests a strategy for us. First, we can ensure that R is entered in the index before Mala regrets R's existence. Second, we can ensure that any data that ever enters the index stays accessible through it forever (or at least for a mandated retention period). In other words, the index should be trustworthy in the sense of it being in WORM storage and never losing any old entries when new entries are added.
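The second requirement can be sketched as a toy model, assuming a write-once store in which each address may be written exactly once. The names `WormStore`, `WormViolation`, and `AppendOnlyPostingList`, and the `(term, slot)` addressing, are hypothetical and serve only to illustrate the property; they do not describe the index structures developed later in the paper.

```python
class WormViolation(Exception):
    """Raised when a write would overwrite committed data."""

class WormStore:
    """Toy write-once store: each address can be written exactly once."""
    def __init__(self):
        self._blocks = {}

    def contains(self, address):
        return address in self._blocks

    def write(self, address, data):
        if address in self._blocks:
            raise WormViolation(f"address {address!r} is already committed")
        self._blocks[address] = data

    def read(self, address):
        return self._blocks[address]

class AppendOnlyPostingList:
    """Posting list kept in consecutive write-once slots (term, 0), (term, 1), ..."""
    def __init__(self, store, term):
        self._store = store
        self._term = term

    def append(self, doc_id):
        # The next free slot is found by probing the store itself, so the
        # list length is never held in separately mutable state.
        slot = 0
        while self._store.contains((self._term, slot)):
            slot += 1
        self._store.write((self._term, slot), doc_id)

    def entries(self):
        found, slot = [], 0
        while self._store.contains((self._term, slot)):
            found.append(self._store.read((self._term, slot)))
            slot += 1
        return found

store = WormStore()
plist = AppendOnlyPostingList(store, "earnings")
plist.append(7)
plist.append(9)

# Mala may add new data, but an attempt to replace an old entry is rejected.
try:
    store.write(("earnings", 0), 99)
    overwrite_blocked = False
except WormViolation:
    overwrite_blocked = True
```

In this model Mala can still append entries of her own, consistent with the threat model above, but the entries for documents 7 and 9 remain reachable through the list.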
To stop Mala from preventing R from entering the index, one approach is to insert R and construct the index entry for R as a single action, because we trust the document insertion code to get R into WORM storage initially. If Mala alters the document insertion code after R is inserted, R will still be on WORM storage, so we do not need to trust the document insertion code once R has been inserted. If Mala alters the index update code, that alteration will take place after R has been entered into the index. Thus we must ensure that the altered index creation code, altered document insertion code, or any other application of Mala's cannot hide R's index entry from the search engine.
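A rough sketch of coupling insertion and indexing in one call is shown below. The class `RecordArchive`, its methods, and the whitespace tokenizer are hypothetical, and the in-memory dictionaries merely stand in for WORM storage; a real system would also need the two writes to be atomic with respect to failures, which this toy does not attempt.

```python
class RecordArchive:
    """Toy archive in which committing a record also indexes it."""
    def __init__(self):
        self._records = {}   # doc_id -> text, written once
        self._postings = {}  # term -> list of doc_ids, only ever appended to

    def commit(self, doc_id, text):
        """Store the record and add its index entries in the same call."""
        if doc_id in self._records:
            raise ValueError(f"record {doc_id!r} is already committed")
        self._records[doc_id] = text
        # Index immediately, before anyone has had time to regret the record.
        for term in sorted(set(text.lower().split())):
            self._postings.setdefault(term, []).append(doc_id)

    def lookup(self, term):
        """Return a copy of the posting list for one term."""
        return list(self._postings.get(term.lower(), []))

archive = RecordArchive()
archive.commit(1, "earnings restatement memo")
archive.commit(2, "meeting memo")
```

Because there is no separate indexing step in this sketch, there is no window between commit and indexing in which a record exists but cannot be found by `lookup`.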
2.2 Storage Model
Magnetic recording currently offers better cost and performance than optical recording. Moreover, while immutability is often specified as a requirement for records, what is required in practice is that the records be term-immutable, i.e., immutable for a specified retention period. Thus almost all recently-introduced WORM storage devices are built atop conventional rewritable magnetic disks, with write-once semantics enforced through software