a security model for full-text file system search in multi

A Security Model for Full-Text File System

Search in Multi-User Environments

Stefan Buttcher and Charles L. A. Clarke

School of Computer Science

University of Waterloo

Waterloo, Ontario, Canada

{sbuettch,claclark}@plg.uwaterloo.ca

Abstract

Most desktop search systems maintain per-user in-dices to keep track of file contents. In a multi-userenvironment, this is not a viable solution, becausethe same file has to be indexed many times, oncefor every user that may access the file, causing bothspace and performance problems. Having a singlesystem-wide index for all users, on the other hand,allows for efficient indexing but requires special se-curity mechanisms to guarantee that the search re-sults do not violate any file permissions.

We present a security model for full-text file sys-tem search, based on the UNIX security model, anddiscuss two possible implementations of the model.We show that the first implementation, based on apostprocessing approach, allows an arbitrary userto obtain information about the content of files forwhich he does not have read permission. The sec-ond implementation does not share this problem.We give an experimental performance evaluationfor both implementations and point out query opti-mization opportunities for the second one.

1 Introduction and Overview

With the advent of desktop and file system searchtools by Google, Microsoft, Apple, and others, effi-cient file system search is becoming an integral com-ponent of future operating systems. These searchsystems are able to deliver the response to a searchquery within a fraction of a second because they in-dex the file system ahead of time and keep an indexthat, for every term that appears in the file system,contains a list of all files in which the term occursand the exact positions within those files (called theterm’s posting list).

While indexing the file system has the obvious ad-vantage that queries can be answered much fasterfrom the index than by an exhaustive disk scan, italso has the obvious disadvantage that a full-text in-

dex requires significant disk space, sometimes morethan what is available. Therefore, it is importantto keep the disk space consumption of the index-ing system as low as possible. In particular, for acomputer system with many users, it is infeasible tohave an individual index for every user in the sys-tem. In a typical UNIX environment, for example,it is not unusual that about half of the file systemis readable by all users in the system. In such asituation, even a single chmod operation – makinga previously private file readable by everybody –would trigger a large number of index update oper-ations if per-user indices were used. Similarly, dueto the lack of information sharing among the indi-vidual per-user indices, multiple copies of the indexinformation about the same file would need to bestored on disk, leading to a disk space consumptionthat could easily exceed that of the original file.

We investigated different desktop search tools, byGoogle1, Microsoft2, Apple3, Yahoo4, and Coper-nic5, and found that all but Apple’s Spotlight main-tain a separate index for every user (Google’s searchtool uses a system-wide index, but this index mayonly be accessed by users with administrator rights,which makes the software unusable in multi-user en-vironments). While this is an unsatisfactory solu-tion because of the increased disk space consump-tion, it is very secure because all file access permis-sions are automatically respected. Since the index-ing process has the same privileges as the user thatit belongs to, security restrictions cannot be vio-lated, and the index accurately resembles the user’sview of the file system.

If a single system-wide index is used instead, thisindex contains information about all files in the filesystem. Thus, whenever the search system pro-

1http://desktop.google.com/2http://toolbar.msn.com/3http://www.apple.com/macosx/features/spotlight/4http://desktop.yahoo.com/5http://www.copernic.com/

1

cesses a search query, care has to be taken thatthe results are consistent with the user’s view ofthe file system. A search result is obviously incon-sistent with the user’s view of the file system if itcontains files for which the user does not have readpermission. However, there are more subtle cases ofinconsistency. In general, we say that the result toa search query is inconsistent with the user’s viewof the file system if some aspect of it (e.g., the or-der in which matching files are returned) dependson the content of files that cannot be read by theuser. Examples of such inconsistencies are discussedin section 5.

An obvious way to address the consistency prob-lem is the postprocessing approach: The same,system-wide index is used for all users, and everyquery is processed in the same way, regardless ofwhich user submitted the query; after the queryprocessor has computed the list of files matchingthe query, all file permissions are checked, and filesthat may not be searched by the user are removedfrom the final result. This approach, which is usedby Apple’s Spotlight search system (see [App05] fordetails), works well for Boolean queries. However,pure Boolean queries are not always appropriate.If the number of files in a file system is large, thesearch system has to do some sort of relevance rank-ing in order to present the most likely relevant filesfirst and help the user find the information he islooking for faster. Usually, a TF/IDF-based (termfrequency / inverse document frequency) algorithmis used to perform this relevance ranking.

In this paper, we present a full-text search secu-rity model. We show that, if a TF/IDF-style rank-ing algorithm is used by the search system, an im-plementation of the security model must not followthe postprocessing approach. If it does, it producessearch results that are inconsistent with the user’sview of the file system. The inconsistencies can beexploited by the user in a systematical way and al-low him to obtain information about the contentof files which he is not allowed to search. Whilewe do not know the exact ranking algorithm em-ployed by Apple’s Spotlight, we conjecture that itis at least in parts based on the TF/IDF paradigm(as TF/IDF-based algorithms are the most popularranking techniques in information retrieval systems)and therefore amenable to the attacks described inthis paper.

After discussing possible attacks on the postpro-cessing approach, we present a second approach tothe inconsistency problem which guarantees that allsearch results are consistent with the user’s view ofthe file system and which therefore does not allow

a user to infer anything about the content of fileswhich he may not search. This safe implementationof the file system search security model is part ofthe Wumpus6 file system search engine. The sys-tem is freely available under the terms of the GNUGeneral Public License.

In the next two sections, we give a brief overviewof previous work on security issues in multi-user en-vironments (section 2) and an introduction to basicinformation retrieval techniques (section 3). Thisintroduction covers the Okapi BM25 relevance rank-ing function (section 3.1) and the structural querylanguage GCL (section 3.2) on which our retrievalframework and the safe implementation of the secu-rity model are based.

In section 4, we present a general file systemsearch security model and define what it meansfor a file to be searchable by a user. Section 5discusses the first implementation of the securitymodel, based on the postprocessing approach de-scribed above. We show how this implementationcan be exploited in order to obtain the total numberof files in the file system containing a certain term.This is done by systematically creating and deletingfiles, submitting search queries to the search sys-tem, and looking at either the relevance scores orthe relative ranks of the files returned by the searchengine.

In section 6, we present a second implementationof the security model. This implementation is im-mune against the attacks described in section 5. Itsperformance is evaluated experimentally in section7 and compared to the performance of the post-processing approach. Opportunities for query opti-mization are discussed in section 8, where we showthat making an almost non-restrictive assumptionabout the indepence of different files allows us tovirtually nullify the overhead of the security mech-anisms in the search system.

2 Related Work

While some research has been done in the areaof high-performance dynamic indexing [BCC94][LZW04], which is also very important for file sys-tem search, the security problems associated withfull-text search in a multi-user environment havenot yet been studied.

In his report on the major decisions in the de-sign of Microsoft’s Tripoli search engine, Pelto-nen [Pel97] demands that “full text indexing mustnever compromise operating or file system security”.

6http://www.wumpus-search.org/

2

However, after this initial claim, the topic is notmentioned again in his paper. Turtle and Flood[TF95] touch the topic of text retrieval in multi-userenvironments, but only mention the special memoryrequirements, not the security requirements.

Griffiths and Wade [GW76] and Fagin [Fag78]were among the first who investigated secu-rity mechanisms and access control in relationaldatabase systems (System R). Both papers studydiscretionary access control with ownership-basedadministration, in some sense similar to the UNIXfile system security model [RT74] [Rit78]. However,their work goes far beyond UNIX in some aspects.For example, in their model it is possible that a usergrants the right to grant rights for file (table) accessto other users, which is impossible in UNIX. Bertinoet al. [BJS95] give an overview of database secu-rity models and access control mechanisms, such asgroup authorization [WL81] and authorization re-vocation [BSJ97].

While our work is closely related to existing re-search in database security, most results are not ap-plicable to the file system search scenario becausethe requirements of a relational database are differ-ent from those of a file search system and becausethe security model of the search system is ratherstrictly predetermined by the security model of theunderlying file system, UNIX in our case, whichdoes not allow most of the operations database man-agement systems support. Furthermore, the opti-mizations discussed in section 8 cannot be realizedin a relational database systems.

3 Information Retrieval Basics

In this section, we give an introduction to basicinformation retrieval techniques. We first explainOkapi BM25, one of the most popular relevanceranking techniques, and then present the structuralquery language GCL which can be used to expressqueries like

Find all documents in which “mad” and“cow” occur within a distance of 3 wordsfrom each other.

We also show how BM25 and GCL can be combinedin order to compute collection term statistics on thefly. We chose GCL as the underlying query languagebecause it offers very light-weight operators to de-fine structural query constraints.

3.1 Relevance Ranking: TF/IDF and theOkapi BM25 Scoring Function

Most relevance ranking functions used in today’sinformation retrieval systems are based on the vec-tor space model and the TF/IDF scoring paradigm.Other techniques, such as latent semantic indexing[DDL+90] or Google’s pagerank [PBMW98], do ex-ist, but cannot be used for file system search:

• Latent semantic indexing is appropriate for in-formation retrieval purposes, but not for theknown-item search task associated with file sys-tem search; users are searching for the exactoccurrence of query terms in files, not for se-mantic concepts.

• Pagerank cannot be used because there are usu-ally no cross-references between the files in a filesystem.

Suppose a user sends a query to the search sys-tem requesting certain pieces of data matching thequery (files in our scenario, documents in traditionalinformation retrieval). The vector space model con-siders the query and all documents in the text col-lection (files in the file system) vectors in an n-dimensional vector space, where n is the numberof different terms in the search system’s vocabulary(this essentially means that all terms are consideredindependent). The document vectors are ranked bytheir similarity to the query vector, for example theangle between a document and the query vector.

TF/IDF (term frequency, inverse document fre-quency) is one possible way to define the similaritybetween a document and the search query. It meansthat a document is more relevant if it contains moreoccurrences of a query term (has greater TF value),and that a query term is more important if it occursin fewer documents (has greater IDF value). Oneof the most prominent TF/IDF document scoringfunctions is Okapi BM25 [RWJ+94] [RWHB98].

A BM25 query is a set of (T, qT ) pairs, where T isa query term and qT is T ’s within-query weight. Forexample, when a user searches for “full text search”(not as a phrase, but as 3 individual terms), thisresults in the BM25 query

Q := {(“full”, 1), (“text”, 1), (“search”, 1)}.

Given a query Q and a document D, the docu-ment’s BM25 relevance score is:

s(D) =∑

(T,qT )∈Q

qT · wT · dT · (1 + k1)

dT + k1 · ((1 − b) + b · dlavgdl

), (1)

where dT is the number of occurrences of the termT within D, dl is the length of the document D

3

(number of tokens), and avgdl is the average docu-ment length in the system. The free parameters areusually chosen as k1 = 1.2 and b = 0.75. wT is theIDF weight of the query term T :

wT = log(|D|

|DT |), (2)

where D is the set of all documents in the text col-lection and DT is the set of all documents containingT . In our scenario, the documents are files, and weconsequently denote D as F in the following sec-tions.

3.2 Structural Queries: The GCL QueryLanguage

The GCL (generalized concordance lists) query lan-guage proposed by Clarke et al. [CCB95] supportsstructural queries of various types. We give a briefintroduction to the language because our safe imple-mentation of the security model is based on GCL.

GCL assumes that the entire text collection (filecontents) is indexed as a continuous stream of to-kens. There is no explicit structure in this tokenstream. However, structural components, such asfiles or directories, can be added implicitly be in-serting <file> and </file> tags (or <dir> and</dir>) into the token stream.

A GCL expression evaluates to a set of index ex-tents (i.e., [start, end] intervals of index positions).This is done by first producing an operator tree fromthe given GCL expression and then repeatedly ask-ing the root node of the operator tree for the nextmatching index extent after the one that was seenlast, until there are no more such extents.

GCL’s shortest substring paradigm demands thatif two nested index extents satisfy a query condition,only the inner extent is part of the result set. Thisrestriction limits the number of possible results toa query by the size of the text collection, whereaswithout it there could be

(

n2

)

∈ Θ(n2) result extentsfor a text collection of size n. An example of howthe application of the shortest substring rule affectsthe result to a GCL query is shown in Figure 2.

The original GCL framework supports the follow-ing operators, which are only informally describedhere. Assume E is an index extent and A and B

are GCL expressions. Then:

• E matches (A∧B) if it matches both A and B;

• E matches (A ∨ B) if it matches A or B;

• E matches (A · · ·B) if it has a prefix matchingA and a suffix matching B;

GCL query:

(<doc> · · · </doc>) B ( (mad ∧ cow) C [3] )

Operator tree:

B

· · ·

<doc> </doc>

C

∧

mad cow

[3]

Figure 1: GCL query and resulting operator tree forthe example query: Find all documents in which “mad”

and “cow” occur within a distance of 3 words from each

other.

• E matches (ABB) if it matches A and containsan extent E2 matching B;

• E matches (A 6BB) if it matches A and doesnot contain an extent matching B;

• E matches (A C B) if it matches A and is con-tained in an extent E2 matching B;

• E matches (A 6CB) if it matches A and is notcontained in an extent matching B;

• [n] returns all extents of length n (i.e. all n-term sequences).

We have augmented the original GCL framework bytwo additional operators that let us compute collec-tion statistics:

• #(A) returns the number of extents that matchthe expression A;

• length(A) returns the sum of the lengths of allextents that match A.

Throughout this paper, we use the same notationfor both a GCL expression and the set of index ex-tents that are represented by the expression.

3.3 BM25 and GCL

With the two new operators introduced in section3.2, GCL can be employed to calculate all TF andIDF values that the query processor needs duringa BM25 ranking process. Suppose the set of alldocuments inside the text collection is given by theGCL expression

<doc> · · · </doc>,

4

(1) (2)

(3)

(4)

GCL query:

( to ∧ be )

To be, or not to be, that is the question.

For the given query, only the index extents (1),

(2), and (3) are returned. (4) matches the query,

but contains a substring that also matches the

query. Thus, returning (4) would violate the

shortest substring rule.

Figure 2: An example GCL query and the effect of theshortest substring rule: Find all places where “to” and

“be” co-occur.

denoted as documents, and and the documentwhose score is to be calculated is D. Then D’s rel-evance score is:

∑

T∈Q

wT ·#(T C D) · (1 + k1)

#(T C D) + k1 · ((1 − b) + b · dlavgdl

), (3)

where

wT = log

(

#(documents)

#((documents) B T )

)

,

dl = length(D), and

avgdl =length(documents)

#(documents).

This is very convenient because it allows us to com-pute all collection statistics necessary to rank thesearch results on the fly. Thus, integrating the nec-essary security restrictions into the GCL query pro-cessor in such a way that no file permissions are vi-olated, automatically guarantees consistent searchresults for relevance queries. We will use this prop-erty in section 6.

4 A File System Search Security

Model

In section 1, we have used the term searchable torefer to a file whose content is accessible by a userthrough the search system. In this section, we givea definition of what it means for a file to be search-able by user. Before we do so, however, we have tobriefly revisit the UNIX security model, on which

Table 1: File permissions in UNIX (example). Owner

permissions override group permissions; group permis-sions override others. Access is either granted or rejectedexplicitly (in the example, a member of the group wouldnot be allowed to execute the file, even though every-body else is).

owner group others

Read x x

Write x

eXecute x x

our security model is based, and discuss the tradi-tional UNIX search paradigm.

The UNIX security model [RT74] [Rit78] is anownership-based model with discretionary accesscontrol that has been adopted by many operatingsystems. Every file is owned by a certain user. Thisuser (the file owner) can associate the file with acertain group (a set of users) and grant access per-missions to all members of that group. He can alsogrant access permissions to the group of all usersin the system (others). Privileges granted to otherusers can be revoked by the owner later on.

Extensions to the basic UNIX security model,such as access control lists [FA88], have been im-plemented in various operating systems (e.g., Win-dows, Linux), but the simple owner/group/othersmodel is still the dominant security paradigm.

UNIX file permissions can be represented as a 3×3 matrix, as shown in Table 1. When a user wants toaccess a file, the operating system searches from leftto right for an applicable permission set. If the useris the owner of the file, the leftmost column is taken.If the user is not the owner, but member of thegroup associated with the file, the second column istaken. Otherwise, the third column is taken. Thiscan, for instance, be used to grant read access to allusers in the system except for those who belong toa certain group.

File access privileges in UNIX fall into three dif-ferent categories: Read, Write, and eXecute. Writepermissions can be ignored for the purpose of thispaper, which does not deal with file changes. Thesemantics of the read and execute privileges are dif-ferent depending on whether they are granted for afile or a directory. For files,

• the read privilege entitles a user to read thecontents of a file;

• the execute privilege entitles a user to run thefile as a program.

For directories,

5

• the read privilege allows a user to read the di-rectory listing, which includes file names andattributes of all files and subdirectories withinthat directory;

• the execute privilege allows a user to access filesand subdirectories within the directory.

In the traditional find/grep paradigm, a file canonly be searched by a user if

1. the file can be found in the file system’s direc-tory tree and

2. its content may be read by the user.

In terms of file permissions, the first conditionmeans that there has to be a path from the file sys-tem’s root directory to the file in question, and theuser needs to have both the read privilege and theexecute privilege for every directory along this path,while the second condition requires the user to havethe read privilege for the actual file in question. Thesame rules are used by slocate7 to decide whethera matching file may be displayed to the user or not.

While these rules seem to be appropriate in manyscenarios, they have one significant shortcoming: Itis not possible to grant search permission for a singlefile without revealing information about other filesin the same directory. In order to make a file search-able by other users, the owner has to give them theread privilege for the file’s parent directory, whichreveals file names and attributes of all other fileswithin the same directory.

A possible solution to this problem is to relax thedefinition of searchable and only insist that there isa path from the file system root to the file in ques-tion such that the user has the execution privilegefor every directory along this path. Unfortunately,this conflicts with the traditional use of the readand execution privileges, in which this constellationis usually used to give read permission to all userswho know the exact file name of the file in question(note that, even without read permission for a di-rectory, a user can still access all files in it; he justcannot ls for them). While we think this not as biga problem as the make-the-whole-directory-visibleproblem above, it still is somewhat unsatisfactory.

The only completely satisfying solution would bethe introduction of an explicit fourth access privi-lege, the search privilege, in addition to the existingthree. Since this is very unlikely to happen, as itwould probably break most existing UNIX software,we base our definition of searchable on a combina-tion of read and execute. A file is searchable by auser if and only if

7http://www.geekreview.org/slocate/

1. there is a path from the root directory to thefile such that the user has the execute privilegefor all directories along this path and

2. the user has the read privilege for the file.

Search permissions can be granted and revoked, justas any other permission types, be modifying the re-spective read and execute privileges.

While our security model is based on the sim-ple owner/group/other UNIX security model, it caneasily be extended to other security models, suchas access control lists, as long the set of privileges(R, W, X) stays the same, because it only requiresa user to have certain privileges and does not makeany assumptions about where these privileges comefrom.

This is our basic security model. In order to fullyimplement this model, a search system must notdeliver query results that depend on files that arenot searchable by the user who submitted the query.Two possible implementations of the model are dis-cussed in the following sections. The first implemen-tation does not meet this additional requirement,while the second does.

It should be noted at this point that in mostUNIX file systems the content of a file is actuallyassociated with an i-node instead of the file itself,and there can be multiple files referring to the samei-node. This is taken into account by our search en-gine by assuming an i-node to be searchable if andonly if there is at least one hard link to that i-nodesuch that the above rules hold for the link.

5 A First Implementation of the Se-

curity Model and How to Exploit

It: The Postprocessing Approach

One possible implementation of the security modelis based on the postprocessing approach describedin section 1. Whenever a query is processed, system-wide term statistics (IDF values) are used to rank allmatching files by decreasing similarity to the query.This is always done in the same way, regardless ofwhich user sent the search query. After the rankingprocess has finished, all files for which the user doesnot have search permission (according to the rulesdescribed in the previous section) are removed fromthe final list of results.

Using system-wide statistics instead of user-specific data suggests itself because it allows thesearch system to precompute and store all IDF val-ues, which, due to storage space requirements, is notpossible for per-user IDF values in a multi-user envi-ronment. Precomputing term statistics is necessary

6

for various query optimization techniques [WL93][PZSD96].

In this section, we show how this approach can beexploited to calculate (or approximate) the numberof files that contain a given term, even if the usersending the query does not have read permissionsfor those files. Depending on whether the searchsystem returns the actual BM25 file scores to theuser or only a ranked list of files, without any scores,it is either possible to compute exact term statisticsor approximate them.

One might argue that revealing to an unautho-rized user the number of files that contain a certainterm T1 is only a minor problem. We disagree. It isa major problem, and we give two example scenar-ios in which the ability to infer term statistics canhave disastrous effects on file system security:

• An industrial spy knows that the company heis spying on is developing a new chemical pro-cess. He starts monitoring term frequencies forcertain chemical compounds that are likely tobe involved in the process. After some time,this will have given him enough information totell which chemicals are actually used in theprocess – without reading any files.

• The search system can be used as a covert chan-nel to transfer information from one user ac-count to another, circumventing security mech-anisms like file access logging.

Throughout this section we assume that the num-ber of files in the system is sufficiently large so thatthe addition of a single file does not modify the col-lection statistics significantly. This assumption isnot necessary, but it simplifies the calculations.

5.1 Exploiting BM25 Relevance Scores

Suppose the search system uses a system-wide indexand implements Okapi BM25 to perform relevanceranking on files matching a search query. After allfiles matching a user’s query have been ranked byBM25, all files that may not be searched by theuser are removed from the list. The remaining files,along with their relevance scores, are presented tothe user.

We will determine the total number of files in thefile system containing the term T1 by computing thevalues of the unknown parameters avgdl and |F| inthe BM25 scoring function, as shown in equation(1). We start with |F|, the number of files in thesystem. For this purpose, we generate two randomterms T2 and T3 that do not appear in any file. Wethen create three files F1, F2, and F3:

• F1 contains only the term T2;

• F2 consists of two occurrences of the term T2;

• F3 contains only the term T3.

Now, we send two queries to the search engine: {T2}and {T3}. For the former, the engine returns F1 andF2; for the latter, it returns F3. For the scores ofF1 and F3, we know that

score(F1) =(1 + k1) · log( |F|

2 )

1 + k1 · ((1 − b) + bavgdl

)(4)

and

score(F3) =(1 + k1) · log( |F|

1 )

1 + k1 · ((1 − b) + bavgdl

)(5)

Dividing (4) by (5) results in

score(F1)

score(F3)=

log( |F|2 )

log(|F|)(6)

and thus

|F| = 2(

score(F3)

score(F3)−score(F1)). (7)

Now that we know |F|, we proceed and computethe only remaining unknown, avgdl. Using equation(5), we obtain

avgdl =b

X − 1 + b, (8)

where

X =(1 + k1) · log(|F|) − score(F3)

score(F3) · k1. (9)

Since now we know all parameters of the BM25scoring function, we create a new file F4 which con-tains the term T1 that we are interested in, andsubmit the query {T1}. The search engine returnsF4 along with score(F4). This information is usedto construct the equation

score(F4) =(1 + k1) · log( |F|

|FT1 |)

1 + k1 · ((1 − b) + bavgdl

), (10)

in which |FT1 | is the only unknown. We can there-fore easily calculate its value and after only twoqueries know the number of files containing T1:

|FT1 | = |F| · (1

2)score(F4)·Y , (11)

where

Y =1 + k1 · ((1 − b) + b

avgdl)

1 + k1. (12)

7

To avoid small changes in F and avgdl, the new fileF4 can be created before the first query is submittedto the system.

While this particular technique only works forBM25, similar methods can be used to obtain termstatistics for most TF/IDF-based scoring functions.

5.2 Exploiting BM25 Ranking Results

An obvious countermeasure against this type of at-tack is to restrict the output a little further and notreturn relevance scores to the user. We now showthat even if the response does not contain any rel-evance scores, it is still possible to compute an ap-proximation of |FT1 |, or even its exact value, fromthe order in which matching files are returned by thequery processor. The accuracy of the approximationobtained depends on the number of files Fmax thata user may create. We assume Fmax = 2000.

The basic observation is that most interestingterms are infrequent. This fact is used by the follow-ing strategy: After we have created a single file F0,containing only the term T1, we generate a unique,random term T2 and create 1000 files F1 . . . F1000,each containing the term T2. If in the responseto the query {T1, T2} the file F0 appears beforeany of the other files (F1 . . . F1000), we know that|DT1 | ≤ 1000 and can perform a binary search, vary-ing the number of files containing T2, to computethe exact value of |FT1 |.

If instead F0 appears after the other files(F1 . . . F1000), at least we know that |FT1 | ≥ 1000.It might be that this information is enough evidencefor our purpose. However, if for some reason weneed a better approximation of |FT1 |, we can achievethat, too.

We first delete all files we have created so far.We then generate a second random term T3 andcreate 1,000 files (F ′

1 . . . F ′1000), each containing the

two terms T2 and T3. We generate a third randomterm T4 and create 999 files (F ′

1001 . . . F ′1999) each

of which contains T4. We finally create a last fileF ′

0 that contains the two terms T1 (the term we areinterested in) and T4.

After we have created all the files, we submit thequery {T1, T2, T3, T4} to the search system. The rel-evance scores of the files F ′

0 . . . F ′1000 are:

score(F ′0) = C · (log(

|F|

|FT1 |) + log(

|F|

|FT4 |)), (13)

because F ′0 contains T1 and T4, and

score(F ′i ) = C · (log(

|F|

|FT2 |) + log(

|F|

|FT3 |)) (14)

(for 1 ≤ i ≤ 1000), because all the F ′i contain T2

and T3. The constant C, which is the same forall files created, is the BM25 length normalizationcomponent for a document of length 2:

C =1 + k1

1 + k · ((1 − b) + 2·bavgdl

). (15)

We now subsequently delete one of the filesF ′

1001 . . . F ′1999 at a time, starting with F ′

1999, un-til score(F ′

0) ≥ score(F ′1) (i.e. F ′

0 appears beforeF ′

1 in the list of matching files). Let us assume thishappens after d deletions. Then we know that

log(|F|

|FT1 |) + log(

|F|

1000− d) (16)

≥ 2 · log(|F|

1000) (17)

≥ log(|F|

|FT1 |) + log(

|F|

1000− d + 1) (18)

and thus

− log(|FT1 |) − log(1000− d) (19)

≥ −2 · log(1000) (20)

≥ − log(|FT1 |) − log(1000− d + 1), (21)

which implies

10002

1000− d≥ |FT1 | ≥

10002

1000− d + 1. (22)

If |FT1 | = 11000, for example, this technique wouldgive us the following bounds:

10990 ≤ |FT1 | ≤ 11111.

The relative error here is about 0.5%. Again, binarysearch can be used to reduce the number of queriesnecessary from 1000 to around 10.

If it turns out that the approximation obtainedis not good enough (e.g., if |FT1 | > 40000, wehave a relative error of more than 2%), we re-peat the process, this time with more than 2 termsper file. For |FT1 | = 40001 and 3 terms per file,for instance, this would give us the approximation39556 ≤ |FT1 | ≤ 40057 (relative error: 0.6%) in-stead of 40000 ≤ |FT1 | ≤ 41666 (relative error:2.1%).

Thus, we have shown a way to systematicallycompute a very accurate approximation of the num-ber of files in the file system containing a given term.Hence, if the postprocessing approach is taken toimplement the search security model, it is possiblefor an arbitrary user to obtain information aboutthe content of files for which he does not have readpermission – by simply looking at the order in whichfiles are returned by the search engine.

8

5.3 More Than Just Statistics

The above methods can be used to calculate thenumber of files that contain a particular term.While this already is undesirable, the situationis much worse if the search system allows rele-vance ranking for more complicated queries, suchas boolean queries and phrases.

If, for instance, the search system allows phrasequeries of arbitrary length, then it is possible to usethe search system to obtain the whole content ofa file. Assume we know that a certain file containsthe phrase “A B C”. We then try all possible termsD and and calculate the number of files which con-tain “A B C D” until we have found a D that givesa non-zero result. We then continue with the nextterm E. This way, it is possible to construct the en-tire content of a file using a finite number of searchqueries (although this might take a long time). Asimple n-gram language model [CGHH91] [GS95]can be used to predict the next word and thus in-crease the efficiency of this method significantly.

6 A Second Implementation of the

Security Model: Query Integration

In this section, we describe how structured queriescan be used to apply security restrictions to searchresults by integrating the security restrictions intothe query processing instead of applying them in apostprocessing step. This implementation of the filesystem search security model is part of the Wumpussearch system.

The general layout of the retrieval system isshown in Figure 3. A detailed description is given by[BC05]. All queries (Boolean and relevance queries)are handled by the query processor module. In thecourse of processing a query, it requests posting listsfrom the index manager. Every posting list that issent back to the query processor has to pass the se-curity manager, which applies user-specific restric-tions to the list. As a result, the query processoronly sees those parts of a posting list that lie withinfiles that are searchable by the user who submittedthe query. Since the query processor’s response issolely dependent on the posting lists it sees, the re-sults are guaranteed to be consistent with the user’sview of the file system.

We now explain how security restrictions are ap-plied within the security manager. In our implemen-tation, every file in the file system is represented byan index extent satisfying the GCL expression

<file> · · · </file>.

Sub-Index 1

Index Manager

Sub-Index 2 Sub-Index n.....

Security Manager

Query Processor

User 1

User 2 .....

User m

User m-1

Figure 3: General layout of the Wumpus search system.The index manager maintains multiple sub-indices, onefor every mount point. Whenever the query processorrequests a term’s posting list, the index manager com-bines the sub-lists from all indices into one large list andpasses it to the security manager which applies user-specific security restrictions.

Whenever the search engine receives a query from auser U , the security manager is asked to compute alist FU of all index extents that correspond to fileswhose content is searchable by U (using our securitymodel’s definition of searchable). FU represents theuser’s view of the file system at the moment whenthe search engine received the query. Changes tothe file system taking place while the query is beingprocessed are ignored; the same list FU is used toprocess the entire query.

While the query is being processed, every timethe query processor asks for a term’s posting list(denoted as PT ), the index manager generates PT

and passes it to the security manager, which pro-duces

P(U)T ≡ (PT C FU ),

the list of all occurrences of T within files searchableby U . The operator tree that results from addingthese security restrictions to a GCL query is shownin Figure 4. Their effect on query results is shownin Figure 5.

Since P(U)T is all the query processor ever sees,

it is impossible for it to produce query results thatdepend on the content of files that are not search-able by U . Using the equations from section 3.3,all statistics necessary to perform BM25 relevanceranking can be generated from the user-specificposting lists, making it impossible to infer system-wide term statistics from the order in which match-ing files are returned to the user.

9

Figure 5: Query results for two example queries (“<doc> · · · </doc>” and “mad ∧ cows”) with security restrictionsapplied. Only postings from files that are searchable by the user are considered by the query processor.

GCL query:

(<doc> · · · </doc>) B ( (mad ∧ cow) C [3] )

Operator tree:B

· · ·

C

<doc> FU

C

</doc> FU

C

∧

C

mad FU

C

cow FU

[3]

Figure 4: Integrating security restrictions into the queryprocessing – GCL query and resulting operator tree withsecurity restrictions applied to all posting lists.

Because all operators in the GCL framework sup-port lazy evaluation, it is not necessary to apply thesecurity restrictions to the entire posting list whenonly a small portion of the list is used to process aquery. This is important for query processing per-formance.

It is worth pointing out that this implementa-tion of the security model has the nice propertythat it automatically supports index update opera-tions. When a file is deleted from the file system,this file system change has to be reflected by theindex immediately. Without a security model, ev-ery file deletion would either require an expensivephysical update of the internal index structures, or apostprocessing step would be necessary in which allquery results that refer to deleted files are removedfrom the final list of results [CH98]. The postpro-cessing approach would have the same problems asthe one described in section 5: It would use termstatistics that do not reflect the user’s actual viewof the file system. With our implementation of thesecurity model, file deletions are automatically sup-ported because the

<file> · · · </file>

extent associated with the deleted file is removed

from the security manager’s internal representationof the file system. This way, it is possible to keep theindex up-to-date at minimal cost. Updates to theactual index data, which are very expensive, may bedelayed and applied in batches. A more thoroughdiscussion of index updates and their connection tosecurity mechanisms is given by [BC05].

One drawback of our current implementation isthat, in order to efficiently generate the list of indexextents representing all files searchable by a givenuser, the security manager needs to keep some infor-mation about every indexed i-node in main memory.This information includes the i-nodes start and endaddress in the index address space, owner, permis-sions, etc. and comes to a total of 32 bytes peri-node. For file systems with a few million index-able files, this can become a problem. Keeping thisinformation on disk, on the other hand, is not asatisfying solution, either, since it would make sub-second query times impossible.

7 Performance Evaluation

We evaluated the performance of both implemen-tations of the security model – postprocessing andquery integration – using a text collection knownas TREC4+5-CR (TREC disks 4 and 5 withoutthe Congressional Record). This collection contains528,155 documents, which we split up into 528,155different files in 5,282 directories. The index forthis 2-GB text collection, with full positional infor-mation, requires about 615 MB, including 73 MBthat are consumed by the search system’s internalrepresentation of the directory tree, comprising filenames for all files.

As query set, we used Okapi BM25 queries thatwere created by taking the 100 topics employedin the TREC 2003 Robust track and removing allstop words (using a moderately-sized set of 80 stopwords). The original topics read like:

Identify positive accomplishments of theHubble telescope since it was launched in1991.

The topics were translated into queries that could

10

150

200

250

300

350

400

450

500

550

10 20 30 40 50 60 70 80 90 100

Ave

rage

tim

e pe

r qu

ery

(ms)

Percentage of files searchable by user

Query performance for different security restriction mechanisms

Postprocessing approachQuery integration

150

200

250

300

350

400

450

500

550

10 20 30 40 50 60 70 80 90 100

Ave

rage

tim

e pe

r qu

ery

(ms)



Postprocessing approachQuery integration

(a) Without cache effects: All posting lists have to befetched from disk.

(b) With cache effects: All posting lists are fetchedfrom the disk cache.

Figure 6: Performance comparison – query integration using GCL operators vs. postprocessing approach. When thenumber of files searchable is small, query integration is more efficient than postprocessing because relevance scoresfor fewer documents have to be computed.

by parsed and executed by our retrieval system:

@rank[bm25] "<doc>".."</doc>" by

"positive", "accomplishments", "hubble",

"telescope", "since", "launched", "1991"

On average, a query contained 8.7 query terms,which is significantly more than the 2.2 terms foundin an average web search query [JSBS98]. Nonethe-less, our system can execute the queries in well be-low a second on the TREC4+5-CR text collectionused in our experiments.

We ran several experiments for different percent-ages of searchable files in the entire collection. Thiswas done by changing the file permissions of all filesin the collection between two experiments, makinga random p% readable and the other files unread-able. This way, we are able to see how the relativenumber of files searchable by the user submittingthe search query affects the relative performance ofpostprocessing and query integration.

All experiments were conducted on a PC basedon an AMD Athlon64 3500+ with 2 GB of mainmemory and a 7,200-rpm SATA hard drive.

The results depicted in Figure 6 show that theperformance of the second implementation (queryintegration) is reasonably close to that of the post-processing approach. Depending on whether thetime that is necessary to fetch the postings for thequery terms from disk is taken into account or not,the slowdown is either 54% (Figure 6(a)) or 74%(Figure 6(b)) – when 100% of the files in the in-dex are searchable by the user submitting the query.Performance figures for both the cached and the un-cached case are given because, in a realistic environ-ment, system behavior is somewhere between thesetwo extremes.

As the number of searchable files is decreased,query processing time drops for the query integra-tion approach, since fewer documents have to be ex-amined and fewer relevance scores have to be com-puted, but remains constant for the postprocess-ing approach. This is, because in the postprocess-ing approach, the only part of the query process-ing is the postprocessing, which requires very littletime compared to running BM25 on all documents.As a consequence, query integration is 18%/36%(uncached/cached) faster than postprocessing whenonly 10% of the files are searchable by the user.

8 Query Optimization

Although our GCL-based implementation of the se-curity model does not exhibit an excessively de-creased performance, it is still noticeably slowerthan the postprocessing approach if more than 50%of the files can be searched by the user (22-30%slowdown when 50% of the files are searchable).The slowdown is caused by applying the securityrestrictions (. . . C FU ) not only to every queryterm but also to the document delimiters (<doc>and </doc>). Obviously, in order to guarantee con-sistent query results, it is only necessary to applyto apply them to either the documents (in whichcase the query BM25 function will ignore all oc-currences of query terms that lie outside searchabledocuments) or the query terms (in which case un-searchable documents would not contain any queryterms and therefore receive a score of 0).

However, this optimization would be very specificto the type of the query (TF/IDF relevance rank-ing). More generally, we can see the equivalence of

11

Figure 7: Query results for two example queries (“<doc> · · · </doc>” and “mad ∧ cows”) with revised securityrestrictions applied. Even though <doc> in file 2 and </doc> in file 4 are visible to the query processor, they are notconsidered valid query results, since they are in different files.

GCL query:

(<doc> · · · </doc>) B ( (mad ∧ cow) C [3] )

Operator tree:C

B

· · ·

<doc> </doc>

C

∧

mad cow

B

[3] FU

FU

Figure 8: GCL query and resulting operator tree withoptimized security restrictions applied to the operatortree.

the following three GCL expressions:

((E1 C FU ) B (E2 C FU )),

((E1 C FU ) B E2), and

(E1 B E2) C FU ,

where the first expression is the result of the im-plementation from section 6 when run on the GCLexpression

(E1 B E2).

The three expressions are equivalent because if anindex extent E is contained in another index extentE′, and E′ is contained in a searchable file, then E

has to be contained in a searchable file as well.If we make the (not very constrictive assumption)

that every index extent produced by one of the GCLoperators has to lie completely within a file search-able by the user that submitted the query, then weget additional equivalences:

((E1 C FU ) ∧ (E2 C FU )) ≡ ((E1 ∧ E2) C FU ),

((E1 C FU ) ∨ (E2 C FU )) ≡ ((E1 ∨ E2) C FU ),

((E1 C FU ) · · · (E2 C FU )) ≡ ((E1 · · ·E2) C FU ),

and so on. Limiting the list of extents returned bya GCL operator to those extents that lie entirely

within a single file conceptually means that all filesare completely independent. This is not an unre-alistic assumption, since index update operationsmay be performed in an arbitrary order when pro-cessing events associated with changes in the filesystem. Thus, no ordering of the files in the indexcan be guaranteed, which renders extents spanningover multiple files somewhat useless. The effect thatthis assumption has on the query results is shownin Figure 7.

Note that if we did not make the file indepen-dence assumption, then the right-hand side of theabove equivalences would be more restrictive thanthe left-hand side (in the case of “∨”, for example,the right-hand side mandates that both extents liewithin the same searchable file, whereas the left-hand side only requires that both extents lie withinsearchable files). If we make the assumption, thenin all the cases shown above we can freely decidewhether the security restrictions should be appliedat the leaves of the operator tree or whether theyshould be moved up in the tree in order to achievebetter query performance.

The only GCL operator that does not allow thistype of optimization is the contained-in operator.The GCL expression

((E1 C FU ) C (E2 C FU ))

is not equivalent to

(E1 C E2) C FU ,

since in the second expression E2 can refer to some-thing outside the searchable files without the secu-rity restriction operator (CFU ) “noticing” it. Thiswould allow a user to infer things about terms out-side the files searchable by him, so we cannot movethe security restrictions up in the operator tree inthis case.

At this point, it is not clear to us which operatorarrangement leads to optimal query processing per-formance. Therefore, we follow the simple strategyof moving the security restrictions as far to the topof the of the operator tree as possible, as shown in

12

100

150

200

250

300

350

400

450

500

550

10 20 30 40 50 60 70 80 90 100

Ave

rage

tim

e pe

r qu

ery

(ms)



Postprocessing approachQuery integration (simple)

Query integration (optimized)

100

150

200

250

300

350

400

450

500

550

10 20 30 40 50 60 70 80 90 100

Ave

rage

tim

e pe

r qu

ery

(ms)



Postprocessing approachQuery integration (simple)

Query integration (optimized)

(a) Without cache effects: All posting lists have to befetched from disk.

(b) With cache effects: All posting lists are fetchedfrom the disk cache.

Figure 9: Performance comparison – query integration using GCL operators (simple and optimized) vs. postprocessingapproach. Time per query for the optimized integration is between 61% and 117% compared to postprocessing.

Figure 8. Note that, in the figure, security restric-tions are applied to the sub-expression “[3]” (all in-dex extents of length 3), which, of course, does notmake much sense, but is done by our implementa-tion anyway.

Despite the possibility of other optimizationstrategies leading to better performance for certainqueries, the move-to-top strategy works very well for“flat” relevance queries, such as the ones we used inour experiments:

(<doc> · · · </doc>) B (T1 ∨ T2 ∨ . . . ∨ Tn),

where the Ti are the query terms. The perfor-mance gains caused by moving the security restric-tions to the top of the tree are shown in Figure9. With optimizations, the query integration is be-tween 12-17% slower (100% files visible) and 20-39% faster (10% files visible) than the postprocess-ing approach. Even if most files in the file sys-tem are searchable by the user, this is only a minorslowdown that is probably acceptable, given the in-creased security.

9 Conclusion

Guided by the goal to reduce the overall index diskspace consumption, we have investigated the secu-rity problems that arise if, instead of many per-userindices, a single system-wide index is used to pro-cess search queries from all users in a multi-user filesystem search environment.

If the same system-wide wide is accessed by allusers, appropriate mechanisms have to be employedin order to make sure that no search results violateany file permissions. Our full-text search securitymodel specifies what it means for a user to have theprivilege to search a file. It integrates into the UNIX

security model and defines the search privilege as acombination of read and execution privileges.

For one possible implementation of the securitymodel, based on the postprocessing approach, wehave demonstrated how an arbitrary user can in-fer the number of files in the file system containinga given term, without having read access to thesefiles. This represents a major security problem. Thesecond implementation we presented, a query inte-gration approach, does not share this problem, butmay lead to a query processing slowdown of up to75% in certain situations.

We have shown that, using appropriate query op-timization techniques based on certain properties ofthe structural query operators in our retrieval sys-tem, this slowdown can be decreased to a point atwhich queries are processed between 39% faster and17% slower by the query integration than by thepostprocessing approach, depending on what per-centage of the files in the file system is searchableby the user.

References

[App05] Apple Spotlight Technology Brief.http://images.apple.com/macosx/pdf/MacOSX Spotlight TB.pdf, June 2005.

[BC05] S. Buttcher and C. L. A. Clarke. Index-ing Time vs. Query Time Trade-offs inDynamic Inf. Retr. Systems. WumpusTech. Report 2005-01, July 2005.

[BCC94] E. W. Brown, J. P. Callan, and W. B.Croft. Fast Incremental Indexing forFull-Text Information Retrieval. InVLDB 1994, pages 192–202, Santiago,Chile, September 1994.

13

[BJS95] E. Bertino, S. Jajodia, and P. Samarati.Database Security: Research and Prac-tice. Information Systems, 20(7):537–556, 1995.

[BSJ97] E. Bertino, P. Samarati, and S. Jajo-dia. An Extended Authorization Modelfor Relational Databases. IEEE Trans-actions on Knowledge and Data Engi-neering, 9(1):85–101, 1997.

[CCB95] C. L. A. Clarke, G. V. Cormack, andF. J. Burkowski. An Algebra for Struc-tured Text Search and a Frameworkfor its Implementation. The ComputerJournal, 38(1):43–56, 1995.

[CGHH91] K. Church, W. Gale, P. Hanks, andD. Hindle. Using Statistics in LexicalAnalysis. In Uri Zernik, editor, LexicalAcquisition: Exploiting On-Line Re-sources to Build a Lexicon, pages 115–164, 1991.

[CH98] T. Chiueh and L. Huang. EfficientReal-Time Index Updates in Text Re-trieval Systems. Technical report,Stony Brook University, Stony Brook,New York, USA, August 1998.

[DDL+90] S. C. Deerwester, S. T. Dumais, T. K.Landauer, G. W. Furnas, and R. A.Harshman. Indexing by Latent Seman-tic Analysis. Journal of the Amer-ican Society of Information Science,41(6):391–407, 1990.

[FA88] G. Fernandez and L. Allen. Extendingthe UNIX Protection Model with Ac-cess Control Lists. In Proceedings ofUSENIX, 1988.

[Fag78] R. Fagin. On an Authorization Mecha-nism. ACM Transactions on DatabaseSystems, 3(3):310–319, 1978.

[GS95] W. Gale and G. Sampson. Good-Turing Frequency Estimation WithoutTears. Journal of Quantitative Linguis-tics, 2:217–237, 1995.

[GW76] P. P. Griffiths and B. W. Wade. AnAuthorization Mechanism for a Rela-tional Database System. ACM Trans-actions on Database Systems, 1(3):242–255, 1976.

[JSBS98] B. J. Jansen, A. Spink, J. Bateman,and T. Saracevic. Real Life Informa-tion Retrieval: A Study of User Querieson the Web. SIGIR Forum, 32(1):5–17,

1998.

[LZW04] N. Lester, J. Zobel, and H. E. Williams.In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategiesfor Text Retrieval Systems. In Pro-ceedings of the 27th Conference on Aus-tralasian Comp. Sci., pages 15–23. Aus-tralian Computer Society, Inc., 2004.

[PBMW98] L. Page, S. Brin, R. Motwani, andT. Winograd. The PageRank CitationRanking: Bringing Order to the Web.Technical report, Stanford Digital Li-brary Technologies Project, 1998.

[Pel97] K. Peltonen. Adding Full Text In-dexing to the Operating System. InProceedings of the Thirteenth Interna-tional Conference on Data Engineer-ing (ICDE 1997), pages 386–390. IEEEComputer Society, 1997.

[PZSD96] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval withfrequency-sorted indexes. Journal ofthe American Society for InformationScience, 47(10):749–764, October 1996.

[Rit78] D. M. Ritchie. The UNIX Time-Sharing System: A Retrospective. BellSystems Technical Journal, 57(6):1947–1969, 1978.

[RT74] D. M. Ritchie and K. Thompson. TheUNIX Time-Sharing System. Comm.of the ACM, 17(7):365–375, 1974.

[RWHB98] S. E. Robertson, S. Walker, andM. Hancock-Beaulieu. Okapi at TREC-7. In TREC 1998, November 1998.

[RWJ+94] S. E. Robertson, S. Walker, S. Jones,M. Hancock-Beaulieu, and M. Gatford.Okapi at TREC-3. In TREC 1994,November 1994.

[TF95] H. Turtle and J. Flood. Query Evalua-tion: Strategies and Optimization. In-formation Processing & Management,31(1):831–850, November 1995.

[WL81] P. F. Wilms and B. G. Lindsay.A Database Authorization MechanismSupporting Individual and Group Au-thorization. In Distributed Data Shar-ing Systems, pages 273–292, 1981.

[WL93] W. Wong and D. Lee. Implementationsof Partial Document Ranking Using In-verted Files. Information Processing &Management, 29(5):647–669, 1993.

14

a security model for full-text file system search in multi

Documents