TRANSCRIPT
Trustworthy Keyword Search for Regulatory Compliant Record Retention
Windsor W. Hsu
IBM Almaden
Marianne Winslett
University of Illinois
Soumyadeb Mitra
University of Illinois
IBM Almaden
There is a need for trustworthy record keeping
Drivers:
- Focus on compliance (e.g., HIPAA)
- Digital information explosion: IDC forecasts 60B business emails annually by 2006, plus instant messaging, files, and records
- Soaring discovery costs: spending on eDiscovery growing at 65% CAGR; the average F500 company has 125 non-frivolous lawsuits at any given time
- Corporate misconduct
Sources: IDC, Network World (2003), Socha / Gelbmann (2004)
What is trustworthy record keeping?
Timeline: Alice commits a record to a storage device; over time she may come to regret it; later, Bob queries the device for the record. An adversary may act in between.
Establish solid proof of events that have occurred
Bob should get back Alice’s data
This leads to a unique threat model
- The commit is trustworthy: the record is created properly
- The query is trustworthy: the record is queried properly
- In between, the adversary has super-user privileges:
  • Access to the storage device
  • Access to any keys
- The adversary could be Alice herself
Traditional schemes do not work
Cannot rely on Alice's signature: the adversary may be Alice herself, with access to any keys.
WORM storage helps address the problem
Write Once Read Many (WORM): new records can be written, but overwrite and delete are blocked.
The adversary cannot delete Alice's record.
Index required due to high volume of records
Timeline: Alice commits a record and updates the index; later, Bob queries through the index; the adversary acts in between.
In effect, records can be hidden or altered by modifying the index:
- Hide record B by omitting it from the index
- Or replace B with B'
The index must also be secured (fossilized).
Most business records are unstructured, searched by inverted index
[Diagram: query against an inverted index on WORM; each keyword maps to a posting list of document IDs, e.g. 1 3 11 17]
One WORM file for each posting list
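As a concrete sketch of this layout, one append-only file per posting list might look like the following (the file format and class names are hypothetical, not the system's actual on-disk structure):

```python
import os

# Minimal sketch (hypothetical layout): each keyword's posting list lives in
# its own append-only "WORM" file of document IDs, so committed postings are
# never rewritten in place.
class WormInvertedIndex:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, keyword):
        return os.path.join(self.root, keyword + ".post")

    def add_document(self, doc_id, keywords):
        # One append (one disk seek) per distinct keyword in the document.
        for kw in set(keywords):
            with open(self._path(kw), "a") as f:   # append-only, like WORM
                f.write(f"{doc_id}\n")

    def lookup(self, keyword):
        try:
            with open(self._path(keyword)) as f:
                return [int(line) for line in f]
        except FileNotFoundError:
            return []
```

Since documents are committed in order, each posting list ends up sorted by document ID for free.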
Index must be updated as new documents arrive
[Diagram: committing doc 79 requires appending 79 to the posting list of each of its keywords]
A document with ~500 distinct keywords costs ~500 disk seeks, i.e. about 1 second per document.
Amortize cost by updating in batch
[Diagram: postings for incoming docs 79, 80, 81, 82, 83 accumulate in a buffer and are flushed to the WORM posting lists in one batch]
- 1 seek per keyword per batch
- A large buffer is needed to benefit infrequent terms: over 100,000 buffered documents to achieve 2 docs/sec
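The batching idea can be sketched as follows (the `BatchedUpdater` class and flush threshold are illustrative; note that the in-memory buffer is precisely the unprotected window discussed next):

```python
from collections import defaultdict

# Sketch of batched posting-list updates. Instead of one seek per keyword per
# document, postings accumulate in RAM and are flushed once per keyword per
# batch, amortizing the random I/O across many documents.
class BatchedUpdater:
    def __init__(self, index, flush_after_docs=100_000):
        self.index = index            # anything with append(keyword, doc_ids)
        self.flush_after_docs = flush_after_docs
        self.buffer = defaultdict(list)
        self.buffered_docs = 0

    def add_document(self, doc_id, keywords):
        for kw in set(keywords):
            self.buffer[kw].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_after_docs:
            self.flush()

    def flush(self):
        # One sequential append (one seek) per keyword present in the batch.
        for kw, doc_ids in sorted(self.buffer.items()):
            self.index.append(kw, doc_ids)
        self.buffer.clear()
        self.buffered_docs = 0
```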
Index is not updated immediately
Timeline: Alice commits a record, but its postings sit in the buffer; the adversary can omit or alter buffered entries before they reach the WORM index.
Prevailing practice: an email must be committed before it is delivered.
Can storage server cache help?
- Storage servers have a huge cache
- Data committed into the cache is effectively on disk: the cache is battery backed-up, and it sits inside the WORM box, so it is trustworthy
[Diagram: appending doc 79 to infrequent terms' posting lists causes cache misses; appending doc 80 to a frequent term's list is a cache hit]
- Caching works in blocks, so it does not benefit infrequent terms
Simulation results show caching is not enough
[Chart: cache misses (I/Os) per document vs. cache size, for cache sizes from 0 to 4096 blocks; misses per document stay high even for large caches]
What if the number of posting lists is ≤ the number of cache blocks? Then each update will hit the cache.
So, merge posting lists so that the tail blocks fit in cache.
[Diagram: merged posting lists on WORM, e.g. lists 1 3 11 and 3 9 31 combined]
Only 1 random I/O per document, for a 4K block size.
Merged lists store (keyword encoding, document ID) pairs, e.g.:
(00,1) (00,3) (01,3) (10,3) (01,9) (00,11) (10,19) (01,31)
A query such as "VLDB 2006" is answered by scanning the merged posting lists of both keywords.
The tradeoff is longer lists to scan during lookup.
Workload lookup cost before merging:
    Σ_w t_w · q_w
where t_w is the length of the posting list for keyword w, and q_w is the # of times w is queried in the workload.
After merging into A = {A_1, …, A_n}:
    Σ_{A_i ∈ A} ( Σ_{w ∈ A_i} t_w ) · ( Σ_{w ∈ A_i} q_w )
where Σ_{w ∈ A_i} t_w is the length of merged list A_i, and Σ_{w ∈ A_i} q_w is the # of times A_i is searched.
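A quick sanity check of this cost model in Python (the t_w and q_w values are made up): merging lists cuts update seeks but raises lookup cost, and merging a heavy-hitter in with other terms raises it sharply, which motivates keeping high contributors separate.

```python
# Illustrative cost-model check; posting-list lengths (t) and query
# frequencies (q) below are invented for the example.

def cost_unmerged(t, q):
    # Before merging: each query of w scans only w's own list.
    return sum(t[w] * q[w] for w in t)

def cost_merged(t, q, groups):
    # After merging: every query of any member of a group scans the
    # whole merged list, giving (sum t_w) * (sum q_w) per group.
    return sum(
        sum(t[w] for w in g) * sum(q[w] for w in g)
        for g in groups
    )

t = {"vldb": 4, "2006": 2, "worm": 2, "index": 1}   # posting-list lengths
q = {"vldb": 10, "2006": 1, "worm": 1, "index": 5}  # query frequencies

base = cost_unmerged(t, q)       # 4*10 + 2*1 + 2*1 + 1*5 = 49
good = cost_merged(t, q, [{"vldb"}, {"2006", "worm", "index"}])   # 40 + 35 = 75
bad = cost_merged(t, q, [{"vldb", "2006", "worm", "index"}])      # 9 * 17 = 153
```

Keeping the frequent term "vldb" in its own list limits the lookup-cost increase (75 vs. 153 here).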
Which lists to merge? We need a heuristic solution.
- Choose A = {A_1, …, A_n}, n = # of cache blocks, minimizing Σ (Σ t_w) · (Σ q_w)
- The problem is NP-complete (reduction from minimum-sum-of-squares)
- So, try merging heuristics on a real-world workload: 1 million documents from IBM's intranet, 300,000 queries
A few terms contribute most of the query workload cost
[Chart: workload cost (t_w · q_w) vs. term rank, with curves labeled QF and TF; the first few thousand terms account for most of the cost]
Different merging heuristics were tried:
- Separate lists for high-contributor terms
- Merging heuristics based on q_w, t_w
- Random merging
Details of the heuristics and their evaluation are in the paper.
Additional index support is needed to answer conjunctive queries quickly
Example: conjunctive query "VLDB and 2006", with posting lists [2, 7, 13, 24, 31] (length n) and [3, 24] (length m); the answer is {24}.
- Merge join: scan both sorted lists in tandem, O(m + n)
- Index join: for each of the m entries in the shorter list, probe an index on the longer list, O(m log n)
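The two join strategies can be sketched as follows (posting-list contents are the slide's example; a binary search via `bisect` stands in for a real index probe):

```python
from bisect import bisect_left

def merge_join(a, b):
    # Scan both sorted posting lists in tandem: O(m + n).
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long):
    # For each entry of the shorter list, probe the longer list with a
    # binary search (standing in for an index lookup): O(m log n).
    out = []
    for x in short:
        k = bisect_left(long, x)
        if k < len(long) and long[k] == x:
            out.append(x)
    return out
```

Index join wins when one list is much shorter than the other (m log n « m + n); merge join wins when the lists are comparable in length.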
How to maintain B+trees on WORM?
- B+trees require node splits and joins
- But a B+tree on a posting list is a special case: document IDs are inserted in increasing order, so the tree can be built bottom-up without splits or joins (please refer to our paper)
B+tree index is insecure, even on WORM
[Diagram: B+tree over document IDs 2 4 7 11 13 19 23 29 31; later insertions (25, 26, 27, 30) restructure internal nodes]
The path to an element depends on elements inserted later, so an adversary can attack it.
Our solution is jump indexes
- The path to an element depends only on elements inserted before it
- The jump index is provably trustworthy; it leverages the fact that document IDs are increasing
- O(log N) lookup, where N = # of documents
- Supports range queries too
- Reasonable performance compared to B+trees for conjunctive queries, in experiments with a real workload
For details, see our paper.
Conclusions
- WORM storage by itself is not enough; we need a trustworthy index too
- Trustworthy inverted indexes can be built efficiently:
  • 10-15% slowdown for non-conjunctive queries
  • Within 1.5x of optimal B-tree performance for conjunctive queries
- Other possible uses: indexing time
Future work:
- Ranking attacks: an adversary can add a lot of junk documents
- A secure generic index structure: path to an element is fixed; supports range queries
Questions
A new index structure required: Jump Index
Each element n stores a row of element pointers; the i-th pointer points to an element n_i with
    n + 2^i ≤ n_i < n + 2^(i+1)
Example: pointer 1 from n points into [n+2, n+4), e.g. to n+2, since n + 2^1 ≤ n+2 < n + 2^2.
Jump index in action
Insertion example (elements arrive in order 1, 2, 5, 7):
- Insert 2: set pointer 0 of element 1, since 1 + 2^0 ≤ 2 < 1 + 2^1
- Insert 5: set pointer 2 of element 1, since 1 + 2^2 ≤ 5 < 1 + 2^3
- Insert 7: pointer 2 of element 1 (1 + 2^2 ≤ 7 < 1 + 2^3) is already set, so follow it to 5; then set pointer 1 of element 5, since 5 + 2^1 ≤ 7 < 5 + 2^2
Reaching any of N elements takes O(log N) pointers.
Path to an element does not depend on future elements
Lookup(7), starting at element 1:
- 1 + 2^2 ≤ 7 < 1 + 2^3, so follow pointer 2 to element 5
- 5 + 2^1 ≤ 7 < 5 + 2^2, so follow pointer 1 to element 7: got 7
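A toy version of the binary (B = 2) jump index, matching the insertion and lookup walk above (an in-memory dict stands in for WORM storage; the key property is that each pointer is written once and never changed):

```python
# Toy jump index sketch, binary version (B = 2); block layout omitted.
# Pointer i of element n targets some n_i with n + 2^i <= n_i < n + 2^(i+1).
# Elements must be inserted in increasing order, so the path to an element
# never depends on elements inserted later.
class JumpIndex:
    def __init__(self):
        self.ptrs = {}     # element -> {level i: target element}
        self.first = None

    def insert(self, x):
        if self.first is None:
            self.first = x
            self.ptrs[x] = {}
            return
        cur = self.first
        while True:
            i = (x - cur).bit_length() - 1   # cur + 2^i <= x < cur + 2^(i+1)
            nxt = self.ptrs[cur].get(i)
            if nxt is None:
                self.ptrs[cur][i] = x        # write-once, WORM-friendly
                self.ptrs[x] = {}
                return
            cur = nxt                        # already set: follow it

    def lookup(self, x):
        cur = self.first
        while cur is not None and cur < x:
            i = (x - cur).bit_length() - 1
            cur = self.ptrs[cur].get(i)
        return cur == x
```

Inserting 1, 2, 5, 7 reproduces the slide: element 1 ends up with pointers {0: 2, 2: 5} and element 5 with {1: 7}.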
Jump index elements are stored in blocks
- Storing pointers with every element is inefficient, so p entries are grouped together into blocks, with branch factor B
- Pointer (i, j) from a block b points to the block b' containing the smallest element x with
    l + j·B^i ≤ x < l + (j+1)·B^i
  where l is an element of b
Jump index evaluation parameters:
- p: # of elements grouped together per block
- B: branching factor
- L: block size, constrained by p + (B-1)·log_B N ≤ L
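A quick arithmetic sketch of this capacity constraint (the slot counts below are illustrative, not the paper's configuration): a block of L slots must hold p document IDs plus (B-1) pointers for each of the ceil(log_B N) levels.

```python
import math

# Illustrative check of the constraint p + (B-1) * log_B(N) <= L.

def pointer_slots(B, N):
    # (B-1) pointers per level, ceil(log_B N) levels to reach N documents.
    levels = math.ceil(math.log(N, B))
    return (B - 1) * levels

def max_entries_per_block(B, L, N):
    # Largest p that still fits alongside the pointers in an L-slot block.
    return L - pointer_slots(B, N)

# e.g. a 4K block of 512 8-byte slots, a billion documents, B = 32:
# (32-1) * ceil(log_32 1e9) = 31 * 6 = 186 pointer slots,
# leaving up to 326 slots for document IDs.
```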
Evaluation:
- Index update performance: pointers have to be set
- Query performance
Update performance levels off at reasonable cache size
[Chart: I/Os per document vs. cache size (128MB-384MB), for p = 16, 32, 64, 128]
Query performance is close to optimal (Btree)
[Chart: speedup vs. number of terms in query (2-6), for jump index configurations 1-8 through 1-128 and a B-tree-on-WORM baseline]
Is this a real threat?
Would someone want to delete a record a day after it is created?
- Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack; the record is regretted moments after creation
- Email best practice: a message must be committed before it is delivered