trustworthy keyword search for regulatory compliant record retention windsor w. hsu ibm almaden...

37
Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra University of Illinois IBM Almaden

Upload: posy-stafford

Post on 18-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Trustworthy Keyword Search for Regulatory Compliant Record Retention

Windsor W. Hsu

IBM Almaden

Marianne Winslett

University of Illinois

Soumyadeb Mitra

University of Illinois

IBM Almaden

Page 2: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

There is a need for trustworthy record keeping

HIPAA

Focus on ComplianceFocus on Compliance

DigitalDigitalInformationInformationExplosionExplosion

SoaringSoaringDiscoveryDiscovery

CostsCosts

Spending oneDiscovery Growing

at 65% CAGR

Average F500Company Has 125

Non-FrivolousLawsuits at Any

Given Time

IDC Forecasts60B Business

Emails AnnuallyBy 2006

Instant Messaging

Files

Email

Records

Corporate Corporate MisconductMisconduct

Sources: IDC, Network World (2003), Socha / Gelbmann (2004)

Page 3: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

What is trustworthy record keeping?

Alice

Storage Device

time

Bob

QueryRegret

Adversary

CommitRecord

Establish solid proof of events that have occurred

Bob should get back Alice’s data

Page 4: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

This leads to a unique threat model

time

Query is trustworthy

Commit is trustworthy

Adversary has super-user privileges

Record is createdproperly

Record is queriedproperly

• Access to storage device• Access to any keys

Adversary could be Alice herself

Page 5: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Traditional schemes do not work

time

Cannot rely on Alice’s signature

Page 6: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

WORM storage helps address the problem

New Record Record Overwrite/

Delete

Write Once Read Many

Adversary cannotdelete Alice’s record

Page 7: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Index required due to high volume of records

Query from Index

Alice

time

Bob

Regret

Adversary

Commit RecordUpdate Index

Index

Page 8: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

In effect, records can be hidden/altered by modifying the index

Or replace B with B’

The index must also be secured (fossilized)

Hide record B from the

index A B B’B

Page 9: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Most business records are unstructured, searched by inverted index

Query

Data

Base

Worm

Index

1 3 11 17

3 9

3 19

7 36

3

Keywords Posting Lists

One WORM file for each posting list

Page 10: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Index must be updated as new documents arrive

Query

Data

Base

Worm

Index

1 3 11 17

3 9

3 19

7 36

3

Keywords Posting Lists

Data

Index

Query

Doc: 7979

79

79

500 keywords = 500 disk seeks ~1 sec per document

Page 11: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Amortize cost by updating in batch

Query

Data

Base

Worm

Index

1 3 11 17

3 9

3 19

7 36

3

Keywords Posting Lists

Query

Doc: 79

Query

Doc: 82

Query

Doc: 83

79 81 83

Buffer

1 seek per keyword in batch Large buffer to benefit infrequent terms

Over 100,000 documents to achieve 2 docs/sec

Doc: 80Doc: 81

Page 12: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Index is not updated immediately

Alice

Adversary

Index

Buffer

time

OmitAlter

Buffer

Prevailing practice – email must be committed before it is delivered

CommitRecord

Page 13: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Can storage server cache help?

Storage servers have huge cache Data committed into cache is effectively on

disk Is battery backed-up Inside the WORM box, so is trustworthy

Page 14: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Query

Data

Base

Worm

Index

1 3 11 17

3 9

3 19

7 36

3

Base

Index

Query

Doc: 7979

Cache Miss

Caching does not benefit infrequent terms

79 Cache Miss

79 Cache Miss

Query

Doc: 80

80

Cache Hit

Caching works in blocks

Page 15: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Simulation results show caching is not enough

Cache Misses Per Doc

0

50

100

150

200

250

300

350

400

450

500

0 4 8 16 32 64 128

256

512

1024

2048

4096

Cache Size

I/O

Per

Do

c

Cache Miss

Page 16: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Simulation results show caching is not enough

What if number posting lists ≤ Number of cache blocks?

Each update will hit the cache

Page 17: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

So, merge posting lists so that the tails blocks fit in cache

Query

Data

Base

Worm

Index

1 3 11

3 9 31

3 19

7 36

3

Only 1 random I/O per document, for 4K block size

00 1 00 3 01 3 10 3

01 9 00 11 10 19 01 31

Document IDs

Keyword Encodings

Page 18: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

VLDB 2006 : Query answered by scanning both posting lists

The tradeoff is longer lists to scan during lookup

Workload lookup cost before merging:

∑ tw qw

After merging into A = {A1, …, An} :

∑ ( ∑ tw ) (∑ qw )w € A w € AA

w

length of posting list for keyword w

# of times w is queried in workload

length of A

# of times A is searched

Page 19: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Which lists to merge? We need a heuristic solution

Choose A={A1, A2 .. An} n = Cache blocks Minimize ∑ ( ∑ tw ) * (∑ qw )

Problem is NP-complete Reduction from minimum-sum-of-squares

So, try some merging heuristics on a real-world workload 1 million documents from IBM’s intranet 300,000 queries

Page 20: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

A few terms contribute most of the query workload cost

0.E+00

1.E+09

2.E+09

3.E+09

4.E+09

5.E+09

6.E+09

0 5000 10000 15000 20000 25000

Term Rank

Wo

rklo

ad C

ost

QF

TF

(tw *qw)

Page 21: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Different merging heuristics were tried

Separate lists for high contributor terms Merging heuristics

Based on qw tw

Random merging Details of heuristics, evaluation in paper

Page 22: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Additional index support is needed to answer conjunctive queries quickly

2

7

13

24

31

3

24

Merge Join : O (m+n)

24

2

7

13

24

31

3

24

24

Index Join : m log(n)

2

13

31

2

31

VLDB and 2006

VLDB 2006

n

m

Page 23: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

How to maintain B+trees on WORM B+trees require node split and join B+tree on posting list are special case

Documents IDs are inserted in an increasing order Can be built bottom up without split/joins Please refer to our paper

Page 24: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

B+tree index is insecure, even on WORM

2 4 7 11 13 19 23 29 31

7 13

23

31

25

25 26 30

27

Path to an element depends on elements inserted later – Adversary can attack it

Page 25: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Our solution is jump indexes

Path to an element only depends on elements inserted before

Jump index is provably trustworthy Leverages the fact that document IDs are increasing O(log N) lookup : N - # of documents

Supports range queries too Reasonable performance as compared to B+ trees

for conjunctive queries in experiments with real-workload

For details, see our paper

Page 26: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Conclusions

WORM storage by itself is not enough We need a trustworthy index too

Trustworthy inverted indexes can be built efficiently 10-15% slowdown, for non-conjunctive queries Within 1.5x of optimal B-tree performance, for

conjunctive queries Other possible uses

Indexing time

Page 27: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Future work

Ranking attacks Adversary can attack a lot of junk documents

Secure generic index structure Path to an element is fixed Supports range queries

Page 28: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Questions

Page 29: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

A new index structure required: Jump Index

n 0 1 2 3 4 5

Element Pointers

ith pointer points to an element ni

n + 2i ≤ ni < n + 2(i+1)

n+2

n + 2 1 ≤ n+2 < n + 2 2

Page 30: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Jump index in action

1

2

1 + 2 0 ≤ 2 < 1 + 2 1 1 + 2 2 ≤ 5 < 1 + 2 3

0 1 2 3 4

71 + 2 2 ≤ 7 < 1 + 2 3

5 + 2 1 ≤ 7 < 5 + 2 2

5

0 1 2 3 4

Already Set

Follow Pointer

log(N) pointers to N

Page 31: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Path to an element does not depend on future elements

1

2

0 1 2 3 4

7

5

0 1 2 3 4

1 + 2 2 ≤ 7 < 1 + 2 3

Follow Pointer

Start here

5 + 2 1 ≤ 7 < 5 + 2 2

Got 7

Lookup (7)

Page 32: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Jump index elements are stored in blocks

Jump Pointersp entries(0

,1)

(0,2

)

(0,B

-1)

(1,0

)

(0,1

)

( i ,

j )

.. ..l

l + j*Bi ≤ x < l + (j+1)*Bi

Storing pointers with every element is inefficient p entries are grouped together Brach factor B.

Pointer (i,j) from block b points to b’ having smallest x

Page 33: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Jump index evaluation parameters p : Elements grouped together B : Branching factor L : Block size

p + (B-1) logB N ≤ L

Evaluation Index update performance

Pointers have to be set Query performance

Page 34: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Update performance levels off at reasonable cache size

0

10

20

30

40

50

60

128MB

160MB

192MB

224MB

256MB

288MB

320MB

352MB

384MB

Cache Size

I/O

s p

er

Do

c 16

32

64

128

Page 35: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Query performance is close to optimal (Btree)

0

1

2

3

4

5

6

7

2 3 4 5 6

Terms in Query

Sp

ee

du

p

1-8

1-16

1-32

1-64

1-128

"btree"

Page 36: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Btree on WORM

Page 37: Trustworthy Keyword Search for Regulatory Compliant Record Retention Windsor W. Hsu IBM Almaden Marianne Winslett University of Illinois Soumyadeb Mitra

Is this a real threat?

Would someone want to delete a record after a day its created?

Intrusion detection logging Once adversary gain control, he would like to

delete records of his initial attack Record regretted moments after creation

Email best practice - Must be committed before its delivered