TRANSCRIPT
Trustworthy Keyword Search for Regulatory Compliant Record Retention
Windsor W. Hsu
IBM Almaden
Marianne Winslett
University of Illinois
Soumyadeb Mitra
University of Illinois
IBM Almaden
There is a need for trustworthy record keeping
Drivers:
- Focus on compliance (e.g., HIPAA)
- Digital information explosion: IDC forecasts 60B business emails annually by 2006, plus instant messaging, files, and records
- Soaring discovery costs: spending on eDiscovery growing at 65% CAGR; the average F500 company has 125 non-frivolous lawsuits at any given time
- Corporate misconduct
Sources: IDC, Network World (2003), Socha / Gelbmann (2004)
What is trustworthy record keeping?
Timeline: Alice commits a record to a storage device; over time she may come to regret it; later, Bob queries the device for the record. An adversary may act in between.
Establish solid proof of events that have occurred
Bob should get back Alice’s data
This leads to a unique threat model
- The commit is trustworthy: the record is created properly
- The query is trustworthy: the record is queried properly
- In between, the adversary has super-user privileges:
  • Access to the storage device
  • Access to any keys
- The adversary could be Alice herself
Traditional schemes do not work
Cannot rely on Alice's signature: the adversary may be Alice herself, with access to any keys.
WORM storage helps address the problem
Write Once Read Many (WORM): new records can be written, but overwrite and delete are blocked.
The adversary cannot delete Alice's record.
Index required due to high volume of records
Timeline: Alice commits a record and updates the index; later, Bob queries through the index; the adversary acts in between.
In effect, records can be hidden or altered by modifying the index:
- Hide record B by omitting it from the index
- Or replace B with B'
The index must also be secured (fossilized).
Most business records are unstructured, searched by inverted index
[Diagram: query against an inverted index on WORM; each keyword maps to a posting list of document IDs, e.g. 1 3 11 17]
One WORM file for each posting list
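As a concrete sketch of this layout, one append-only file per posting list might look like the following (the file format and class names are hypothetical, not the system's actual on-disk structure):

```python
import os

# Minimal sketch (hypothetical layout): each keyword's posting list lives in
# its own append-only "WORM" file of document IDs, so committed postings are
# never rewritten in place.
class WormInvertedIndex:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, keyword):
        return os.path.join(self.root, keyword + ".post")

    def add_document(self, doc_id, keywords):
        # One append (one disk seek) per distinct keyword in the document.
        for kw in set(keywords):
            with open(self._path(kw), "a") as f:   # append-only, like WORM
                f.write(f"{doc_id}\n")

    def lookup(self, keyword):
        try:
            with open(self._path(keyword)) as f:
                return [int(line) for line in f]
        except FileNotFoundError:
            return []
```

Since documents are committed in order, each posting list ends up sorted by document ID for free.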
Index must be updated as new documents arrive
[Diagram: committing doc 79 requires appending 79 to the posting list of each of its keywords]
A document with ~500 distinct keywords costs ~500 disk seeks, i.e. about 1 second per document.
Amortize cost by updating in batch
[Diagram: postings for incoming docs 79, 80, 81, 82, 83 accumulate in a buffer and are flushed to the WORM posting lists in one batch]
- 1 seek per keyword per batch
- A large buffer is needed to benefit infrequent terms: over 100,000 buffered documents to achieve 2 docs/sec
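The batching idea can be sketched as follows (the `BatchedUpdater` class and flush threshold are illustrative; note that the in-memory buffer is precisely the unprotected window discussed next):

```python
from collections import defaultdict

# Sketch of batched posting-list updates. Instead of one seek per keyword per
# document, postings accumulate in RAM and are flushed once per keyword per
# batch, amortizing the random I/O across many documents.
class BatchedUpdater:
    def __init__(self, index, flush_after_docs=100_000):
        self.index = index            # anything with append(keyword, doc_ids)
        self.flush_after_docs = flush_after_docs
        self.buffer = defaultdict(list)
        self.buffered_docs = 0

    def add_document(self, doc_id, keywords):
        for kw in set(keywords):
            self.buffer[kw].append(doc_id)
        self.buffered_docs += 1
        if self.buffered_docs >= self.flush_after_docs:
            self.flush()

    def flush(self):
        # One sequential append (one seek) per keyword present in the batch.
        for kw, doc_ids in sorted(self.buffer.items()):
            self.index.append(kw, doc_ids)
        self.buffer.clear()
        self.buffered_docs = 0
```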
Index is not updated immediately
Timeline: Alice commits a record, but its postings sit in the buffer; the adversary can omit or alter buffered entries before they reach the WORM index.
Prevailing practice: an email must be committed before it is delivered.
Can storage server cache help?
- Storage servers have a huge cache
- Data committed into the cache is effectively on disk: the cache is battery backed-up, and it sits inside the WORM box, so it is trustworthy
[Diagram: appending doc 79 to infrequent terms' posting lists causes cache misses; appending doc 80 to a frequent term's list is a cache hit]
- Caching works in blocks, so it does not benefit infrequent terms
Simulation results show caching is not enough
[Chart: cache misses (I/Os) per document vs. cache size, for cache sizes from 0 to 4096 blocks; misses per document stay high even for large caches]
What if the number of posting lists is ≤ the number of cache blocks? Then each update will hit the cache.
So, merge posting lists so that the tail blocks fit in cache.
[Diagram: merged posting lists on WORM, e.g. lists 1 3 11 and 3 9 31 combined]
Only 1 random I/O per document, for a 4K block size.
Merged lists store (keyword encoding, document ID) pairs, e.g.:
(00,1) (00,3) (01,3) (10,3) (01,9) (00,11) (10,19) (01,31)
A query such as "VLDB 2006" is answered by scanning the merged posting lists of both keywords.
The tradeoff is longer lists to scan during lookup.
Workload lookup cost before merging:
    Σ_w t_w · q_w
where t_w is the length of the posting list for keyword w, and q_w is the # of times w is queried in the workload.
After merging into A = {A_1, …, A_n}:
    Σ_{A_i ∈ A} ( Σ_{w ∈ A_i} t_w ) · ( Σ_{w ∈ A_i} q_w )
where Σ_{w ∈ A_i} t_w is the length of merged list A_i, and Σ_{w ∈ A_i} q_w is the # of times A_i is searched.
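A quick sanity check of this cost model in Python (the t_w and q_w values are made up): merging lists cuts update seeks but raises lookup cost, and merging a heavy-hitter in with other terms raises it sharply, which motivates keeping high contributors separate.

```python
# Illustrative cost-model check; posting-list lengths (t) and query
# frequencies (q) below are invented for the example.

def cost_unmerged(t, q):
    # Before merging: each query of w scans only w's own list.
    return sum(t[w] * q[w] for w in t)

def cost_merged(t, q, groups):
    # After merging: every query of any member of a group scans the
    # whole merged list, giving (sum t_w) * (sum q_w) per group.
    return sum(
        sum(t[w] for w in g) * sum(q[w] for w in g)
        for g in groups
    )

t = {"vldb": 4, "2006": 2, "worm": 2, "index": 1}   # posting-list lengths
q = {"vldb": 10, "2006": 1, "worm": 1, "index": 5}  # query frequencies

base = cost_unmerged(t, q)       # 4*10 + 2*1 + 2*1 + 1*5 = 49
good = cost_merged(t, q, [{"vldb"}, {"2006", "worm", "index"}])   # 40 + 35 = 75
bad = cost_merged(t, q, [{"vldb", "2006", "worm", "index"}])      # 9 * 17 = 153
```

Keeping the frequent term "vldb" in its own list limits the lookup-cost increase (75 vs. 153 here).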
Which lists to merge? We need a heuristic solution.
- Choose A = {A_1, …, A_n}, n = # of cache blocks, minimizing Σ (Σ t_w) · (Σ q_w)
- The problem is NP-complete (reduction from minimum-sum-of-squares)
- So, try merging heuristics on a real-world workload: 1 million documents from IBM's intranet, 300,000 queries
A few terms contribute most of the query workload cost
[Chart: workload cost (t_w · q_w) vs. term rank, with curves labeled QF and TF; the first few thousand terms account for most of the cost]
Different merging heuristics were tried:
- Separate lists for high-contributor terms
- Merging heuristics based on q_w, t_w
- Random merging
Details of the heuristics and their evaluation are in the paper.
Additional index support is needed to answer conjunctive queries quickly
Example: conjunctive query "VLDB and 2006", with posting lists [2, 7, 13, 24, 31] (length n) and [3, 24] (length m); the answer is {24}.
- Merge join: scan both sorted lists in tandem, O(m + n)
- Index join: for each of the m entries in the shorter list, probe an index on the longer list, O(m log n)
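The two join strategies can be sketched as follows (posting-list contents are the slide's example; a binary search via `bisect` stands in for a real index probe):

```python
from bisect import bisect_left

def merge_join(a, b):
    # Scan both sorted posting lists in tandem: O(m + n).
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def index_join(short, long):
    # For each entry of the shorter list, probe the longer list with a
    # binary search (standing in for an index lookup): O(m log n).
    out = []
    for x in short:
        k = bisect_left(long, x)
        if k < len(long) and long[k] == x:
            out.append(x)
    return out
```

Index join wins when one list is much shorter than the other (m log n « m + n); merge join wins when the lists are comparable in length.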
How to maintain B+trees on WORM?
- B+trees require node splits and joins
- But a B+tree on a posting list is a special case: document IDs are inserted in increasing order, so the tree can be built bottom-up without splits or joins (please refer to our paper)
B+tree index is insecure, even on WORM
[Diagram: B+tree over document IDs 2 4 7 11 13 19 23 29 31; later insertions (25, 26, 27, 30) restructure internal nodes]
The path to an element depends on elements inserted later, so an adversary can attack it.
Our solution is jump indexes
- The path to an element depends only on elements inserted before it
- The jump index is provably trustworthy; it leverages the fact that document IDs are increasing
- O(log N) lookup, where N = # of documents
- Supports range queries too
- Reasonable performance compared to B+trees for conjunctive queries, in experiments with a real workload
For details, see our paper.
Conclusions
- WORM storage by itself is not enough; we need a trustworthy index too
- Trustworthy inverted indexes can be built efficiently:
  • 10-15% slowdown for non-conjunctive queries
  • Within 1.5x of optimal B-tree performance for conjunctive queries
- Other possible uses: indexing time
Future work:
- Ranking attacks: an adversary can add a lot of junk documents
- A secure generic index structure: path to an element is fixed; supports range queries
Questions
A new index structure required: Jump Index
Each element n stores a row of element pointers; the i-th pointer points to an element n_i with
    n + 2^i ≤ n_i < n + 2^(i+1)
Example: pointer 1 from n points into [n+2, n+4), e.g. to n+2, since n + 2^1 ≤ n+2 < n + 2^2.
Jump index in action
Insertion example (elements arrive in order 1, 2, 5, 7):
- Insert 2: set pointer 0 of element 1, since 1 + 2^0 ≤ 2 < 1 + 2^1
- Insert 5: set pointer 2 of element 1, since 1 + 2^2 ≤ 5 < 1 + 2^3
- Insert 7: pointer 2 of element 1 (1 + 2^2 ≤ 7 < 1 + 2^3) is already set, so follow it to 5; then set pointer 1 of element 5, since 5 + 2^1 ≤ 7 < 5 + 2^2
Reaching any of N elements takes O(log N) pointers.
Path to an element does not depend on future elements
Lookup(7), starting at element 1:
- 1 + 2^2 ≤ 7 < 1 + 2^3, so follow pointer 2 to element 5
- 5 + 2^1 ≤ 7 < 5 + 2^2, so follow pointer 1 to element 7: got 7
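A toy version of the binary (B = 2) jump index, matching the insertion and lookup walk above (an in-memory dict stands in for WORM storage; the key property is that each pointer is written once and never changed):

```python
# Toy jump index sketch, binary version (B = 2); block layout omitted.
# Pointer i of element n targets some n_i with n + 2^i <= n_i < n + 2^(i+1).
# Elements must be inserted in increasing order, so the path to an element
# never depends on elements inserted later.
class JumpIndex:
    def __init__(self):
        self.ptrs = {}     # element -> {level i: target element}
        self.first = None

    def insert(self, x):
        if self.first is None:
            self.first = x
            self.ptrs[x] = {}
            return
        cur = self.first
        while True:
            i = (x - cur).bit_length() - 1   # cur + 2^i <= x < cur + 2^(i+1)
            nxt = self.ptrs[cur].get(i)
            if nxt is None:
                self.ptrs[cur][i] = x        # write-once, WORM-friendly
                self.ptrs[x] = {}
                return
            cur = nxt                        # already set: follow it

    def lookup(self, x):
        cur = self.first
        while cur is not None and cur < x:
            i = (x - cur).bit_length() - 1
            cur = self.ptrs[cur].get(i)
        return cur == x
```

Inserting 1, 2, 5, 7 reproduces the slide: element 1 ends up with pointers {0: 2, 2: 5} and element 5 with {1: 7}.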
Jump index elements are stored in blocks
- Storing pointers with every element is inefficient, so p entries are grouped together into blocks, with branch factor B
- Pointer (i, j) from a block b points to the block b' containing the smallest element x with
    l + j·B^i ≤ x < l + (j+1)·B^i
  where l is an element of b
Jump index evaluation parameters:
- p: # of elements grouped together per block
- B: branching factor
- L: block size, constrained by p + (B-1)·log_B N ≤ L
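A quick arithmetic sketch of this capacity constraint (the slot counts below are illustrative, not the paper's configuration): a block of L slots must hold p document IDs plus (B-1) pointers for each of the ceil(log_B N) levels.

```python
import math

# Illustrative check of the constraint p + (B-1) * log_B(N) <= L.

def pointer_slots(B, N):
    # (B-1) pointers per level, ceil(log_B N) levels to reach N documents.
    levels = math.ceil(math.log(N, B))
    return (B - 1) * levels

def max_entries_per_block(B, L, N):
    # Largest p that still fits alongside the pointers in an L-slot block.
    return L - pointer_slots(B, N)

# e.g. a 4K block of 512 8-byte slots, a billion documents, B = 32:
# (32-1) * ceil(log_32 1e9) = 31 * 6 = 186 pointer slots,
# leaving up to 326 slots for document IDs.
```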
Evaluation:
- Index update performance: pointers have to be set
- Query performance
Update performance levels off at reasonable cache size
[Chart: I/Os per document vs. cache size (128MB-384MB), for p = 16, 32, 64, 128]
Query performance is close to optimal (Btree)
[Chart: speedup vs. number of terms in query (2-6), for jump index configurations 1-8 through 1-128 and a B-tree-on-WORM baseline]
Is this a real threat?
Would someone want to delete a record a day after it is created?
- Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack; the record is regretted moments after creation
- Email best practice: a message must be committed before it is delivered