Sovereign Information Integration
Rakesh Agrawal
Joint work with Srikant & Evfimievski
Outline
Motivation
Problem Statement
Protocols
Challenges
Information Integration Today
Assumption: Information in each database can be freely shared.
[Figure: two architectures, each answering a query Q with a result R: Federated (through a Mediator) and Centralized.]
Need for a new style of information sharing
Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).
Need is driven by several trends:
– End-to-end integration of information systems across companies.
– Simultaneously compete and cooperate.
– Security: need-to-know information sharing
Selective Document Sharing
R is shopping for technology.
S has intellectual property it may want to license.
First find the specific technologies where there is a match, and then reveal further information about those.
[Figure: R holds a Shopping List; S holds a Technology List.]
Example 2: Govt. agencies sharing information on a
need-to-know basis.
Medical Research
Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
Researchers should not learn anything beyond 4 counts:

                   Adverse Reaction | No Adv. Reaction
Sequence Present          ?         |        ?
Sequence Absent           ?         |        ?

[Figure: Mayo Clinic, with DNA Sequences and Drug Reactions databases.]
Caveats
Schema Discovery & Heterogeneity
Multiple Queries
And of course…
[Figure: Mediator diagrams (query Q, result R) combined with Minimal Necessary sharing.]
Hybrids of Centralized, Federated, and Sovereign Architectures
Minimal Necessary Sharing
[Figure: R = {a, u, v, x}, S = {b, u, v, y}, $R \cap S$ = {u, v}.]
$R \cap S$: R must not know that S has b & y. S must not know that R has a & x.
Count ($R \cap S$): R & S do not learn anything except that the result is 2.
Problem Statement: Minimal Sharing
Given:
– Two parties (honest-but-curious): R (receiver) and S (sender)
– Query Q spanning the tables R and S
– Additional (pre-specified) categories of information I
Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
– For intersection, intersection size & equijoin, $I = \{ |R|, |S| \}$
– For equijoin size, I also includes the distribution of duplicates & some subset of information in $R \cap S$
A Possible Approach
Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].
Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days (Yao's protocol)
Intersection Protocol: Intuition
Want to encrypt the values in R and S and compare the encrypted values.
However, want an encryption function such that it can only be jointly computed by R and S, not separately.
Commutative Encryption
Commutative encryption F is a computable function $f : \text{Key}\,F \times \text{Dom}\,F \to \text{Dom}\,F$, satisfying:
– For all $e, e' \in \text{Key}\,F$, $f_e \circ f_{e'} = f_{e'} \circ f_e$
(The result of encryption with two different keys is the same, irrespective of the order of encryption)
– Each $f_e$ is a bijection.
(Two different values will have different encrypted values)
– The distribution of $\langle x, f_e(x), y, f_e(y) \rangle$ is indistinguishable from the distribution of $\langle x, f_e(x), y, z \rangle$; $x, y, z \in_r \text{Dom}\,F$ and $e \in_r \text{Key}\,F$.
(Given a value x and its encryption $f_e(x)$, for a new value y, we cannot distinguish between $f_e(y)$ and a random value z. Thus we cannot encrypt y nor decrypt $f_e(y)$.)
Example Commutative Encryption
$f_e(x) = x^e \bmod p$
where
– p: safe prime number, i.e., both p and q=(p-1)/2 are primes
– encryption key $e \in \{1, 2, \ldots, q-1\}$
– Dom F: all quadratic residues modulo p
Commutativity: powers commute
$(x^d \bmod p)^e \bmod p = x^{de} \bmod p = (x^e \bmod p)^d \bmod p$
Indistinguishability follows from Decisional Diffie-Hellman Hypothesis (DDH)
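The power function above can be tried out with a minimal sketch. The safe prime 2879, the names `encode`, `keygen` and `f`, and the use of squaring to map ids into the quadratic residues are assumptions for illustration only; the prime is far too small for real use.

```python
import random

# Toy parameters (assumption): a small safe prime, far too small for real use.
P = 2879            # P = 2*Q + 1
Q = (P - 1) // 2    # 1439, also prime


def encode(item_id: int) -> int:
    """Map an id in 1..Q to a quadratic residue mod P.

    Squaring is injective on 1..Q, so distinct ids get distinct residues.
    """
    if not 1 <= item_id <= Q:
        raise ValueError("id must lie in 1..Q for this toy encoding")
    return (item_id * item_id) % P


def keygen(rng: random.Random) -> int:
    """Draw a key e from {1, ..., Q-1}."""
    return rng.randrange(1, Q)


def f(key: int, x: int) -> int:
    """Commutative encryption f_key(x) = x^key mod P."""
    return pow(x, key, P)


rng = random.Random(0)
d, e = keygen(rng), keygen(rng)
x = encode(42)
# Encrypting with d then e gives the same value as e then d.
print(f(e, f(d, x)) == f(d, f(e, x)))  # prints True
```

Because q is prime, every key in {1, ..., q-1} is coprime to q, so each $f_e$ permutes the quadratic residues, which is the bijection property listed earlier.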
Intersection Protocol
[Figure: R and S hold secret keys r and s, respectively. S computes $f_s(S)$.]
$f_s(S)$ is shorthand for $\{ f_s(x) \mid x \in S \}$.
We apply $f_s$ on h(S), where h is a hash function, not directly on S.
Intersection Protocol
[Figure: S sends $f_s(S)$ to R. R computes $f_r(f_s(S))$, which equals $f_s(f_r(S))$ by the commutative property.]
Intersection Protocol
[Figure: R sends $f_r(R)$ to S. S returns $\langle y, f_s(y) \rangle$ for $y \in f_r(R)$. Since R knows $\langle x, y = f_r(x) \rangle$, R obtains $\langle x, f_s(f_r(x)) \rangle$ for $x \in R$ and compares these against $f_s(f_r(S))$.]
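The message flow of the intersection protocol can be simulated in one process. This is only a sketch under stated assumptions: both parties run inside a single function, the toy safe prime 2879 is used, and `h` squares small integer ids instead of applying a real hash function. The function name `intersect` is hypothetical.

```python
import random

P = 2879            # toy safe prime (assumption), P = 2*Q + 1
Q = (P - 1) // 2


def h(item_id: int) -> int:
    """Stand-in for the hash h: squares an id in 1..Q into the residues mod P."""
    if not 1 <= item_id <= Q:
        raise ValueError("id must lie in 1..Q for this toy encoding")
    return (item_id * item_id) % P


def f(key: int, x: int) -> int:
    """Commutative encryption f_key(x) = x^key mod P."""
    return pow(x, key, P)


def intersect(r_values: set, s_values: set, rng: random.Random) -> set:
    r = rng.randrange(1, Q)   # R's secret key
    s = rng.randrange(1, Q)   # S's secret key

    # S -> R: f_s(S), sorted so the order reveals nothing about S.
    fs_S = sorted(f(s, h(v)) for v in s_values)

    # R keeps <x, y = f_r(x)> and sends only the y values to S.
    y_of = {x: f(r, h(x)) for x in r_values}
    fr_R = sorted(y_of.values())

    # S -> R: <y, f_s(y)> for y in f_r(R).
    fs_of = {y: f(s, y) for y in fr_R}

    # R: f_r(f_s(S)) equals f_s(f_r(S)) by the commutative property.
    fsfr_S = {f(r, z) for z in fs_S}

    # R: x is in the intersection iff f_s(f_r(x)) appears in f_s(f_r(S)).
    return {x for x in r_values if fs_of[y_of[x]] in fsfr_S}


result = intersect({1, 20, 21, 24}, {2, 20, 21, 25}, random.Random(1))
print(sorted(result))  # prints [20, 21]
```

R sees only doubly encrypted values for the elements of S, so the values 2 and 25 stay hidden from R, and S sees only $f_r(R)$.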
Intersection Size Protocol
[Figure: R and S hold secret keys r and s. R sends $f_r(R)$ to S, and S sends $f_s(S)$ to R. R computes $f_r(f_s(S))$. S returns $f_s(f_r(R))$ alone, not $\langle y, f_s(y) \rangle$ for $y \in f_r(R)$.]
R cannot map $z \in f_r(f_s(R))$ back to $x \in R$.
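The size variant differs from the intersection protocol only in what S sends back. A minimal single-process sketch follows, with the same assumptions as before: toy safe prime 2879, integer ids squared in place of a real hash, and a hypothetical function name `intersection_size`.

```python
import random

P = 2879            # toy safe prime (assumption), P = 2*Q + 1
Q = (P - 1) // 2


def h(item_id: int) -> int:
    """Stand-in for the hash h: squares an id in 1..Q into the residues mod P."""
    if not 1 <= item_id <= Q:
        raise ValueError("id must lie in 1..Q for this toy encoding")
    return (item_id * item_id) % P


def f(key: int, x: int) -> int:
    """Commutative encryption f_key(x) = x^key mod P."""
    return pow(x, key, P)


def intersection_size(r_values: set, s_values: set, rng: random.Random) -> int:
    r = rng.randrange(1, Q)   # R's secret key
    s = rng.randrange(1, Q)   # S's secret key

    fr_R = sorted(f(r, h(v)) for v in r_values)   # R -> S
    fs_S = sorted(f(s, h(v)) for v in s_values)   # S -> R

    # S -> R: f_s(f_r(R)) as a reordered list, without the <y, f_s(y)> pairing,
    # so R cannot tell which z belongs to which of its own values.
    fsfr_R = sorted(f(s, y) for y in fr_R)

    # R: f_r(f_s(S)), then count the common doubly encrypted values.
    frfs_S = {f(r, z) for z in fs_S}
    return len(set(fsfr_R) & frfs_S)


print(intersection_size({1, 20, 21, 24}, {2, 20, 21, 25}, random.Random(1)))  # prints 2
```

Sorting the returned list is one simple way to discard the original order; any reordering that is independent of the input order would serve the same purpose.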
Equijoin Protocol: Intuition
R needs some extra information ext(v) for values $v \in R \cap S$.
– ext(v): information about the other attributes in S for those records where S.A = v
S has second secret key s'
For each value $v \in S$,
– S generates an encryption key $\kappa = f_{s'}(v)$, and
– encrypts ext(v) using encryption function K with key $\kappa$.
R learns $f_{s'}(v)$ only for $v \in R$.
– $f_r^{-1}(f_{s'}(f_r(v))) = f_r^{-1}(f_r(f_{s'}(v))) = f_{s'}(v)$
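The key-recovery identity can be checked numerically. The sketch below covers only that step, not the full equijoin protocol. Its assumptions: the toy safe prime 2879, integer ids squared in place of a hash, $f_r^{-1}$ realised as exponentiation by the inverse of r modulo q, and a hypothetical SHA-256 XOR keystream standing in for the encryption function K, which the slides do not specify. It needs Python 3.8 or later for `pow(key, -1, Q)`.

```python
import hashlib
import random

P = 2879            # toy safe prime (assumption), P = 2*Q + 1
Q = (P - 1) // 2


def h(item_id: int) -> int:
    """Stand-in for the hash h: squares an id in 1..Q into the residues mod P."""
    return (item_id * item_id) % P


def f(key: int, x: int) -> int:
    """Commutative encryption f_key(x) = x^key mod P."""
    return pow(x, key, P)


def f_inv(key: int, y: int) -> int:
    """Invert f_key on the quadratic residues: raise y to key^{-1} mod Q."""
    return pow(y, pow(key, -1, Q), P)


def K(kappa: int, data: bytes) -> bytes:
    """Hypothetical stand-in for K: XOR with a SHA-256 keystream derived from kappa."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(f"{kappa}:{counter}".encode()).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


rng = random.Random(2)
r = rng.randrange(1, Q)         # R's secret key
s_prime = rng.randrange(1, Q)   # S's second secret key s'

v = 20
ext_v = b"other attributes of S where S.A = 20"

# S: kappa = f_s'(v), then encrypt ext(v) under kappa.
kappa = f(s_prime, h(v))
ciphertext = K(kappa, ext_v)

# R -> S: y = f_r(v).  S -> R: f_s'(y).  R strips its own key with f_r^{-1}.
y = f(r, h(v))
recovered = f_inv(r, f(s_prime, y))

print(recovered == kappa)                 # prints True
print(K(recovered, ciphertext) == ext_v)  # prints True
```

S never sees v itself, only $f_r(v)$, and R can derive $\kappa$ only for values it already holds.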
Equi Join and Join Size
See Sigmod03 paper
Also gives the correctness proofs as well as the cost analysis of protocols
Related Work
[Naor & Pinkas 99]: Two protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of $n^2$ linear polynomials.
[Huberman et al 99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar
[Clifton et al, 2003]: Secure set union and set intersection
– Similar protocols
Summary and Challenges
New applications require us to go beyond traditional centralized and federated information integration: sovereign information integration
Need models of minimal disclosure and corresponding protocols for
– other database operations
– combination of operations
Need faster protocols
Need further study of tradeoff between efficiency and
– additional information disclosed
– approximation
Privacy Preserving Data Mining
[Figure: chart comparing the Original, Randomized, and Reconstructed distributions.]
[Figure: records such as 30 | 70K | ... and 50 | 40K | ... pass through a Randomizer (e.g. Alice's age 30 + 35 = 65), giving 65 | 20K | ... and 25 | 60K | ...; the distributions of Age and Salary are reconstructed and fed to Data Mining Algorithms, which produce a Data Mining Model.]
[Figure: Original, Randomized, and Reconstructed series plotted against Randomization Level.]
Insight: Preserve privacy at the individual level, while still building accurate data mining models at the aggregate level.
Add random noise to individual values to protect privacy.
EM algorithm to estimate original distribution of values given randomized values + randomization function.
Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
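The randomize-then-reconstruct idea above can be illustrated with a small sketch. It uses an iterative Bayesian update over age bins, in the spirit of the EM estimate mentioned above. The Gaussian noise, the bin midpoints, the synthetic age data, and the function names are all assumptions for illustration, not the method as published.

```python
import math
import random


def randomize(values, sigma, rng):
    """Add independent Gaussian noise N(0, sigma^2) to each individual value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def reconstruct(randomized, midpoints, sigma, iterations=50):
    """Estimate the probability of each bin from the randomized values.

    Starts from a uniform guess and repeatedly reweights it by how likely
    each observed value is to have come from each bin midpoint.
    """
    k = len(midpoints)
    probs = [1.0 / k] * k
    # Noise likelihood (up to a constant factor) of value w given midpoint m.
    like = [[math.exp(-((w - m) ** 2) / (2.0 * sigma * sigma)) for m in midpoints]
            for w in randomized]
    for _ in range(iterations):
        new = [0.0] * k
        for row in like:
            weights = [l * p for l, p in zip(row, probs)]
            total = sum(weights)
            for j, wgt in enumerate(weights):
                new[j] += wgt / total   # posterior probability of bin j
        probs = [x / len(randomized) for x in new]
    return probs


rng = random.Random(7)
midpoints = [25, 35, 45, 55, 65, 75]                 # hypothetical age bins
true_probs = [0.05, 0.30, 0.10, 0.10, 0.35, 0.10]    # synthetic distribution
ages = rng.choices(midpoints, weights=true_probs, k=2000)
noisy = randomize(ages, sigma=20.0, rng=rng)
estimate = reconstruct(noisy, midpoints, sigma=20.0)

for m, p_true, p_est in zip(midpoints, true_probs, estimate):
    print(f"age ~{m}: true {p_true:.2f}, reconstructed {p_est:.2f}")
```

Only the noisy values and the noise distribution enter `reconstruct`, which mirrors the setting described above: the individual values stay hidden while the aggregate distribution is estimated.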
Hippocratic Database
[Figure: architecture diagram with the components Privacy Policy, Data Collection, Queries, Privacy Metadata Creator, Store, Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info, Audit Trail, Query Intrusion Detector, Attribute Access Control, Privacy Metadata, Data Retention Manager, Record Access Control, Encryption Support, Data Collection Analyzer, and Other.]
#   Name      Age   Phone
1   Adams     10    111-1111
3   -         -     333-3333
4   Daniels   40    -
[Figure: Query Execution Time (seconds) versus Application Selectivity (0.01 to 1) for Original Queries and Rewritten Queries. Table Size: 10 million, no index.]
Vision: Database systems that take responsibility for the privacy of data they manage, while not impeding the flow of information.
Architectural principles derived from current privacy legislation.