Introduction to Searchable
Encryption
Prof. Ja-Ling Wu
Dept. CSIE & GINM
National Taiwan University
1
What kind of Cryptography?
• Privacy-Preserving Computation (PPC)
– Function (including inputs and outputs) does
not reveal private information :
– Medical records
– Financial Data
– Sensitive Data
– Data Owner
– One who is served
– …
2
MPC vs. PPC
• MPC (Secure Multi-Party Computation) is
general – it captures all applications.
• Regarding privacy, MPC aims for the following:
– A secure protocol must reveal no more information
than the output of the function itself
– That is, the process of protocol computation reveals
nothing.
• MPC does not deal with the question of
whether or not the function reveals much
information – which is the focus of Privacy-
Preserving Computation (PPC)
3
Privacy-Preserving Computation
[Figure: a client sends a search query to a data repository]
• Client wants to preserve search privacy → Private Information Retrieval
• Data repository is huge and valuable → Privacy-preserving data mining
• Data are encrypted → Search and/or computation over encrypted data
4
Untrusted Remote Storage
• Remote storage is ubiquitous :
– E-mail, backups,
– Department servers, Yahoo Mail, Gmail
– Cloud Storage (Amazon, iCloud, Dropbox, Google) ,
– Social Web-site (Facebook, Flickr, Youtube),…
5
Untrusted Remote Storage
• Google’s Search Across Computers feature – “In order to share your indexed files between your computers, we first copy this content to Google Desktop servers located at Google. This is necessary, for example, if one of your computers is turned off or otherwise offline when new or updated items are indexed on another of your machines. We store this data temporarily on Google Desktop servers and automatically delete older files, and your data is never accessible by anyone doing a Google search.”
• Do you trust this? (the PRISM project)
6
Searchable Encryption
• Store data externally
– encrypted
– want to search data easily
– avoid downloading everything then decrypt
– allow others to search data without having
access to plaintext
7
Searchable Encryption - Factors
• When searching, what must be protected?
– retrieved data
– search query
– search query outcome (was anything found?)
• Scenario
– single query vs multiple queries
– non-adaptive: series of queries, each independent of the others
– adaptive: form next query based on previous results
• # of participants
– single user (owner of data) can query data
– multiple users can query the data, possibly with access rights defined by the owner
8
Single User Symmetric Searchable
Encryption (SSE)
Non-Adaptive Adaptive
9
Search Over Encrypted Data
• Applications: Storage outsourcing, mail gateways, Google Desktop
(“search across computers”), …
• Untrusted servers and/or sensitive data → data has to be encrypted
• Encryption hides all information about the data → server cannot
search!
• Client must download the entire document collection to search it
10
Search Over Encrypted Data (cont’d)
• Searchable Symmetric Key Encryption where client
performs encryption before storing data
– Recall that public key algorithms might be too slow for
encrypting large data
• Secure index (SI): Auxiliary data structure that allows the
remote server to perform searches efficiently, while keeping
queries and data confidential
• Documents are encrypted; SI is encrypted — “two-layer”
searches performed using trapdoors.
11
Build Lists
Austin
Baltimore
Washington
Determine words in each D to create D(w)’s
Build linked lists, where D(wi) is the set of identifiers of
documents containing the word wi ordered in lexicographic order – Dictionary!
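The list-building step can be sketched in a few lines of Python. This is only an illustration: the three sample documents and the function name `build_posting_lists` are assumptions made for the example, and a plain sorted list stands in for the linked list on the slide.

```python
from collections import defaultdict

def build_posting_lists(docs):
    """Map each word w to D(w): the sorted identifiers of documents containing w."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Words are kept in lexicographic order (the "dictionary");
    # each D(w) is stored as a sorted list of document identifiers.
    return {w: sorted(ids) for w, ids in sorted(postings.items())}

# Hypothetical document collection using the words from the slide.
docs = {
    "d1": "Austin Baltimore",
    "d2": "Baltimore Washington",
    "d3": "Austin Washington Baltimore",
}
lists = build_posting_lists(docs)
```

Here `lists["baltimore"]` holds the identifiers of all three documents, which is the D(w) that the later steps encrypt and scramble.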
12
Create Lists
Austin
Baltimore
Washington
Encrypt linked lists: establish keys, pointers, encrypt
13
Build Index Table
Austin
Baltimore
Washington
f( )
f( )
f( )
g( )
g( )
g( )
Build lookup table T to manage the keys for the
first keywords
14
Create Array
Merge, scramble linked lists to form A
15
Query
Baltimore
16
Extensions
•Can I share my document collection?
•Malicious servers
•Updates
17
Multi-User SSE Encryption
18
Multi-User SSE (cont’d)
• Similar security notions to single-user SSE’s
– Secure indexes and trapdoors
• Revocation: owner can revoke searching
privileges
– Robust against user collusions
• Anonymity: server should not know who initiated
search
• Secure Buyer-Seller Protocol might be useful!
EXAMPLES OF SEARCHING OVER
ENCRYPTED DATA
• Reza Curtmola, Juan Garay, Seny Kamara, and Rafail Ostrovsky, “Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions,” Proceedings of the Conference on Computer and Communications Security (CCS ’06).
• Ning Cao, Cong Wang, Ming Li, Kui Ren, and Wenjing Lou, “Privacy-Preserving Multi-keyword Ranked Search over Encrypted Cloud Data,” INFOCOM ’11, Shanghai, China, April 10-15, 2011.
19
Cloud service is convenient!
20
But…
21
Two approaches
22
• Cryptography-based approach: secure
• Other approach: efficient
Threat model
• Cloud server is considered as “honest-but-curious”.
• honest---server follows the designed protocol
• curious---server may do extra inference or analysis
• Known ciphertext model:
• Server knows the encrypted data and the searchable index only
• (both of which are outsourced from the data owner)
• Known background model:
• Server possesses more information, such as the correlation
among given search requests or statistical information about
the data set
23
Privacy description
• Data privacy: traditional cryptography is used to encrypt the data
• Index privacy: prevent the server from learning anything from the search indexes
• Search privacy:
• Keyword privacy: the database is in plaintext form while the query
is encrypted; challenge: statistical (frequency) analysis attacks
• Trapdoor privacy: the server can NOT generate a valid trapdoor
function (one-way function used for encryption/decryption).
• Search pattern privacy: are two queries about the same
keywords? (reduces the security space that needs to be analyzed)
• Access pattern privacy: do not reveal the sequence of search
results to the server
24
Traditional Approach:
• Client (X) and Cloud Server (Y) are connected over the Internet
• Encrypt(X) is sent over the Internet and stored in Y
• If a search is issued, there are two possible approaches:
• (1) Download Encrypt(X) from Y, decrypt it back to X, and then search at the
client: huge computation and storage loads on the client
• (2) Send the decryption key to Y, decrypt Encrypt(X) at Y, search at Y, and send the result back to the client: both the Internet and the service provider (server) learn your secret key
25
Searchable Symmetric Encryption(SSE)
• In 2006, Searchable Symmetric Encryption, with complete definitions and efficient constructions, was proposed by Curtmola et al.
• Curtmola’s search scheme runs in constant time (O(1)) per returned document, i.e., in time proportional to |D(w)|.
• The construction is composed of a combination of a look-up table (T) and an array (A).
A : stores the set D(w) in Encrypted form for each word w
T : stores information that enables one to locate and decrypt the appropriate elements from A, for each w
26
Searchable Symmetric Encryption (SSE)
• more efficient search schemes can be achieved by
revealing certain amount of information--- tradeoffs
between search efficiency and amount of revealed
information!
• Strict Security definition
• A secure SSE scheme should not leak anything beyond the
outcome of a search
• Revisited security definition
• A secure SSE scheme should not leak anything beyond the
outcome and the pattern of a search
27
SSE needs 4 Algorithms
• Keygen : run by user, probabilistic
• BuildIndex : run by user, possibly probabilistic
• function of (K, D); outputs the index I
• Trapdoor : run by user, probabilistic
• generates a trapdoor Tw for a given word w from (K, w)
• Search : run by server, searches for the documents in D that contain w
• takes an index I for a collection D and a trapdoor Tw for the word w as input, and returns D(w), the set of identifiers of documents containing w
• Each of the above is a polynomial-time algorithm!
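To make the four-algorithm interface concrete, here is a minimal sketch in Python. It is a deliberately simplified stand-in, not the construction described in these slides: the index is a plain dictionary keyed by HMAC-SHA256 values, and the trapdoor is deterministic (so repeated queries for the same word are visible to the server). All function names and the sample documents are assumptions made for the example.

```python
import hashlib
import hmac
import secrets

def keygen(k=32):
    """Keygen: return a random k-byte secret key."""
    return secrets.token_bytes(k)

def trapdoor(key, word):
    """Trapdoor: keyed hash of the word; only the key holder can compute it."""
    return hmac.new(key, word.encode(), hashlib.sha256).digest()

def build_index(key, docs):
    """BuildIndex: map each word's trapdoor to D(w), the sorted document identifiers."""
    index = {}
    for doc_id, words in docs.items():
        for w in set(words):
            index.setdefault(trapdoor(key, w), []).append(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def search(index, tw):
    """Search: run by the server, which sees the trapdoor but never the word."""
    return index.get(tw, [])

# Hypothetical two-document collection.
K = keygen()
I = build_index(K, {"d1": ["where", "what"], "d2": ["what", "why"]})
result = search(I, trapdoor(K, "what"))
```

A trapdoor made with a different key finds nothing, which is the property that keeps the server from issuing searches on its own.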
28
Searchable Symmetric Encryption(SSE)
• Keygen(k)
• K = (s, y, z)
• BuildIndex(K, D)
• D(wi) is the set of identifiers of documents containing the word wi
ordered in lexicographic order – Dictionary!
• Assume the dictionary of the document collection includes three
different words: ‘Where’, ‘What’, ‘Why’. Three linked lists are constructed, one per word!
• Each node holds a document identifier; each list ends with NULL.
29
Searchable Symmetric Encryption(SSE)
• Each node generates a key used to encrypt the next node and saves the
index of the next element in the list to the array A.
• This index is generated using a pseudo-random function (hash) and takes
ctr + 1 as its input. (initial ctr = 0 )
30
Searchable Symmetric Encryption(SSE)
• Linked lists will be ‘scrambled’ into array A; each node is the tuple 〈id(Dij ), keyij , address(ctr + 1)〉, and the last node of a list points to NULL.
• This tuple is encrypted and then inserted into the array A at index
address(ctr), before ctr is incremented.
• If the array A contains any empty cells, they should be set to random
(padding) values, generated using a pseudo-random function.
• The key for encrypting the first element will be located in
the Look-up table.
31
Searchable Symmetric Encryption(SSE)
• Look-up table entry: T[Fz(wi)] = 〈address(A[N0]), key〉 ⊕ fy(wi)
• Trapdoor(w)
• Tw = ( Fz(wi) , fy(wi) )
• Search(I, Tw)
• value = T[Fz(wi)]
• 〈address(A[N0]), key〉 = value ⊕ fy(wi)
• decrypt 〈id(D), key , address(ctr + 1)〉 for every node, following the addresses until NULL
• return id(D(wi))
32
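The look-up table T and the array A can be put together in a short Python sketch. It follows the steps on the slides but makes several simplifying assumptions that are not in the source: HMAC-SHA256 stands in for the pseudo-random functions, a keyed shuffle stands in for the pseudo-random permutation that produces addresses, nodes are encrypted by XOR with an HMAC-derived pad, document identifiers are integers, every D(w) is non-empty, and padding of empty cells is omitted.

```python
import hashlib
import hmac
import secrets
import struct

NULL = 0xFFFFFFFF  # address value marking the end of a list

def prf(key, data, n=32):
    """HMAC-SHA256 used as a stand-in pseudo-random function (n <= 32 bytes)."""
    return hmac.new(key, data, hashlib.sha256).digest()[:n]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def keygen():
    """K = (s, y, z): s scrambles addresses, y masks table entries, z indexes T."""
    return tuple(secrets.token_bytes(16) for _ in range(3))

def build_index(K, postings):
    """postings maps word -> non-empty list of integer document ids, i.e. D(w)."""
    s, y, z = K
    total = sum(len(ids) for ids in postings.values())
    # Keyed shuffle of 0..total-1: address[ctr] is where the ctr-th node is stored.
    address = sorted(range(total), key=lambda c: prf(s, struct.pack(">I", c)))
    A, T, ctr = [None] * total, {}, 0
    for w, ids in postings.items():
        key = secrets.token_bytes(16)  # key for the first node of this list
        first = struct.pack(">I16s", address[ctr], key)
        T[prf(z, w.encode())] = xor(first, prf(y, w.encode(), 20))
        for j, doc_id in enumerate(ids):
            next_key = secrets.token_bytes(16)
            next_addr = address[ctr + 1] if j + 1 < len(ids) else NULL
            node = struct.pack(">I16sI", doc_id, next_key, next_addr)
            A[address[ctr]] = xor(node, prf(key, b"node", 24))
            key, ctr = next_key, ctr + 1
    return A, T

def trapdoor(K, w):
    """Tw = (Fz(w), fy(w)): the table position and the mask for its entry."""
    _, y, z = K
    return prf(z, w.encode()), prf(y, w.encode(), 20)

def search(index, Tw):
    """Unmask the table entry, then walk the encrypted list through A."""
    A, T = index
    label, mask = Tw
    if label not in T:
        return []
    addr, key = struct.unpack(">I16s", xor(T[label], mask))
    found = []
    while addr != NULL:
        node = xor(A[addr], prf(key, b"node", 24))
        doc_id, key, addr = struct.unpack(">I16sI", node)
        found.append(doc_id)
    return found

# Hypothetical posting lists for the three-word dictionary.
K = keygen()
index = build_index(K, {"where": [1, 3], "what": [2], "why": [1, 2, 3]})
hits = search(index, trapdoor(K, "why"))
```

The server does one table look-up and then one decryption per matching document, so the work grows only with the size of D(w).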
However, the system is…
• Not suitable for large scale Cloud data utilization system.
• We need more!
How to do Multi-user Symmetric Searchable Encryption?
6 polynomial-time algorithms are required:
M-Keygen
M-BuildIndex
AddUser
RevokeUser
M-Trapdoor
M-Search
33
What would we need in the system?
34
• Ranked search : Eliminates
unnecessary network traffic – fit for
“ pay-as-you-use” Cloud paradigm
• Multi-keyword search: improves search result accuracy and
enhances user search experience
How to design an efficient encrypted
data search mechanism that
supports multi-keyword semantics
without privacy breaches still
remains a challenging open problem!
So, the problem is…
• Multi-keyword Ranked Search over Encrypted Cloud Data (MRSE) :
Coordinate Matching Principle--- as many matches as possible!
• Documents and queries are described as binary vectors. Each bit in the vector indicates whether the corresponding keyword is contained in the document or the query!
• Use “inner product similarity” to evaluate the similarity
between documents in the database and the query. Basic idea: Secure Inner Product Computation/ Secure K-nearest neighbor (KNN) technique
35
[Figure: a document vector 1001010…… and a query vector 1001000…… are compared and found to be similar]
System framework
• Basic Architecture:
36
System framework
37
[Figure: documents are stored in the database as cipher and index; keywords reach the database as a trapdoor]
We need 4 algorithms again!
• Keygen: taking a security parameter k as input, data owner outputs a
symmetric key as SK
• BuildIndex: Based on the dataset F, data owner builds a searchable
index I, which is encrypted by SK and then outsourced to Cloud server. After
the index construction, the document collection can be independently
encrypted and outsourced.
• Trapdoor: With t keywords of interest in 𝑊 as input, a corresponding
trapdoor (token), 𝑇𝑊 , is generated.
• Search: When Cloud server receives a query request as (𝑇𝑊 ,L), it
performs the ranked search on the index I with the help of trapdoor 𝑇𝑊 and
finally returns 𝐹𝑊 , the ranked ID list of top-L documents sorted by their
similarity with 𝑊 .
38
System framework
39
[Figure: system framework with labels BuildIndex (𝐹𝑖 → cipher 𝐶𝑖 and index 𝐼𝑖, stored in the database), Trapdoor (keywords 𝑊 → 𝑇𝑊), and Search (𝐶𝑊, 𝐹𝑊)]
Basic idea
• 𝐷𝑖: binary data vector for document 𝐹𝑖
• Each bit represents the existence of corresponding
keyword in the document.
• Q: binary query vector, built the same way as the data vector.
• 𝐷𝑖‧Q (inner product) is the similarity score of document i
to the query.
• In this way, we can achieve “multi-keyword” and
“ranked” search.
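Before any encryption is added, coordinate matching is just an inner product followed by a sort. The sketch below shows this plaintext baseline; the three document vectors, the identifiers F1 to F3, and the function names are made up for the example.

```python
def score(doc_vec, query_vec):
    """Inner product of two binary vectors = number of query keywords in the document."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))

def ranked_search(doc_vecs, query_vec, top_l):
    """Return the identifiers of the top-L documents by similarity score."""
    ranked = sorted(doc_vecs, key=lambda i: score(doc_vecs[i], query_vec), reverse=True)
    return ranked[:top_l]

# Hypothetical collection over a 7-keyword dictionary.
docs = {
    "F1": [1, 0, 0, 1, 0, 1, 0],
    "F2": [0, 1, 0, 0, 0, 0, 1],
    "F3": [1, 0, 0, 0, 0, 0, 0],
}
query = [1, 0, 0, 1, 0, 0, 0]
top = ranked_search(docs, query, top_l=2)
```

F1 matches both query keywords and F3 matches one, so they are returned in that order; the remaining slides are about letting the server compute the same ranking without seeing the vectors.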
40
[Figure: document vector 1001010…… and query vector 1001000……; keyword 1 exists in both document and query]
Recall---Definition of Privacy
• Keyword Privacy:
• hide what the user is searching for, i.e., the keywords indicated by the corresponding trapdoor.
• Trapdoor Privacy (Unlinkability):
• the cloud server should not be able to link trapdoors to one another, e.g., to determine whether two trapdoors were formed by the same search request.
• Access Pattern Privacy: the sequence of search results
• the proposed scheme is not designed to protect the access pattern, for efficiency reasons
41
MRSE_I Scheme
• Keygen
• BuildIndex
42
[Figure: the secret key consists of a (d+2)-bit vector S (e.g. 1001000……101) and two (d+2) × (d+2) matrices $M_1$, $M_2$. The d-bit document vector (e.g. 1001010……10) is extended with $\varepsilon_i$ and 1 to form the (d+2)-dimensional $D_i$, which S splits into $D_i'$ and $D_i''$. Example: S[0] = 1, so $D_i'[0] + D_i''[0] = D_i[0]$ (e.g. -8 and 9); S[1] = 0, so $D_i'[1] = D_i''[1] = D_i[1]$ (e.g. 0 and 0). The index is $I_i = \{M_1^T D_i', M_2^T D_i''\}$.]
MRSE_I Scheme
• Keygen: generate secret key $\{S, M_1, M_2\}$
• S: (d+2)-bit vector, where d is the number of keywords
• $M_1$, $M_2$: (d+2) × (d+2) invertible matrices
• BuildIndex: generate index $I_i$ for document i
• Extend $D_i$ to $(D_i, \varepsilon_i, 1)$, where $\varepsilon_i$ is a random number (dummy keyword)
• Split $D_i$ into two vectors $D_i'$ and $D_i''$:
if S[j] = 0, then $D_i'[j] = D_i''[j] = D_i[j]$
if S[j] = 1, then let $D_i'[j] + D_i''[j] = D_i[j]$
• $I_i = \{M_1^T D_i', M_2^T D_i''\}$: a subindex is built for every encrypted document $C_i$
43
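The splitting step is easier to trust once it is seen to preserve the inner product. The Python sketch below splits a data vector by the rule above and a query vector by the opposite rule (copy where S[j] = 1, random shares where S[j] = 0, as in the scheme's Trapdoor step). The vectors, the share range, and the function names are assumptions chosen for the example, and the matrices $M_1$, $M_2$ are left out on purpose.

```python
import random

def split_data(D, S):
    """Split a data vector: copy where S[j] == 0, random additive shares where S[j] == 1."""
    D1, D2 = [], []
    for d, s in zip(D, S):
        if s == 1:
            share = random.randint(-10, 10)
            D1.append(share)
            D2.append(d - share)
        else:
            D1.append(d)
            D2.append(d)
    return D1, D2

def split_query(Q, S):
    """Split a query vector with the opposite rule: shares where S[j] == 0."""
    Q1, Q2 = [], []
    for q, s in zip(Q, S):
        if s == 0:
            share = random.randint(-10, 10)
            Q1.append(share)
            Q2.append(q - share)
        else:
            Q1.append(q)
            Q2.append(q)
    return Q1, Q2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 5-entry vectors; S is the secret splitting indicator.
S = [1, 0, 0, 1, 0]
D = [1, 0, 0, 1, 1]
Q = [1, 0, 0, 0, 1]
D1, D2 = split_data(D, S)
Q1, Q2 = split_query(Q, S)
recombined = dot(D1, Q1) + dot(D2, Q2)
```

At every position one side is copied and the other is shared, so `recombined` always equals `dot(D, Q)` even though the individual halves look random.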
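MRSE_I Scheme
• Trapdoor
44
[Figure: the d-bit query vector is scaled by r (e.g. r00r0r0……r0) and extended with r and t to form the (d+2)-dimensional Q, which S (e.g. 1001000……101) splits into $Q'$ and $Q''$. Example: S[0] = 1, so $Q'[0] = Q''[0] = Q[0]$ (e.g. r and r); S[1] = 0, so $Q'[1] + Q''[1] = Q[1]$ (e.g. -2 and 2). The trapdoor is $T_W = \{M_1^{-1} Q', M_2^{-1} Q''\}$.]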
MRSE_I Scheme
• Search
46
$I_i = \{M_1^T D_i', M_2^T D_i''\}$ (on server)
$T_W = \{M_1^{-1} Q', M_2^{-1} Q''\}$ (client upload)
$I_i \cdot T_W = M_1^T D_i' \cdot M_1^{-1} Q' + M_2^T D_i'' \cdot M_2^{-1} Q''$
$= D_i' \cdot Q' + D_i'' \cdot Q''$
$= D_i \cdot Q$ (extended vectors: for each j, either $D_i'[j] = D_i''[j] = D_i[j]$ and $Q'[j] + Q''[j] = Q[j]$, or the reverse)
$= (D_i, \varepsilon_i, 1) \cdot (rQ, r, t)$
$= r(D_i \cdot Q + \varepsilon_i) + t$
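MRSE_I Scheme
• Trapdoor
• Extend Q to $(rQ, r, t)$, where r and t are random numbers
• Split Q into two vectors $Q'$ and $Q''$:
if S[j] = 1, then $Q'[j] = Q''[j] = Q[j]$
if S[j] = 0, then let $Q'[j] + Q''[j] = Q[j]$
• $T_W = \{M_1^{-1} Q', M_2^{-1} Q''\}$
• Search: the server computes the score $I_i \cdot T_W$ for each index $I_i$
47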
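The whole MRSE_I pipeline fits in a short, dependency-free Python sketch. Exact fractions are used so the result can be checked without rounding error. Everything about the packaging is an assumption made for the example: the function names, the small integer ranges for the random matrices and shares, the fixed values of ε, r, and t, and the fact that the key tuple also caches the matrix inverses for convenience.

```python
import random
from fractions import Fraction

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; raises ValueError if M is singular."""
    n = len(M)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        pivot = next((r for r in range(c, n) if aug[r][c] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [x / aug[c][c] for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                aug[r] = [x - aug[r][c] * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def random_invertible(n):
    """Draw random integer matrices until one is invertible; return it with its inverse."""
    while True:
        M = [[random.randint(-9, 9) for _ in range(n)] for _ in range(n)]
        try:
            return M, inverse(M)
        except ValueError:
            continue

def split(v, S, share_when):
    """Random additive shares where S[j] == share_when, plain copies elsewhere."""
    a, b = [], []
    for x, s in zip(v, S):
        u = Fraction(random.randint(-9, 9)) if s == share_when else x
        a.append(u)
        b.append(x - u if s == share_when else x)
    return a, b

def keygen(d):
    """Secret key {S, M1, M2} for d keywords (inverses cached alongside)."""
    n = d + 2
    S = [random.randint(0, 1) for _ in range(n)]
    (M1, M1inv), (M2, M2inv) = random_invertible(n), random_invertible(n)
    return S, M1, M2, M1inv, M2inv

def build_index(key, D, eps):
    """I_i = {M1^T D', M2^T D''} for the extended vector (D, eps, 1)."""
    S, M1, M2, _, _ = key
    D1, D2 = split(list(D) + [eps, 1], S, share_when=1)
    return mat_vec(transpose(M1), D1), mat_vec(transpose(M2), D2)

def trapdoor(key, Q, r, t):
    """T_W = {M1^-1 Q', M2^-1 Q''} for the extended vector (rQ, r, t)."""
    S, _, _, M1inv, M2inv = key
    Q1, Q2 = split([r * q for q in Q] + [r, t], S, share_when=0)
    return mat_vec(M1inv, Q1), mat_vec(M2inv, Q2)

def search_score(index, Tw):
    """Server side: I_i . T_W, computed without knowing the key."""
    return sum(a * b for a, b in zip(index[0], Tw[0])) + \
           sum(a * b for a, b in zip(index[1], Tw[1]))

key = keygen(d=5)
D, Q = [1, 0, 0, 1, 1], [1, 0, 0, 0, 1]
eps, r, t = Fraction(1, 10), 3, 7
score = search_score(build_index(key, D, eps), trapdoor(key, Q, r, t))
```

With these values the server-side score equals r(D·Q + ε) + t, although the server only ever handles the transformed vectors.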
Properties
• $r(D_i \cdot Q + \varepsilon_i) + t$ is “nearly” a linear function of $D_i$
• $\varepsilon_i$ follows a normal distribution
• A larger variance of $\varepsilon_i$ may decrease the precision, but increases the
security.
Why Privacy is Preserved?
• With the randomness introduced by the splitting process
and the random numbers r and t, the proposed scheme
can generate two totally different trapdoors for the same
query 𝑊 . This nondeterministic trapdoor generation
guarantees trapdoor unlinkability, which remains an unsolved
privacy leakage problem in related symmetric-key-based
searchable encryption schemes because their trapdoor
generation is deterministic.
• With properly selected parameters, even the final score
results can be obfuscated very well, preventing the cloud
server from learning the relationships between given trapdoors
and the corresponding keywords.
49
MRSE_II Scheme
• Keygen: extend S, $M_1$, $M_2$ to (d+U+1) dimensions
• BuildIndex: insert U random numbers (dummy keywords) instead of the single random
number in MRSE_I.
• Trapdoor: randomly select V out of the U dummy entries in Q, and
set these positions to 1.
• Search: the similarity score is $r(D_i \cdot Q + \sum \varepsilon_i^{(v)}) + t$, summed over the V selected dummy entries
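The MRSE_II score can be checked on plaintext vectors before any splitting or matrix multiplication is applied. In the sketch below the dummy values, U = 4, V = 2, r, and t are all made up, and the function names are hypothetical. It reads the score as r times the true inner product plus the dummy values at the V selected positions, plus t, which is an interpretation of the formula above.

```python
import random

def extend_data(D, dummies):
    """Data vector followed by U dummy values and a final constant 1 (d + U + 1 entries)."""
    return list(D) + list(dummies) + [1]

def extend_query(Q, U, V, r, t, rng=random):
    """Query with V of the U dummy positions set to 1, all scaled by r, then t appended."""
    chosen = set(rng.sample(range(U), V))
    flags = [1 if j in chosen else 0 for j in range(U)]
    return [r * q for q in Q] + [r * f for f in flags] + [t], chosen

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

D, Q = [1, 0, 0, 1, 1], [1, 0, 0, 0, 1]
dummies = [0.3, -0.1, 0.2, -0.4]  # U = 4 random numbers for this document
r, t = 3.0, 7.0
Q_ext, chosen = extend_query(Q, U=4, V=2, r=r, t=t)
score = dot(extend_data(D, dummies), Q_ext)
expected = r * (dot(D, Q) + sum(dummies[j] for j in chosen)) + t
```

Because each query picks a different subset of dummy positions, the same document and keywords can produce different scores from one search to the next.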
50
Conclusion
• To meet the challenging “encrypted data search” problem,
two approaches – SSE and MRSE – have been introduced.
• What’s next:
• multi-keyword semantics over encrypted data.
• Extend the idea to multimedia.
Important references:
• W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “Secure kNN Computation on Encrypted Databases,” SIGMOD International Conference on Management of Data, 2009, pp. 139-152.
• Privacy Preserving Search on Multimedia (Prof. Min Wu’s work)
51
52
Fig. 1: System model for secure image retrieval