Post on 20-Dec-2015

TRANSCRIPT

Sketching in Adversarial Environments
Or: Sublinearity and Cryptography

1

Moni Naor

Joint work with: Ilya Mironov and Gil Segev

2

Comparing Streams
How to compare data streams without storing them?
Step 1: Compress data on-line into sketches
Step 2: Interact using only the sketches
Goal: Minimize sketch size, update time, and communication

3

Comparing Streams
Real-life applications: massive data sets, on-line data, ...
Highly efficient solutions exist assuming shared randomness
How to compare data streams that cannot be stored?

4

Comparing Streams
How to compare data streams that cannot be stored?
Is shared randomness a reasonable assumption?
No guarantees when it is set adversarially: inputs may be adversarially chosen depending on the randomness
Example application: plagiarism detection

5

The Adversarial Sketch Model
Combines two settings:
Communication complexity, with "adversarial" factors: no secrets, adversarially-chosen inputs
Massive data sets: sketching, streaming

6

The Adversarial Sketch Model

The Adversarial Sketch Model
Goal: Compute f(A,B)
Sketch phase:
An adversary chooses the inputs of the parties, provided as on-line sequences of insert and delete operations
No shared secrets: the parties are not allowed to communicate, and any public information is known to the adversary in advance
The adversary is computationally all-powerful
Requirements: small sketches, fast updates
Interaction phase:
Requirements: low communication & computation

7

Our Results
Equality testing: A, B ⊆ [N] of size at most K, error probability ε
If we had public randomness… sketches of size O(log(1/ε)), with similar update time, communication, and computation
Lower Bound: equality testing in the adversarial sketch model requires sketches of size Ω((K·log(N/K))^1/2)

8

Our Results
Equality testing: A, B ⊆ [N] of size at most K, error probability ε
Lower Bound: equality testing in the adversarial sketch model requires sketches of size Ω((K·log(N/K))^1/2)
Upper Bound: explicit and efficient protocol with sketches of size O((K·polylog(N)·log(1/ε))^1/2), and update time, communication and computation polylog(N)

9

Our Results: Symmetric Difference Approximation
A, B ⊆ [N] of size at most K; goal: approximate |A Δ B| with error probability ε
Upper Bound: (1 + ρ)-approximation for any constant ρ, with sketches of size O((K·polylog(N)·log(1/ε))^1/2) and update time, communication and computation polylog(N)
Explicit construction: polylog(N)-approximation

10

Outline
Lower bound
Equality testing
  Main tool: incremental encoding
  Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems

11

Simultaneous Messages Model
Alice holds x and Bob holds y; each sends a single message to a referee, who must output f(x,y)

12

Simultaneous Messages Model
Lower Bound [NS96, BK97]: equality testing in the private-coin SM model requires communication Ω((K·log(N/K))^1/2)
The messages in the SM model play the role of sketches in the adversarial sketch model, so the lower bound carries over

13

Outline
Lower bound
Equality testing
  Main tool: incremental encoding
  Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems

14

Simultaneous Equality Testing
Each party encodes its input, as C(x) and C(y), into a codeword of length K, viewed as a K^1/2 × K^1/2 matrix
Communication K^1/2

15

First Attempt
Alice sends a random row of C(A) (e.g., row 3), Bob sends a random column of C(B) (e.g., col 2), and the referee compares them at the crossing entry C(B)_{3,2}
Sketches of size K^1/2
Problem: update time K^1/2
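As an illustration only (not the talk's protocol), the row/column idea can be sketched in Python: here a hash-derived "random code" stands in for a proper error-correcting code, and `entry`, `referee`, and `equal` are hypothetical names.

```python
# Illustration of simultaneous equality testing with K^(1/2) communication:
# each input is encoded as a sqrt(K) x sqrt(K) matrix of symbols; Alice sends
# one row, Bob sends one column, and the referee compares the crossing entry.
# A hash-derived random code replaces a real error-correcting code here.
import hashlib
import random

def entry(x, i, j):
    # One byte of a hash plays the role of codeword entry (i, j).
    return hashlib.sha256(f"{x}|{i}|{j}".encode()).digest()[0]

def referee(x, y, side, rng):
    """One round: accept iff Alice's row and Bob's column agree where they cross."""
    r, c = rng.randrange(side), rng.randrange(side)
    alice_row = [entry(x, r, j) for j in range(side)]  # Alice's K^(1/2)-size message
    bob_col = [entry(y, i, c) for i in range(side)]    # Bob's K^(1/2)-size message
    return alice_row[c] == bob_col[r]

def equal(x, y, side=8, rounds=64, seed=0):
    rng = random.Random(seed)
    return all(referee(x, y, side, rng) for _ in range(rounds))

print(equal("stream-A", "stream-A"))  # True: equal inputs are always accepted
```

For distinct inputs, each round catches a disagreement with probability equal to the fraction of differing entries, so repeating drives the error down exponentially.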

16

Incrementality vs. Distance
High distance: for every distinct A, B ⊆ [N] of size at most K, d(C(A), C(B)) > 1 - ε
Incrementality: given C(S) and x ∈ [N], the encodings of S ∪ {x} and S \ {x} are obtained by modifying very few (constant or logarithmic) entries
Impossible to achieve both properties simultaneously with the Hamming distance

17

Incremental Encoding
Encode a set S as a sequence of r codewords: S ↦ (C(S)_1, ..., C(S)_r)
Distance: d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)), where d_H is the normalized Hamming distance
r = 1: Hamming distance
Hope: larger r will enable fast updates
r corresponds to the communication complexity of our protocol, so we want to keep r as small as possible
Explicit construction with r = log K: codeword size K·polylog(N), update time polylog(N)
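The combined distance is easy to compute directly; a minimal sketch, assuming codewords given as equal-length strings:

```python
# Combined distance of an incremental encoding:
# d(C(A), C(B)) = 1 - prod_{i=1}^{r} (1 - dH_i), where dH_i is the
# normalized Hamming distance between the i-th pair of codewords.

def normalized_hamming(u, v):
    """Fraction of entries on which u and v differ."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v)) / len(u)

def combined_distance(codewords_a, codewords_b):
    """1 - prod(1 - dH(C(A)_i, C(B)_i)) over the r codeword pairs."""
    prod = 1.0
    for ca, cb in zip(codewords_a, codewords_b):
        prod *= 1.0 - normalized_hamming(ca, cb)
    return 1.0 - prod

# r = 1 recovers the plain normalized Hamming distance:
print(combined_distance(["0101"], ["0111"]))  # 0.25
```

Note that a single pair at normalized distance close to 1 already pushes the combined distance close to 1, which is exactly what the bounded-neighbor-disperser construction later guarantees for one level i.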

18

Equality Protocol
Alice sends a random row of each C(A)_i (e.g., rows (3,1,1)); Bob sends a random column of each C(B)_i together with its values (e.g., cols (2,3,1))
The referee accepts iff the rows and columns agree on all crossing entries
Error probability: 1 - d(C(A), C(B)) = ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) < ε

19

The Encoding
Global encoding: map each element to several entries of each codeword; exploit "random-looking" graphs
Local encoding: resolve collisions separately in each entry; a simple solution when |A Δ B| is guaranteed to be small

20

The Local Encoding
Suppose that |A Δ B| ≤ ℓ

21

Missing Number Puzzle
Let S = {1,...,N} \ {i}, and let π be a random permutation of S, presented as a one-way stream π(1), ..., π(N-1)
One number i is missing
Goal: determine the missing number i using O(log N) bits
What if there are ℓ missing numbers? Can it be done using O(ℓ·log N) bits?
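The single-missing-number case has a classic O(log N)-bit solution: keep a running sum and subtract it from 1 + ... + N.

```python
# Classic missing-number trick: the counter holds only the running sum of the
# stream, which fits in O(log N) bits; the missing number is the total sum
# 1 + 2 + ... + N minus the observed sum.

def missing_number(stream, n):
    return n * (n + 1) // 2 - sum(stream)

# Example: N = 10 with 7 missing.
stream = [x for x in range(1, 11) if x != 7]
print(missing_number(stream, 10))  # 7
```

For ℓ missing numbers, tracking the first ℓ power sums modulo a prime (or, equivalently, the linear sketch on the next slide) recovers all of them with O(ℓ·log N) bits.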

22

The Local Encoding
Suppose that |A Δ B| ≤ ℓ
A simple & well-known solution: associate each x ∈ [N] with a vector v(x) such that for any distinct x_1, ..., x_ℓ the vectors v(x_1), ..., v(x_ℓ) are linearly independent; for example v(x) = (1, x, ..., x^(ℓ-1))
C(S) = Σ_{x ∈ S} v(x)
If 1 ≤ |A Δ B| ≤ ℓ then C(A) ≠ C(B)
Size & update time O(ℓ·log N), independent of the size of the sets
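This local encoding can be sketched directly, assuming arithmetic over a prime field large enough for the universe (the prime `P` and helper names below are illustrative):

```python
# Local encoding sketch: associate x with v(x) = (1, x, ..., x^(l-1)) over
# GF(P) and keep C(S) = sum of v(x) over x in S (mod P). Any l Vandermonde
# vectors are linearly independent, so if 1 <= |A delta B| <= l the two
# encodings differ. Insert/delete just adds/subtracts v(x): O(l log N) size
# and update time, independent of |S|.

P = 2**31 - 1  # a prime larger than the universe size N

def v(x, ell):
    return [pow(x, k, P) for k in range(ell)]

def update(sketch, x, sign):
    """Insert (sign=+1) or delete (sign=-1) element x."""
    return [(s + sign * vk) % P for s, vk in zip(sketch, v(x, len(sketch)))]

ell = 4
a = b = [0] * ell
for x in [3, 17, 99]:
    a = update(a, x, +1)
for x in [3, 17, 42]:
    b = update(b, x, +1)

print(a != b)  # True: |A delta B| = 2 <= l, so the encodings differ
```

Because updates are commutative additions, the sketch is oblivious to the order of the stream, which is what the incremental-encoding framework needs.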

23

The Global Encoding
Each element of the universe [N] is mapped into several entries of each codeword (C_1, C_2, C_3, ...)
The content of each entry is locally encoded

24

The Global Encoding
Each element is mapped into several entries of each codeword; the content of each entry is locally encoded
The local guarantee: if 1 ≤ |C_i[y] ∩ (A Δ B)| ≤ ℓ, then C(A) and C(B) differ on the entry C_i[y]
Consider ℓ = 1: C(A) and C(B) differ at least on the entries (such as C_1[2]) hit by exactly one element of A Δ B

25

The Global Encoding
Identify each codeword with a bipartite graph G = ([N], R, E)
For S ⊆ [N] define Γ(S,ℓ) ⊆ R as the set of all y ∈ R for which 1 ≤ |Γ(y) ∩ S| ≤ ℓ
(K, ε, ℓ)-Bounded-Neighbor Disperser: for any S ⊂ [N] such that K ≤ |S| ≤ 2K it holds that |Γ(S,ℓ)| > (1 - ε)|R|

26

The Global Encoding
Take r = log K codewords, where each C_i is identified with a (2^i, ε, ℓ)-Bounded-Neighbor Disperser
For i = log_2 |A Δ B| we have d_H(C(A)_i, C(B)_i) > 1 - ε
In particular d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) > 1 - ε

27

Constructing BNDs
A codeword of length M over a universe of size N; given N and K, we want to optimize M, ℓ, ε and the left-degree D
[Table comparing an optimal (non-explicit) BND against explicit extractor-based and disperser-based constructions; the parameter values shown include D = log(N/K) and M = K·log(N/K) with ℓ = O(1) for the optimal construction, D = 2^((loglog N)^2) and M = K·2^((loglog N)^2) for the extractor-based one, and polylog(N) factors for the disperser-based one]
Recall: a (K, ε, ℓ)-Bounded-Neighbor Disperser satisfies |Γ(S,ℓ)| > (1 - ε)|R| for any S ⊂ [N] with K ≤ |S| ≤ 2K

28

Outline
Lower bound
Equality testing
  Main tool: incremental encoding
  Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems

29

Symmetric Difference Approximation
1. Sketch the input streams into codewords: A ↦ (C(A)_1, ..., C(A)_k) and B ↦ (C(B)_1, ..., C(B)_k)
2. Compare s entries from each pair of codewords; let d_i be the number of differing entries sampled from the i-th pair
3. Output APX = (1 + ρ)^i for the maximal i such that d_i ≥ (1 - ε)s
Guarantee: |A Δ B| ≤ APX ≤ (1 + ρ) · (KD / ((1 - ε)M)) · |A Δ B|, where the factor KD / ((1 - ε)M) is ≈ 1 for the non-explicit construction and polylog(N) for the explicit one
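The decision rule in steps 2 and 3 can be sketched on toy codewords (the real protocol runs it on BND-based encodings; `approximate` is an illustrative name):

```python
# Decision rule of the symmetric-difference approximation: sample s entries
# from each of the k codeword pairs, count disagreements d_i, and output
# (1 + rho)^i for the largest level i with d_i >= (1 - eps) * s.
import random

def approximate(codewords_a, codewords_b, s, rho, eps, seed=0):
    rng = random.Random(seed)
    best = 0  # largest level whose sampled disagreement is dense enough
    for i, (ca, cb) in enumerate(zip(codewords_a, codewords_b), start=1):
        positions = [rng.randrange(len(ca)) for _ in range(s)]
        d_i = sum(ca[p] != cb[p] for p in positions)
        if d_i >= (1 - eps) * s:
            best = i
    return (1 + rho) ** best

# Toy input: the pair at level 3 disagrees everywhere, the others nowhere.
A = ["0000", "0000", "0000"]
B = ["0000", "0000", "1111"]
print(approximate(A, B, s=10, rho=1.0, eps=0.1))  # 8.0 = (1 + 1)^3
```

The geometric grid of candidate answers (1 + ρ)^i is what turns a per-level threshold test into a (1 + ρ)-factor approximation.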

30

Outline
Lower bound
Equality testing
  Main tool: incremental encoding
  Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems

31

Summary
Formalized a realistic model for computation over massive data sets: the adversarial sketch model, combining communication complexity ("adversarial" factors: no secrets, adversarially-chosen inputs) with sketching and streaming over massive data sets

32

Summary
Formalized a realistic model for computation over massive data sets
Determined the complexity of two fundamental tasks: equality testing and symmetric difference approximation
Incremental encoding: the main technical contribution; additional applications?
Recall: S ↦ (C(S)_1, ..., C(S)_r) with d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i))

33

Open Problems
Better explicit approximation for symmetric difference: our (1 + ρ)-approximation is non-explicit, and the explicit approximation only achieves polylog(N)
Approximating various similarity measures: Lp norms, resemblance, ...
The power of adversarial sketching: characterizing the class of functions that can be "efficiently" computed in the adversarial sketch model (sublinear sketches, polylog updates)
Possible approach: a public-coins to private-coins transformation that "preserves" the update time

34

Computational Assumptions
Better schemes using computational assumptions?
Equality testing: incremental collision-resistant hashing [BGG '94] gives significantly smaller sketches; existing constructions either have very long public descriptions or rely on random oracles. Practical constructions without random oracles?
Symmetric difference approximation: not known, even with random oracles!
Thank you!

Can also consider multiple intrusions

Pan-Privacy Model
Data is a stream of items; each item belongs to a user, and the data of different users is interleaved arbitrarily
The curator sees the items, updates its internal state, and produces an output at the end of the stream
Pan-Privacy: for every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private

Adjacency: User Level
Universe U of users whose data is in the stream; x ∈ U
Streams are x-adjacent if they have the same projections of users onto U \ {x}
Example: axbxcxdxxxex and abcdxe are x-adjacent; both project to abcde
This gives a notion of "corresponding locations" in x-adjacent streams
U-adjacent: ∃ x ∈ U for which the streams are x-adjacent; simply "adjacent" if U is understood
Note: streams of different lengths can be adjacent

Example: Stream Density or # Distinct Elements
Universe U of users; estimate how many distinct users in U appear in the data stream
Application: # of distinct users who searched for "flu"
Ideas that don't work:
Naïve: keep a list of users that appeared (bad privacy and space)
Streaming: track a random sub-sample of users (bad privacy), or hash each user and track the minimal hash (bad privacy)

Pan-Private Density Estimator
Inspired by randomized response: store for each user x ∈ U a single bit b_x
Initially, each b_x is 0 w.p. 1/2 and 1 w.p. 1/2
When encountering x, redraw b_x: 0 w.p. 1/2 - ε and 1 w.p. 1/2 + ε
Final output: [(fraction of 1's in table - 1/2)/ε] + noise
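The estimator above can be sketched in a few lines; this is a minimal simulation (the Laplace noise added to the final output is omitted, and `estimate_density` is an illustrative name):

```python
# Pan-private density estimator via randomized response: one bit per user,
# initialized uniformly; whenever a user appears, their bit is redrawn with
# bias eps toward 1. The table alone is differentially private, yet
# (fraction of 1s - 1/2) / eps estimates the fraction of distinct users seen.
import random

def estimate_density(stream, universe, eps, rng):
    bit = {x: rng.random() < 0.5 for x in universe}  # unbiased initialization
    for x in stream:                                 # redraw with bias eps
        bit[x] = rng.random() < 0.5 + eps
    frac_ones = sum(bit.values()) / len(universe)
    return (frac_ones - 0.5) / eps * len(universe)   # + output noise, omitted

rng = random.Random(1)
universe = range(1000)
stream = [x for x in range(400) for _ in range(3)]  # 400 distinct users, each seen 3 times
print(round(estimate_density(stream, universe, 0.25, rng)))
```

Redrawing on every appearance keeps a seen user's bit distributed as D_1 no matter how many times they appear, which is why the estimator counts distinct users and why an intrusion at any single point learns little.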

Pan-Privacy
If a user never appeared: entry drawn from D_0 (the initial, uniform distribution)
If a user appeared any number of times: entry drawn from D_1 (the redrawn, ε-biased distribution)
D_0 and D_1 are 4ε-differentially private