Load-Balanced Query Dissemination in Privacy-Aware Online Communities


Page 1: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Emiran Curtmola @ UC San Diego
Alin Deutsch @ UC San Diego
K.K. Ramakrishnan @ at&t
Divesh Srivastava @ at&t

Page 2: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

SIGMOD, June 2010

Motivation: data in online communities

Typical such applications are centralized:
▪ Hosted online communities
▪ Search engines

Limitations:
▪ Disintermediation of publishers from queriers
▪ Publishers need to give up their data
▪ The central site controls the visibility of publishers to queriers
▪ Publishers lose their right to privacy

Page 3: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Free data exchange within the community
▪ Some users want to remain autonomous
▪ User privacy (i.e., not all users may want to reveal their true identity)
▪ Publishers express their opinions anonymously to avoid association with sensitive or controversial issues (e.g., politics, race, religion)

User autonomy + privacy suggest a decentralized infrastructure

Page 4: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Make it safer for publishers to join and post data
▪ Prevent association of sensitive topics with the publishers that contribute to them, even if nodes are compromised

Publisher k-anonymity: for every publisher p and data item d, hide p in a k-protected crowd of publishers, i.e., there are at least k-1 other potential publishers of the same d
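The definition above can be checked mechanically. The sketch below is only an illustration: it reuses the advertised terms from the News & Blogs example in this deck and treats the whole community as a single crowd, which is a simplifying assumption (the deck builds crowds below edge routers). The function names `crowd_sizes` and `unprotected_items` are hypothetical.

```python
from collections import defaultdict

# Advertised terms per publisher, copied from the News & Blogs example.
ADVERTISED = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Beijing", "Tibet", "yak tea", "Hong Kong", "money"},
    "P4": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P5": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P6": {"Olympics", "Tibet", "stocks", "money"},
    "P7": {"Olympics", "yak tea", "stocks", "money"},
    "P8": {"Olympics", "yak tea", "stocks", "money"},
}


def crowd_sizes(advertised):
    """Map each data item d to the number of potential publishers of d."""
    sizes = defaultdict(int)
    for terms in advertised.values():
        for term in terms:
            sizes[term] += 1
    return dict(sizes)


def unprotected_items(advertised, k):
    """Return the items whose crowd has fewer than k publishers."""
    return sorted(d for d, n in crowd_sizes(advertised).items() if n < k)


# Assumption: the whole community counts as one crowd.
print(unprotected_items(ADVERTISED, 3))
```

With k = 2 every item is protected in this toy data; with k = 3 the items advertised by only two publishers are flagged.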

Page 5: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

News & Blogs: advertised data items about the publisher's articles

Publisher  Advertised data items
P1         Beijing, Tibet, stocks, poverty, money
P2         Beijing, yak tea, Hong Kong, poverty
P3         Beijing, Tibet, yak tea, Hong Kong, money
P4         Beijing, Olympics, yak tea, stocks, money
P5         Beijing, Olympics, yak tea, stocks, money
P6         Olympics, Tibet, stocks, money
P7         Olympics, yak tea, stocks, money
P8         Olympics, yak tea, stocks, money

Query Q1: find the articles mentioning the Olympics in Beijing
Query Q2: find the articles about Tibet
Query Q3: find the articles mentioning poverty
Query Q4: find the articles mentioning money in Hong Kong

[Figure: the community data collection. Each publisher P1-P8 keeps its own local XML data.]

How to query ad-hoc distributed data sources while preserving user privacy?

▪ Allow publishers to keep complete control over their data
▪ Disseminate queries in the network, not data
▪ Publishers answer queries at their own discretion
▪ Published data is not traceable back to publishers, even if nodes are compromised

Page 6: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Infrastructure setup that supports:
▪ Distribution of data
▪ Large number of decentralized publishers and consumers
▪ User privacy
▪ Efficient query routing (to avoid flooding the network)

Page 7: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Build an overlay network to act as a distributed index

Peers are organized into logical query dissemination trees (QDTs)

Use QDTs to disseminate queries using node summaries

P1's advertised set of terms: Beijing, Tibet, stocks, poverty, money

P2's advertised set of terms: Beijing, yak tea, Hong Kong, poverty

Node 3's summary (set of terms): Beijing, Tibet, stocks, poverty, money, yak tea, Hong Kong

A node's summary is the union of its subtrees' summaries.

[Figure: a QDT with numbered routers as internal nodes and publishers P1-P8 as leaves.]
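A minimal sketch of how a summary can be computed bottom-up as the union of subtree summaries. The `Node` class is hypothetical, and attaching P1 and P2 directly under node 3 is an assumption made only to reproduce the summary shown on this slide.

```python
class Node:
    """A QDT node: a router with children, or a publisher leaf with terms."""

    def __init__(self, name, children=(), terms=()):
        self.name = name
        self.children = list(children)
        self.terms = set(terms)

    def summary(self):
        """Union of the node's own terms and all of its subtrees' summaries."""
        result = set(self.terms)
        for child in self.children:
            result |= child.summary()
        return result


p1 = Node("P1", terms=["Beijing", "Tibet", "stocks", "poverty", "money"])
p2 = Node("P2", terms=["Beijing", "yak tea", "Hong Kong", "poverty"])
# Assumed toy topology: router 3 sits directly above P1 and P2.
node3 = Node("3", children=[p1, p2])

print(sorted(node3.summary()))
```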

Page 8: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

[Figure: Q3 is forwarded down the QDT only along branches whose summaries contain the query terms.]

Q3 = "poverty"

Only P1 and P2 publish articles about poverty.

Pruning: at each node, check set inclusion of the query in the node's summary (Bloom filter).
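The pruning step can be sketched with plain Python sets standing in for the Bloom filters. The tree shape and router names (`root`, `r1`, `r2`) are hypothetical; the publisher terms come from the News & Blogs example. A router forwards the query to a child only if every query term is in that child's summary.

```python
# Hypothetical toy QDT: router -> children. Leaves are publishers.
TREE = {"root": ["r1", "r2"], "r1": ["P1", "P2"], "r2": ["P4", "P6"]}
TERMS = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P4": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P6": {"Olympics", "Tibet", "stocks", "money"},
}


def build_summaries(tree, terms, node, out):
    """Fill out[node] with the union of the summaries below node."""
    if node in terms:
        out[node] = set(terms[node])
    else:
        out[node] = set()
        for child in tree[node]:
            out[node] |= build_summaries(tree, terms, child, out)
    return out[node]


def disseminate(tree, summaries, node, query, received):
    """Route query below node, pruning children whose summary lacks a term."""
    received.append(node)
    if node not in tree:  # publisher leaf
        return [node]
    matches = []
    for child in tree[node]:
        if query <= summaries[child]:  # set inclusion test
            matches += disseminate(tree, summaries, child, query, received)
    return matches


SUMMARIES = {}
build_summaries(TREE, TERMS, "root", SUMMARIES)
received = []
matches = disseminate(TREE, SUMMARIES, "root", {"poverty"}, received)
print(matches, received)
```

In this toy tree the subtree under `r2` never sees Q3. For multi-term queries a router summary can contain all terms even when no single publisher below it does, so the final check still happens at the publishers.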

Page 9: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Minimum information at each node
▪ No node has global information
▪ Node summaries are vectors of counters (Bloom filters) representing hash values of advertised data items
▪ Queries reach publishers in such a manner that users cannot tell whether a publisher did not respond or has no matching documents

[Figure: the QDT routing Q3 = "poverty" toward the publishers.]
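A summary held as a vector of counters can be sketched as a counting Bloom filter. The vector size, the number of hash functions and the use of SHA-256 below are illustrative assumptions, not parameters taken from the slides.

```python
import hashlib


class CountingBloomFilter:
    """Vector of counters indexed by hash values of advertised items."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only valid for items that were added before.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def might_contain(self, item):
        """False means definitely absent; True may be a false positive."""
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def merge(self, other):
        """Counter-wise sum, the counting analogue of a set union."""
        merged = CountingBloomFilter(self.size, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


p1_filter = CountingBloomFilter()
p1_filter.add("poverty")
p2_filter = CountingBloomFilter()
p2_filter.add("Hong Kong")
router_filter = p1_filter.merge(p2_filter)
print(router_filter.might_contain("poverty"))
```

Counters, unlike single bits, allow an advertised item to be withdrawn again, and a router can combine its children's vectors by adding them.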

Page 10: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ If an edge node is compromised
▪ Risk: individual updates of node summaries (from publishers to edge routers) may expose the publishers
▪ Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and...

[Figure: the QDT with the compromised edge router highlighted; the publishers below it form a protected crowd.]

Page 11: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Solution: publisher k-anonymity. Hide users in protected crowds of at least k publishers and use secure multi-party (SMP) computation inside crowds to advertise updates of published terms to the edge routers

[Figure: a publisher 3-anonymous protected crowd (P1, P2, P3) below edge router 4. A random value R is added (+R), the publishers add their updates (+Upd1, +Upd2, +Upd3), and R is removed (-R), so edge router 4 receives only Upd1 + Upd2 + Upd3.]
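The +R / -R masking in the figure resembles a ring-style secure sum. The sketch below is one possible reading, not the protocol from the paper: the ring order, the modulus and the function name `masked_sum` are assumptions, and a real SMP protocol also has to deal with colluding crowd members.

```python
import secrets

MODULUS = 2**32  # assumed; must exceed any true counter sum


def masked_sum(updates, modulus=MODULUS):
    """Sum equal-length counter vectors without exposing any single one.

    The initiator seeds the running total with a random mask R (+R), each
    publisher adds its own update, and the initiator removes the mask (-R)
    before the total is sent to the edge router. Returns the total and the
    masked intermediate values that were passed between publishers.
    """
    length = len(updates[0])
    mask = [secrets.randbelow(modulus) for _ in range(length)]
    running = list(mask)  # +R
    transcript = []
    for update in updates:  # +Upd1, +Upd2, +Upd3
        running = [(r + u) % modulus for r, u in zip(running, update)]
        transcript.append(list(running))
    total = [(r - m) % modulus for r, m in zip(running, mask)]  # -R
    return total, transcript


# Hypothetical counter-vector updates from P1, P2 and P3.
updates = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
total, transcript = masked_sum(updates)
print(total)
```

The edge router only ever sees `total`; the intermediate values in `transcript` are offset by the random mask.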

Page 12: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ If an internal node is compromised
▪ Risk: the node summary of advertised terms is exposed. The downstream subtree may contain sensitive content, but the crowd of publishers is now even bigger.

[Figure: the QDT with the compromised internal node highlighted; all publishers below it form the protected crowd.]

Page 13: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

The tree topology introduces congestion at upper QDT levels during query dissemination

How to relieve the congestion?

Page 14: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Overlay multiple logical QDTs over the same underlay network
▪ A physical node belongs to multiple logical QDTs, but at different levels

Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs

Page 15: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

[Figure: four QDTs (QDT1, QDT2, QDT3, QDT4) over the same set of nodes, with node 1 marked in each tree.]

Page 16: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

▪ Partition the community data collection into disjoint blocks
▪ Build one QDT per block: QDTi groups all publishers with terms in Bi

Routing a query:
▪ Terms in the query determine the relevant blocks
▪ Send the query to the corresponding QDT
▪ Check the full query with publishers

Block  Terms
B1     Beijing, Olympics
B2     Tibet, yak tea
B3     Hong Kong, stocks
B4     poverty, money

Q3 = "poverty": Q3 falls in B4, so use QDT4
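The block lookup can be sketched as follows, using the block table from this slide. The function name `relevant_qdts` is hypothetical.

```python
# Block table from the slide: disjoint blocks of terms, one QDT per block.
BLOCKS = {
    "B1": {"Beijing", "Olympics"},
    "B2": {"Tibet", "yak tea"},
    "B3": {"Hong Kong", "stocks"},
    "B4": {"poverty", "money"},
}
QDT_OF_BLOCK = {"B1": "QDT1", "B2": "QDT2", "B3": "QDT3", "B4": "QDT4"}


def relevant_qdts(query_terms):
    """Return the QDTs whose block contains at least one query term."""
    terms = set(query_terms)
    return sorted(
        QDT_OF_BLOCK[block]
        for block, block_terms in BLOCKS.items()
        if block_terms & terms
    )


print(relevant_qdts({"poverty"}))
print(relevant_qdts({"Olympics", "Beijing"}))
print(relevant_qdts({"Hong Kong", "money"}))
```

A query whose terms span several blocks yields several candidate trees, so a choice has to be made about which tree to route on.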

Page 17: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

[Figure: QDT1, QDT2, QDT3 and QDT4. Q1 = "Olympics", "Beijing" and Q3 = "poverty" are routed on different trees.]

Page 18: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Q4 = "Hong Kong", "money"

Route Q4 on both trees?

Query selectivity optimization techniques: choose the most selective QDT to route on, by maintaining only 1-3% of popular data items (see paper)

Q4's terms fall in B3 (Hong Kong) and B4 (money), so the candidate trees are QDT3 and QDT4.
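One way to picture the selectivity choice, offered as a guess at the idea rather than the technique from the paper: keep publisher counts only for a few popular items and assume any other term is rare. The table `POPULAR_COUNTS`, the default count and the function `pick_block` are all hypothetical; the count for "money" is taken from the News & Blogs example (7 of 8 publishers).

```python
BLOCK_OF_TERM = {
    "Beijing": "B1", "Olympics": "B1",
    "Tibet": "B2", "yak tea": "B2",
    "Hong Kong": "B3", "stocks": "B3",
    "poverty": "B4", "money": "B4",
}

# Hypothetical statistics kept only for popular items.
POPULAR_COUNTS = {"money": 7}


def pick_block(query_terms, popular_counts, default_count=1):
    """Pick the block of the query term estimated to be the most selective.

    Terms missing from popular_counts are assumed rare (default_count).
    """
    def estimated_publishers(term):
        return popular_counts.get(term, default_count)

    # Sorting first makes tie-breaking deterministic.
    best_term = min(sorted(query_terms), key=estimated_publishers)
    return BLOCK_OF_TERM[best_term]


print(pick_block({"Hong Kong", "money"}, POPULAR_COUNTS))
```

Under these assumptions Q4 is routed on the tree for B3, since "Hong Kong" is not among the popular items.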

Page 19: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Our solution

Page 20: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Empirical fact: the upper two levels in a QDT are the most congested

Model: cyclical permutation of nodes on the tree levels

Number of QDTs for load balance = number of legal permutations (i.e., permutations that do not break the fairness property)

Fairness property: all routers appear precisely once in the top two levels of any QDT
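A toy reading of the cyclical-permutation model, under stated assumptions: routers are listed in level order, every tree has the same fanout, and the fairness property is read as "each router is in the top two levels of exactly one of the QDTs". The functions `rotated_orders` and `is_fair` are hypothetical and do not reproduce the analytical model of the paper.

```python
def rotated_orders(routers, fanout):
    """Level-order router sequences, one per QDT, each a cyclic shift."""
    top = 1 + fanout  # the root plus its children
    count = len(routers) // top  # shifts before a router would repeat on top
    return [routers[i * top:] + routers[:i * top] for i in range(count)]


def is_fair(orders, fanout):
    """True if every router is in the top two levels of exactly one order."""
    top = 1 + fanout
    seen = [router for order in orders for router in order[:top]]
    return sorted(seen) == sorted(orders[0])


routers = list(range(1, 13))  # 12 hypothetical routers
orders = rotated_orders(routers, fanout=2)
print(len(orders), is_fair(orders, fanout=2))
```

With 12 routers and fanout 2 this yields four shifted orders; when the router count is not a multiple of the top-two-levels size, some routers never reach the top and the check fails.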

Page 21: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Overall throughput depends heavily on the most congested node

Look at node stress in terms of the number of messages:
▪ going into a node: Processing Load at a node (PLoad)
▪ going out of a node: Forwarding Load at a node (FLoad)

Throughput indicator: compare how far the peak load (k-QDTs) is from the ideal load (avg. load for 1-QDT = nr. msgs / nr. nodes)
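The indicator can be sketched as a relative gap between the peak load and the ideal load. Expressing the gap as a percentage of the ideal load is an assumption about how the "how close" figures are computed, and the per-node message counts below are made up for illustration.

```python
def ideal_load(total_messages, num_nodes):
    """Average load per node: nr. msgs / nr. nodes."""
    return total_messages / num_nodes


def gap_to_ideal(loads, ideal):
    """Percentage by which the most loaded node exceeds the ideal load."""
    peak = max(loads.values())
    return 100.0 * (peak - ideal) / ideal


# Hypothetical PLoad (messages going into each node).
pload = {"n1": 30, "n2": 20, "n3": 25, "n4": 25}
ideal = ideal_load(sum(pload.values()), len(pload))
gap = gap_to_ideal(pload, ideal)
print(ideal, gap)
```

The same function applies to FLoad by passing the count of messages going out of each node.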

Page 22: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Experiment 1: PLoad for Scribe QDT topology
▪ Result: the number of QDTs for load balance found experimentally coincides with that given by our analytical model
▪ Load balance with:
  ▪ How close: 32% closest to ideal PLoad
  ▪ How close: 923% closest to ideal FLoad

To balance FLoad, node fanouts need to be the same

Experiment 2: FLoad for fanout-balanced QDT topologies
▪ How close: 18% closest to ideal PLoad
▪ How close: 130% closest to ideal FLoad

Page 23: Load-Balanced Query Dissemination in Privacy-Aware Online Communities

Propose a novel publishing infrastructure

Empowers publishers to join and post without being associated with (sensitive) content

Generic solution: it extracts the maximum load balance supported by the QDT topology
