Social Piggybacking: Leveraging Common Friends to Generate Event Streams
DESCRIPTION
Presentation at the Social Networking Systems (SNS) workshop 2012, colocated with EuroSys.
TRANSCRIPT
Social Piggybacking: Leveraging Common Friends to Generate Event Streams
Marco Serafini
Joint work with Aris Gionis, Flavio Junqueira, Vincent Leroy, and Ingmar Weber
Social Event Streams: Background
Social event feeds are a major feature of social networking systems: they account for 70% of page views on Tumblr.
Generating event streams
[Architecture diagram: users talk to front-ends running the application logic; the front-ends act as data store clients and query the data stores, which hold the social graph.]
Two types of user actions: share an event, and generate a new event stream.
Optimizations
Materialized views, one per user. A view contains the user's own events; it can also contain events of the other users that the user follows.
We abstract away application-specific "relevance" filters: all views contain all events stored in them, and all queries to a view return all events in the view.
Example VIEW (each row is an EVENT):
Plato 12.00 "Having the shadow of an ideal sandwich"
Hume 12.01 "I just feel a good taste in my mouth"
Kant 12.02 "Dudes, just eat it and stop blabbering"
GOAL: Optimizing throughput
The throughput of event stream generation is proportional to the amount of data being transferred. Partitioning social graphs is impossible (or at least very, very hard).
Existing approaches to optimize throughput: push-all, pull-all, and hybrid.
Pull-all
Write to your own view only; read from all your friends' views. Simpler, and good with frequent writes.
[Diagram: a WRITE from Alice updates only Alice's view; a READ from Charlie pulls from the views of Alice, Bob, and Charlie.]
Push-all
Write to all your friends' views; read from your own view only. Good with frequent reads.
[Diagram: WRITEs from Alice and Bob are pushed into Charlie's view; a READ from Charlie touches only Charlie's view.]
Hybrid [Silberstein et al., SIGMOD 2010]
Per-edge choice between push and pull, based on the Production Rate (PR) of the producer and the Consumption Rate (CR) of the consumer. This yields the minimum per-edge throughput cost.
For an edge A -> B:
If PR(A) < CR(B): PUSH. A writes onto B's view. Cost: PR(A).
If PR(A) ≥ CR(B): PULL. B reads from A's view. Cost: CB(B) — that is, CR(B).
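The per-edge rule above can be sketched in a few lines (a minimal illustration with made-up rates; the function and variable names are mine, not the paper's):

```python
# Hybrid per-edge rule: pick the cheaper of push and pull for an edge A -> B.
def schedule_edge(pr_a, cr_b):
    """Return (strategy, cost) given the production rate of A
    and the consumption rate of B."""
    if pr_a < cr_b:
        return ("push", pr_a)  # A writes onto B's view
    return ("pull", cr_b)      # B reads from A's view

print(schedule_edge(2.0, 5.0))  # ('push', 2.0): B reads often, pushing is cheaper
print(schedule_edge(7.0, 3.0))  # ('pull', 3.0): A writes often, pulling is cheaper
```

Note that ties (PR = CR) go to pull, matching the ≥ in the rule above; either choice has the same cost.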
Request schedule
[Architecture diagram: as before, users, front-ends (application logic) acting as data store clients, and data stores holding the social graph.]
The social graph stores the Request Schedule: a per-edge push or pull decision. This makes it easy to integrate in existing systems.
Social Piggybacking: Contribution
Idea: two friends are likely to share many common friends, so their views can be used as HUBS to prune edges.
[Diagram: A PUSHes new events onto B's view; C PULLs events by both B and A from B's view. B's view acts as the HUB, and the edge A -> C becomes a FREE EDGE: neither push nor pull.]
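A toy cost calculation shows why the hub pays off (hypothetical rates of my choosing; the per-edge hybrid cost min(PR, CR) is from the previous slide):

```python
# Toy cost comparison on the triangle A -> B, A -> C, B -> C
# (hypothetical rates; per-edge hybrid cost is min(PR(producer), CR(consumer))).
PR = {"A": 3.0, "B": 5.0}
CR = {"B": 4.0, "C": 2.0}

# Hybrid: every edge pays its own push-or-pull cost.
hybrid = (min(PR["A"], CR["B"])    # A -> B
          + min(PR["A"], CR["C"])  # A -> C
          + min(PR["B"], CR["C"])) # B -> C

# Piggybacking through hub B: A pushes to B's view, C pulls from B's view,
# and the cross edge A -> C is free.
piggyback = PR["A"] + CR["C"]

print(hybrid, piggyback)  # 7.0 5.0: the hub saves the cost of edge A -> C
```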
Social Dissemination Problem
Inputs: the social graph, and per-node production and consumption rates.
Output: a request schedule that minimizes cost. Each edge needs to be covered, either through a hub, a push, or a pull.
Requirements: bounded staleness, and non-triviality.
Analysis
All admissible request schedules are such that, for each edge, either the edge is served directly (using a push or a pull), or the edge is served through a hub. Any other schedule is not admissible.
The Social Dissemination problem is NP-hard.
Nosy: A Simple Heuristic
Nosy looks for hubgraphs (X, w, y): a set of producers X pushing to a hub w, whose view a consumer y pulls. Cost with piggybacking: PR(X) + CR(y), with cross edges free.

Nosy Phase 1: Add elements to X sets
For each edge (w, y): build the largest hubgraph (X, w, y). Piggybacking cost: PR(X) + CR(y); cross edges X -> y are free. Piggyback if cheaper than hybrid.
Nosy Phase 2: Add elements to Y sets
For each edge (w, y): let X_y be the producers of y that already push to w. Piggybacking cost: CR(y); cross edges X_y -> y are free. Piggyback if cheaper than hybrid.
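Under assumptions about the data layout (the graph as a dict mapping each producer to its set of followers; all names are mine, not the authors'), Phase 1's cost comparison for a single edge can be sketched as:

```python
def producers_of(graph, node):
    """Nodes with an edge into `node`, i.e. users whose events `node` consumes."""
    return {u for u, followers in graph.items() if node in followers}

def hubgraph_costs(graph, PR, CR, w, y):
    """For edge (w, y), build the largest hubgraph (X, w, y) and return
    (hybrid_cost, piggyback_cost) over the edges it covers."""
    # X: common producers of hub w and consumer y; their cross edges X -> y are free.
    X = producers_of(graph, w) & producers_of(graph, y)
    covered = [(x, w) for x in X] + [(x, y) for x in X] + [(w, y)]
    hybrid = sum(min(PR[a], CR[b]) for a, b in covered)  # per-edge push/pull
    piggy = sum(PR[x] for x in X) + CR[y]                # pushes to w + one pull by y
    return hybrid, piggy

# Tiny example: a and b both produce for the hub w and for the consumer y.
graph = {"a": {"w", "y"}, "b": {"w", "y"}, "w": {"y"}}
PR = {"a": 1.0, "b": 1.0, "w": 2.0}
CR = {"w": 3.0, "y": 2.0}
hybrid, piggy = hubgraph_costs(graph, PR, CR, "w", "y")
print(hybrid, piggy)  # 6.0 4.0: piggyback, since 4.0 < 6.0
```

Phase 2 then reuses pushes installed in Phase 1: when the producers already push to the hub, adding another consumer y costs only CR(y).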
Experiments: Flickr and Twitter graphs
Twitter (Aug 2009) and Flickr (Apr 2008) social graphs, sampled using random walks, which preserve graph properties.
Average sample sizes: Flickr, 4k nodes and 112k edges; Twitter, 25k nodes and 158k edges.
Production and consumption rates are generated with a write:read ratio of 1:5; PR (resp. CR) increases logarithmically with out-degree (resp. in-degree).
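A hypothetical generator in this spirit (the logarithmic dependence on degree and the 1:5 write:read ratio come from the talk; the exact formula below is my own assumption):

```python
import math

def make_rates(out_deg, in_deg, write_read_ratio=0.2):
    """Generate per-node rates growing logarithmically with degree,
    then rescale production so total writes : total reads = 1 : 5."""
    PR = {u: 1.0 + math.log1p(d) for u, d in out_deg.items()}
    CR = {u: 1.0 + math.log1p(d) for u, d in in_deg.items()}
    scale = write_read_ratio * sum(CR.values()) / sum(PR.values())
    return {u: r * scale for u, r in PR.items()}, CR
```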
Metrics and Results
Metric: improvement over the hybrid optimization (the baseline): Gain(A) = Cost(BASE) / Cost(A) − 1.
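The metric as a one-liner, with an example tied to the number quoted in the conclusions:

```python
# The slide's metric: Gain(A) = Cost(BASE) / Cost(A) - 1.
def gain(cost_base, cost_a):
    return cost_base / cost_a - 1.0

# A baseline that is 2.4x more expensive corresponds to a gain of 1.4 (140%).
print(gain(2.4, 1.0))
```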
Results
1. Nosy exploits the community structure.
2. It works well under a variety of parameters.

Clustering Coefficient
After sampling, we keep only a fraction s of the edges.
B+ is a trivial extension of the baseline: lock the push edges; pull edges that can be served using hubs are free.
More clustering gives more gain for Nosy, but not for B+.
Varied Workload
Significant gains. Asymptotically, i.e. with all reads, the per-edge push-based solution is optimal, so the gain tends to zero.
Effect of Colocation
As the system size grows, the gains reach their maximum. For very small systems there is little communication, so there is little room for improvement.
Conclusions
Social Piggybacking is a very promising approach: the baseline has up to 2.4 times higher throughput cost, and Social Piggybacking is easy to integrate in existing systems.
Next steps: run on full social graphs, and evaluate the throughput gain on an actual social networking system.