Social Piggybacking: Leveraging Common Friends to Generate Event Streams
DESCRIPTION
Presentation at the Social Networking Systems (SNS) workshop 2012, colocated with EuroSys.
TRANSCRIPT
Social Piggybacking: Leveraging Common Friends to Generate Event Streams
Marco Serafini
Joint work with Aris Gionis, Flavio Junqueira, Vincent Leroy, and Ingmar Weber
Social Event Streams: Background
Social event feeds are a major feature of social networking systems: they account for 70% of page views on Tumblr.
Generating event streams
[Architecture diagram: users talk to front-ends running the application logic; the front-ends act as data store clients and query the data stores, which hold the social graph.]
Two types of user actions: share an event, and generate a new event stream.
Optimizations
Materialized views, one per user. A view contains the user's own events; it can also contain events of the other users that the user follows.
We abstract away application-specific "relevance" filters: all views contain all events stored in them, and all queries to a view return all events in the view.
Example VIEW (each row is an EVENT):
Plato 12.00 "Having the shadow of an ideal sandwich"
Hume 12.01 "I just feel a good taste in my mouth"
Kant 12.02 "Dudes, just eat it and stop blabbering"
GOAL: Optimizing throughput
The throughput of event stream generation is proportional to the amount of data being transferred. Partitioning social graphs is impossible (or at least very, very hard).
Existing approaches to optimize throughput: push-all, pull-all, and hybrid.
Pull-all
Write to your own view only; read from all your friends' views. Simpler, and good with frequent writes.
[Diagram: a WRITE from Alice updates only Alice's view; a READ from Charlie pulls from the views of Alice, Bob, and Charlie.]
Push-all
Write to all your friends' views; read from your own view only. Good with frequent reads.
[Diagram: WRITEs from Alice and Bob are pushed into Charlie's view; a READ from Charlie touches only Charlie's view.]
Hybrid [Silberstein et al., SIGMOD 2010]
Per-edge choice between push and pull, based on the Production Rate (PR) of the producer and the Consumption Rate (CR) of the consumer. This yields the minimum per-edge throughput cost.
For an edge A -> B:
If PR(A) < CR(B): PUSH. A writes onto B's view. Cost: PR(A).
If PR(A) ≥ CR(B): PULL. B reads from A's view. Cost: CB(B) — that is, CR(B).
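The per-edge rule above can be sketched in a few lines (a minimal illustration with made-up rates; the function and variable names are mine, not the paper's):

```python
# Hybrid per-edge rule: pick the cheaper of push and pull for an edge A -> B.
def schedule_edge(pr_a, cr_b):
    """Return (strategy, cost) given the production rate of A
    and the consumption rate of B."""
    if pr_a < cr_b:
        return ("push", pr_a)  # A writes onto B's view
    return ("pull", cr_b)      # B reads from A's view

print(schedule_edge(2.0, 5.0))  # ('push', 2.0): B reads often, pushing is cheaper
print(schedule_edge(7.0, 3.0))  # ('pull', 3.0): A writes often, pulling is cheaper
```

Note that ties (PR = CR) go to pull, matching the ≥ in the rule above; either choice has the same cost.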
Request schedule
[Architecture diagram: as before, users, front-ends (application logic) acting as data store clients, and data stores holding the social graph.]
The social graph stores the Request Schedule: a per-edge push or pull decision. This makes it easy to integrate in existing systems.
Social Piggybacking: Contribution
Idea: two friends are likely to share many common friends, so their views can be used as HUBS to prune edges.
[Diagram: A PUSHes new events onto B's view; C PULLs events by both B and A from B's view. B's view acts as the HUB, and the edge A -> C becomes a FREE EDGE: neither push nor pull.]
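A toy cost calculation shows why the hub pays off (hypothetical rates of my choosing; the per-edge hybrid cost min(PR, CR) is from the previous slide):

```python
# Toy cost comparison on the triangle A -> B, A -> C, B -> C
# (hypothetical rates; per-edge hybrid cost is min(PR(producer), CR(consumer))).
PR = {"A": 3.0, "B": 5.0}
CR = {"B": 4.0, "C": 2.0}

# Hybrid: every edge pays its own push-or-pull cost.
hybrid = (min(PR["A"], CR["B"])    # A -> B
          + min(PR["A"], CR["C"])  # A -> C
          + min(PR["B"], CR["C"])) # B -> C

# Piggybacking through hub B: A pushes to B's view, C pulls from B's view,
# and the cross edge A -> C is free.
piggyback = PR["A"] + CR["C"]

print(hybrid, piggyback)  # 7.0 5.0: the hub saves the cost of edge A -> C
```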
Social Dissemination Problem
Inputs: the social graph, and per-node production and consumption rates.
Output: a request schedule that minimizes cost. Each edge needs to be covered, either through a hub, a push, or a pull.
Requirements: bounded staleness, and non-triviality.
Analysis
All admissible request schedules are such that, for each edge, either the edge is served directly (using a push or a pull), or the edge is served through a hub. Any other schedule is not admissible.
The Social Dissemination problem is NP-hard.
Nosy: A Simple Heuristic
Nosy looks for hubgraphs (X, w, y): a set of producers X pushing to a hub w, whose view a consumer y pulls. Cost with piggybacking: PR(X) + CR(y), with cross edges free.

Nosy Phase 1: Add elements to X sets
For each edge (w, y): build the largest hubgraph (X, w, y). Piggybacking cost: PR(X) + CR(y); cross edges X -> y are free. Piggyback if cheaper than hybrid.
Nosy Phase 2: Add elements to Y sets
For each edge (w, y): let X_y be the producers of y that already push to w. Piggybacking cost: CR(y); cross edges X_y -> y are free. Piggyback if cheaper than hybrid.
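Under assumptions about the data layout (the graph as a dict mapping each producer to its set of followers; all names are mine, not the authors'), Phase 1's cost comparison for a single edge can be sketched as:

```python
def producers_of(graph, node):
    """Nodes with an edge into `node`, i.e. users whose events `node` consumes."""
    return {u for u, followers in graph.items() if node in followers}

def hubgraph_costs(graph, PR, CR, w, y):
    """For edge (w, y), build the largest hubgraph (X, w, y) and return
    (hybrid_cost, piggyback_cost) over the edges it covers."""
    # X: common producers of hub w and consumer y; their cross edges X -> y are free.
    X = producers_of(graph, w) & producers_of(graph, y)
    covered = [(x, w) for x in X] + [(x, y) for x in X] + [(w, y)]
    hybrid = sum(min(PR[a], CR[b]) for a, b in covered)  # per-edge push/pull
    piggy = sum(PR[x] for x in X) + CR[y]                # pushes to w + one pull by y
    return hybrid, piggy

# Tiny example: a and b both produce for the hub w and for the consumer y.
graph = {"a": {"w", "y"}, "b": {"w", "y"}, "w": {"y"}}
PR = {"a": 1.0, "b": 1.0, "w": 2.0}
CR = {"w": 3.0, "y": 2.0}
hybrid, piggy = hubgraph_costs(graph, PR, CR, "w", "y")
print(hybrid, piggy)  # 6.0 4.0: piggyback, since 4.0 < 6.0
```

Phase 2 then reuses pushes installed in Phase 1: when the producers already push to the hub, adding another consumer y costs only CR(y).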
Experiments: Flickr and Twitter graphs
Twitter (Aug 2009) and Flickr (Apr 2008) social graphs, sampled using random walks, which preserve graph properties.
Average sample sizes: Flickr, 4k nodes and 112k edges; Twitter, 25k nodes and 158k edges.
Production and consumption rates are generated with a write:read ratio of 1:5; PR (resp. CR) increases logarithmically with out-degree (resp. in-degree).
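A hypothetical generator in this spirit (the logarithmic dependence on degree and the 1:5 write:read ratio come from the talk; the exact formula below is my own assumption):

```python
import math

def make_rates(out_deg, in_deg, write_read_ratio=0.2):
    """Generate per-node rates growing logarithmically with degree,
    then rescale production so total writes : total reads = 1 : 5."""
    PR = {u: 1.0 + math.log1p(d) for u, d in out_deg.items()}
    CR = {u: 1.0 + math.log1p(d) for u, d in in_deg.items()}
    scale = write_read_ratio * sum(CR.values()) / sum(PR.values())
    return {u: r * scale for u, r in PR.items()}, CR
```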
Metrics and Results
Metric: improvement over the hybrid optimization (the baseline): Gain(A) = Cost(BASE) / Cost(A) − 1.
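The metric as a one-liner, with an example tied to the number quoted in the conclusions:

```python
# The slide's metric: Gain(A) = Cost(BASE) / Cost(A) - 1.
def gain(cost_base, cost_a):
    return cost_base / cost_a - 1.0

# A baseline that is 2.4x more expensive corresponds to a gain of 1.4 (140%).
print(gain(2.4, 1.0))
```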
Results
1. Nosy exploits the community structure.
2. It works well under a variety of parameters.

Clustering Coefficient
After sampling, we keep only a fraction s of the edges.
B+ is a trivial extension of the baseline: lock the push edges; pull edges that can be served using hubs are free.
More clustering gives more gain for Nosy, but not for B+.
Varied Workload
Significant gains. Asymptotically, i.e. with all reads, the per-edge push-based solution is optimal, so the gain tends to zero.
Effect of Colocation
As the system size grows, the gains reach their maximum. For very small systems there is little communication, so there is little room for improvement.
Conclusions
Social Piggybacking is a very promising approach: the baseline has up to 2.4 times higher throughput cost, and Social Piggybacking is easy to integrate in existing systems.
Next steps: run on full social graphs, and evaluate the throughput gain on an actual social networking system.