Joint Statistical Meetings 2008

Data Analysis at Facebook

Jeff Hammerbacher, Ding Zhou*

Facebook Inc.

Outline

• How does Facebook work

• Managing Big Data

• Data Analysis for Business Intelligence

• Data Analysis for “Artificial Intelligence”

• Questions

How does Facebook work?

Profile page - content generation portal

Newsfeed page - content consumption portal

Friends page - social graph portal

App page - social app platform

Facebook Data

▪ Social Graph Data
  ▪ The Nodes:
    ▪ 100m+ users; 100+ dimensions per user (numerical, text, categorical);
    ▪ 350k registrations daily;
  ▪ The Edges:
    ▪ 200+ friends per user (median);
    ▪ 20 categories of edges (Facebook friends, co-workers, family, etc.);
▪ Social Behavior Data
  ▪ Social Interactions: interactions among users, via 100+ interaction types;
  ▪ Social Actions: between users and 33k+ Facebook apps, via 200+ action types;
▪ Social Content Data
  ▪ Content of Posts, Notes, Photos, Video, etc.

Managing Big Data

▪ Data scale [backend]:
  ▪ Over 1.3 PB raw capacity in largest cluster;
  ▪ Nearly 2 TB uncompressed data per day;
  ▪ Over 20 TB read/write per day;
▪ Distributed data management:
  ▪ HDFS/Hadoop (MapReduce in Java);
  ▪ MetaStore (metadata management);
  ▪ Hive QL (query language on Hadoop + MetaStore);
▪ Usage:
  ▪ at least 50 engineers have run Hadoop jobs;
  ▪ 3,514 jobs weekly;
  ▪ 821 projections, 152 joins, 800 aggregates, 600 loaders weekly

Hadoop - MapReduce in Java

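MapReduce Execution Flow [Dean, J. and Ghemawat, S., 2004]

[Figure: word-count example of the MapReduce execution flow]

Input: "facebook data team" | "uses hadoop for" | "data analysis"

Map output: facebook:1 data:1 team:1 | uses:1 hadoop:1 for:1 | data:1 analysis:1

Intermediate (grouped by key): for:1 hadoop:1 team:1 uses:1 | analysis:1 data:1 data:1 facebook:1

Reduce output: analysis:1 data:2 facebook:1 for:1 hadoop:1 team:1 uses:1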
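The word-count flow on this slide can be sketched on a single machine. The snippet below is only an illustration of the map, shuffle, and reduce steps in plain Python; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# The three input splits from the slide's example.
counts = word_count(["facebook data team", "uses hadoop for", "data analysis"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; here everything runs sequentially in one process.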

Data Analysis for Business Intelligence

Data for Business Intelligence

▪ General Goal:
  ▪ support growth and monetization strategies, and product decisions
▪ User Behavior Studies
  ▪ NUX: Longitudinal study using LARS and recursive partitioning to identify features predictive of engagement;
  ▪ Identity*: Unsupervised learning over user session data to identify common usage patterns. Techniques employed include K-Means, PageRank, and dimension reduction methods;
▪ Experimentation Platform
  ▪ Columbus*: Top-level site health metrics; drill down by user groups (country, age, gender...);
  ▪ Columbus++*: A/B testing for the impact of site changes on site health metrics;
▪ Reporting System
  ▪ ad-hoc analysis done by Hive queries

* Projects that Ding Zhou participates in (underlined on the original slide).
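As an illustration of the kind of comparison an A/B test makes, the sketch below runs a two-sided two-proportion z-test on a binary site health metric. The slides do not say which statistical test Columbus++ uses, so the choice of test, the function name `two_proportion_z_test`, and the numbers in the example are all assumptions made for this sketch.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in rates between B and A."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: users who returned the next day, control vs. changed site.
z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A drill-down by user group, as in Columbus, would repeat the same comparison within each segment (country, age, gender), which raises the usual multiple-comparison concerns.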

Columbus

Geographical bird's-eye view of growth by country

Comparison between user groups

Data Analysis for “Artificial Intelligence”: predicting user social behavior

who the user will interact with

• predict interactions between friends

• features are user profile and browsing history

• tried linear models and tree models

• applied for search, newsfeed, etc

who the user hasn’t found yet

• missing edge prediction problem

• observations are friend/non-friend pairs

• features include profile and local graph info

• profile info more informative

• graph info supplemental if profile incomplete
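To make the feature setup for missing-edge prediction concrete, here is a minimal sketch that builds features for one candidate user pair from profile fields and local graph structure. The specific features (common friends, Jaccard overlap, number of matching profile fields), the data layout, and the name `pair_features` are assumptions for illustration; the slides do not list the features actually used.

```python
def pair_features(graph, profiles, u, v):
    """Build features for the candidate pair (u, v).

    graph:    {user: set of friends}
    profiles: {user: {field: value}}, where a value of None means "not filled in"
    """
    friends_u = graph.get(u, set())
    friends_v = graph.get(v, set())
    common = friends_u & friends_v
    union = friends_u | friends_v
    # Count profile fields that both users filled in with the same value.
    shared_fields = sum(
        1
        for field, value in profiles.get(u, {}).items()
        if value is not None and profiles.get(v, {}).get(field) == value
    )
    return {
        "common_friends": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "shared_profile_fields": shared_fields,
    }


# Hypothetical toy data.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}
profiles = {
    "ann": {"school": "MIT", "city": "Boston"},
    "dan": {"school": "MIT", "city": None},
}
features = pair_features(graph, profiles, "ann", "dan")
print(features)
```

Feature dictionaries like this one, labeled as friend or non-friend pairs, would then be passed to a classifier. When a profile is incomplete, as with the missing `city` above, the graph features still carry signal, which matches the slide's remark that graph information is supplemental.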

what applications the user may like*

* projects that Ding Zhou participates in;

• 33k apps, only 0.1% of them used;

• a different recommendation problem;

• prediction model not applicable, user preference unavailable;

• build a prediction model to infer “user ratings”;

• user-based + item-based recommendation

• how to combine profile, social graph, ratings?
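The item-based half of such a recommender can be sketched as follows, assuming that inferred "user ratings" are already available as numbers. The cosine similarity measure, the data layout, and the names `item_similarities` and `recommend` are assumptions for illustration; the slides do not describe how the user-based and item-based parts are actually built or combined.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """Cosine similarity between apps, from a {user: {app: rating}} dict."""
    by_app = defaultdict(dict)
    for user, apps in ratings.items():
        for app, rating in apps.items():
            by_app[app][user] = rating
    norms = {
        app: math.sqrt(sum(r * r for r in users.values()))
        for app, users in by_app.items()
    }
    sims = defaultdict(dict)
    for a in by_app:
        for b in by_app:
            if a == b:
                continue
            dot = sum(r * by_app[b][u] for u, r in by_app[a].items() if u in by_app[b])
            if dot > 0:
                sims[a][b] = dot / (norms[a] * norms[b])
    return sims


def recommend(ratings, user, top_n=3):
    """Rank unseen apps by similarity to the apps the user already rated."""
    sims = item_similarities(ratings)
    seen = ratings.get(user, {})
    scores = defaultdict(float)
    for app, rating in seen.items():
        for other, sim in sims.get(app, {}).items():
            if other not in seen:
                scores[other] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical inferred ratings (1.0 = the user is assumed to like the app).
ratings = {
    "u1": {"chess": 1.0, "poker": 1.0},
    "u2": {"chess": 1.0, "poker": 1.0, "quiz": 1.0},
    "u3": {"quiz": 1.0, "trivia": 1.0},
    "u4": {"chess": 1.0},
}
print(recommend(ratings, "u4"))
```

The all-pairs similarity loop is quadratic in the number of apps, so at the scale of 33k apps it would need sparse or distributed computation; the sketch ignores that.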

what content is interesting*

* projects that Ding Zhou participates in;

• newsfeed as the main content distribution channel

• stories generated by 100s of social actions: on the site, platform, or the Web

• <0.1% of possible stories are shown

• predictions built on story features, and user browsing history

Challenges in Data

- 100s of TBs of meaningful data available
- 1,000s of non-trivial features
- sampling not always applicable (e.g. a small app has no user data)
- prediction requirements
  ▪ models regularly applied to 10 billion novel samples
  ▪ models used on-the-fly for 100k samples in 50 ms

Special Machine Learning Problems

- use machine learning to predict user behavior
  ▪ labels: insufficient; inferred implicitly; imbalanced;
  ▪ features: high-dimensional; strongly correlated; noisy;
- scale requires distributed algorithms
  ▪ in-house implementation of tree ensemble methods (bagging predictors)
  ▪ larger training sets grant performance improvements
- speed and accuracy improvements underway
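For readers unfamiliar with bagging predictors, the following single-machine sketch shows the idea: train each base model on a bootstrap resample of the training set and combine predictions by majority vote. It is not the in-house distributed implementation mentioned on the slide. Using one-split decision stumps as the base learner, binary 0/1 labels, and the names `fit_stump`, `fit_bagged`, and `predict` are all assumptions made to keep the example short.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-split tree: pick the (feature, threshold, label) with fewest errors."""
    best = None
    n_features = len(samples[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in samples}):
            for above_label in (0, 1):
                errors = sum(
                    1
                    for x, y in samples
                    if (above_label if x[f] > threshold else 1 - above_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above_label)
    _, f, threshold, above_label = best
    return lambda x: above_label if x[f] > threshold else 1 - above_label


def fit_bagged(samples, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(bootstrap))
    return models


def predict(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the label depends on the first feature only.
data = [
    ((0.1, 5.0), 0), ((0.2, 1.0), 0), ((0.3, 4.0), 0), ((0.4, 2.0), 0),
    ((0.6, 3.0), 1), ((0.7, 5.0), 1), ((0.8, 1.0), 1), ((0.9, 2.0), 1),
]
models = fit_bagged(data)
print(predict(models, (0.05, 3.0)), predict(models, (0.95, 3.0)))
```

Because each base model is trained independently of the others, the training loop parallelizes naturally, which is one reason bagging suits distributed settings.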

tip of the iceberg

Questions?
