david lyon week3

Subreddit Subcultures

Insight Data Engineering Fellowship, Silicon Valley

David Lyon

2007 - Impersonal Web

2017 - Personal Web

Reddit Comment Dataset

2 billion comments

1 million subreddits

Personalization of Reddit Over Time

Reddit Clustering App

https://youtu.be/XHczo0TM17E

http://redditcluster.us

Data Pipeline

Ingestion / Processing / User Interface

Challenge 1: Data Size

Every month on Reddit:

60k Subreddits

3 million unique authors

● Reddit is too big to cluster directly!

● The raw clustering matrix has 200 billion elements.

Solution 1: Filtering


● Filter for activity: 100 comments/month

● Active clustering matrix has 200 million elements

● Now 1000 times faster to cluster

6k active Subreddits

30k active authors
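The filtering step could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes one month of comments is available as (author, subreddit) pairs, and it assumes the 100 comments/month threshold applies to subreddits and authors alike, which the slide does not specify. The function names are hypothetical.

```python
from collections import Counter

def filter_active(comments, min_comments=100):
    """Keep comments whose subreddit and author both meet the activity threshold.

    comments: iterable of (author, subreddit) pairs for one month.
    Assumption: the same threshold applies to subreddits and to authors.
    """
    comments = list(comments)
    per_subreddit = Counter(sub for _, sub in comments)
    per_author = Counter(author for author, _ in comments)
    return [
        (author, sub)
        for author, sub in comments
        if per_subreddit[sub] >= min_comments and per_author[author] >= min_comments
    ]

def matrix_elements(comments):
    """Size of the dense subreddit-by-author matrix implied by the comments."""
    authors = {author for author, _ in comments}
    subreddits = {sub for _, sub in comments}
    return len(subreddits) * len(authors)
```

With the figures on this slide, 6,000 subreddits times 30,000 authors gives 180 million elements, which rounds to the 200 million quoted above.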

Challenge 2:


● Too many individual authors

● Need to cluster by topic, not author

30k active authors


Solution 2: PCA


● PCA transforms author space to topic space by finding correlations

● PCA shrinks dimensionality by another 100 times

300 shared topics
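As a rough illustration of the idea, PCA on a small subreddit-by-author matrix can be written with a plain SVD. This is a sketch assuming NumPy and a dense in-memory matrix; the function name `pca_topics` is hypothetical and this is not the Spark implementation used in the project.

```python
import numpy as np

def pca_topics(counts, n_topics):
    """Project a subreddit-by-author matrix onto its top principal components.

    counts: array of shape (n_subreddits, n_authors).
    Returns an array of shape (n_subreddits, n_topics).
    """
    X = np.asarray(counts, dtype=float)
    X = X - X.mean(axis=0)  # center each author column
    # Rows of Vt are orthonormal directions in author space, ordered by variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_topics].T
```

Authors who comment in the same subreddits are correlated, so they collapse onto the same direction. In a toy matrix where two authors post only in sports subreddits and two only in TV and movie subreddits, a single component already separates the two groups.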


Challenge 3: Slow PCA

Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.

PCA scales as O(MT)

M is the number of matrix elements

T is the number of topics after PCA

Over 80% of total time!

Solution 3: Random PCA

Use Facebook Research Random PCA (2014) on a single node

fbpca is O(M ln(T))

For 250 topics, fbpca is 45 times faster! One fbpca worker is 5x faster than 9 full PCA workers.

5x faster for an average sized month
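The quoted speedups follow from the ratio of the two complexity estimates, O(MT) against O(M ln(T)), which reduces to T / ln(T). A short check, ignoring constant factors (the function name is illustrative):

```python
import math

def estimated_speedup(n_topics):
    """Ratio of O(M*T) to O(M*ln(T)) cost, ignoring constant factors."""
    return n_topics / math.log(n_topics)

print(round(estimated_speedup(250), 1))  # 45.3
print(round(estimated_speedup(100), 1))  # 21.7
```

Spreading the full PCA over 9 workers recovers at most a factor of 9, which leaves the single-node randomized version about 45 / 9 = 5 times faster.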

David Lyon

PhD Physics from the University of Illinois

I love hiking, table tennis, and Astrophysics

Solution 3: Silhouette Analysis

● Silhouette Analysis reveals clustering at small k

● Also reveals a second clustering scale of around 400 clusters in this case
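For reference, the silhouette coefficient compares each point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), scoring (b - a) / max(a, b). The sketch below is a plain-Python version under the assumption of Euclidean distance and at least two clusters; it is meant to show the calculation, not to scale to thousands of subreddits.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Assumes at least two distinct labels.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    def mean_dist(point, members):
        return sum(math.dist(point, other) for other in members) / len(members)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(point, members)
                for other_label, members in clusters.items() if other_label != label)
        total += (b - a) / max(a, b) if max(a, b) else 0.0
    return total / len(points)
```

Running K-means for a range of k and plotting this score against k is one way to look for the peaks described above.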

Next Steps - Random PCA for Spark.ml

Step 1: Learn Scala!

Step 2: Contribute to Open Source community

Step 3: Streaming Random PCA?

Next Steps - Popular Topics by Cluster

● Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF)

● Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of reddit for that month.
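A minimal sketch of this scoring is shown below. It assumes whitespace tokenization, treats each comment as one document for the document-frequency count (the slide does not define the document unit), and expects the cluster's comments to be included in the full month's comments. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Return the 1-grams and 2-grams of a whitespace-tokenized comment."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

def top_terms(cluster_comments, all_comments, top=3):
    """Rank a cluster's n-grams by term frequency times inverse document frequency.

    Assumption: each comment is one document, and cluster_comments is a
    subset of all_comments so every term has a nonzero document frequency.
    """
    tf = Counter(g for c in cluster_comments for g in ngrams(c))
    df = Counter(g for c in all_comments for g in set(ngrams(c)))
    n_docs = len(all_comments)
    scores = {g: count * math.log(n_docs / df[g]) for g, count in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Terms that appear everywhere, such as "the", score zero, while terms concentrated in one cluster rise to the top.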

Challenge 3: Finding K for K-Means

● Number of clusters is not the same as number of PCA topics

● Clustering can happen on more than one scale

Football

Baseball

TV

Movies

K-Means Clustering

Nearby subreddits in feature space…

Become clustered!

Football

Baseball

TV

Movies
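The clustering step can be illustrated with a small version of Lloyd's algorithm, the standard K-means iteration. This is a plain-Python sketch with random initialization from the data points; the project itself ran K-means on Spark, and the names and defaults here are assumptions.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: returns a cluster label for each point.

    points: list of equal-length tuples.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels
```

Applied to the four toy rows from the After PCA slide with k=2, Football and Baseball end up in one cluster and TV and Movies in the other.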

Random PCA

● Complexity of PCA is O(mnk) for m rows, n input columns, k output columns

● Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (Halko, Martinsson, and Tropp, 2009)

● Fast Randomized SVD (Facebook Research, 2014)

● Complexity of Random PCA is O(mn ln(k))

● For k=100, Random PCA is more than 20x faster!
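The core of the randomized approach can be sketched in a few lines of NumPy: project the matrix onto a small random subspace, orthonormalize, and run an exact SVD on the much smaller projected matrix. This is an illustrative sketch, not the fbpca source. It uses a Gaussian test matrix for simplicity, whereas the O(mn ln(k)) bound relies on a structured random transform, and the `oversample` and `n_iter` defaults are assumptions.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k principal components of X with a random projection.

    Returns (U, s, Vt) with shapes (m, k), (k,), (k, n).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # center columns, as PCA requires
    # Sample the range of X with a random Gaussian test matrix.
    Q = X @ rng.standard_normal((X.shape[1], k + oversample))
    # Power iterations sharpen the estimate when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Q)
        Q, _ = np.linalg.qr(X.T @ Q)
        Q = X @ Q
    Q, _ = np.linalg.qr(Q)
    # Exact SVD of the small projected matrix, then map back to the original rows.
    Ub, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The expensive exact SVD only ever sees a matrix with k + oversample rows, which is where the savings over full PCA come from.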

Before PCA

Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth999,999 Auth1,000,000

Football 2 1

Baseball 3 1 15

TV 5 2 22

Movies 1 21 1 2

After PCA

Sub       Sporting  Fictional  Political
Football  80        2          1
Baseball  90        3          2
TV        6         80         77
Movies    2         80         20

Anatomy of a Reddit Comment

Body, Author, Date, Subreddit

Group by Month

Group by Subreddit

Count #comments by author per subreddit

Normalize authors so each author has mean=0 and variance = 1
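The counting and normalization steps above might look like this in plain Python. It is a sketch under stated assumptions: comments are dicts with hypothetical `month`, `subreddit`, and `author` keys, the matrix fits in memory, and authors with identical counts in every subreddit are mapped to zero because their variance cannot be scaled to 1.

```python
from collections import Counter
from statistics import mean, pstdev

def author_counts(comments):
    """Count comments per (month, subreddit, author).

    comments: iterable of dicts with 'month', 'subreddit', and 'author' keys
    (hypothetical field names).
    """
    return Counter((c["month"], c["subreddit"], c["author"]) for c in comments)

def normalize_authors(matrix):
    """Scale each author column of a subreddit-by-author matrix to mean 0, variance 1."""
    columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        # Authors with identical counts everywhere carry no signal; map them to 0.
        columns.append([(v - mu) / sigma if sigma else 0.0 for v in column])
    return [list(row) for row in zip(*columns)]
```

Normalizing per author keeps a handful of very prolific commenters from dominating the principal components.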

Growth in Number of Subreddits

40 subreddits

1 million subreddits

Week 4 Challenges

● Spark for iterative machine learning because Spark can mapreduce in memory

● By reducing the dimension of data,

● No streaming - clustering requires lots of data & clusters change slowly, but time window reduced from monthly to daily

Clustering is Universal

● Galaxies cluster into superclusters of ~100k members

● The red dot is our galaxy

● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine.

● The big blob to the upper left is Liberal Arts.

Subreddit Clustering

● Monthly graph from 10k subreddits × 2 million authors = 20 billion matrix entries

● Drastically reduce the size of data using Principal Component Analysis, normalized so that larger subreddits aren’t favored

● Cluster in reduced dimensional space using K-means

● Topics within Clusters based on relative frequency of 1-grams and 2-grams

Social media brings us closer

● Continual contact with over 1 billion people

● We can find people who share our exact interests

...and separates us

● Less tolerance for differences - unfriend or ban from community!

● Online communities become bubbles isolated from each other
