david lyon week3
![Page 1: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/1.jpg)
Subreddit Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
![Page 2: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/2.jpg)
2007 - Impersonal Web
![Page 3: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/3.jpg)
2017 - Personal Web
![Page 4: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/4.jpg)
Reddit Comment Dataset
2 billion comments
1 million subreddits
![Page 5: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/5.jpg)
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
![Page 6: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/6.jpg)
Data Pipeline
Ingestion / Processing / User Interface
![Page 7: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/7.jpg)
Challenge 1: Data Size
Every month on Reddit:
● Reddit is too big to cluster directly!
● The raw clustering matrix has 200 billion elements.
60k Subreddits
3 million unique authors
![Page 8: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/8.jpg)
Solution 1: Filtering
Every month on Reddit:
● Filter for activity: 100 comments/month
● Active clustering matrix has 200 million elements
● Now 1000 times faster to cluster
6k active Subreddits
30k active authors
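A minimal sketch of the activity filter, assuming the 100 comments/month threshold applies to both authors and subreddits (the slide does not say which) and that counts arrive as a hypothetical `(author, subreddit) -> comments` mapping. The names and toy data are illustrative only.

```python
from collections import Counter

def filter_active(counts, min_comments=100):
    """Keep entries whose author AND subreddit each reach min_comments.

    counts maps (author, subreddit) -> number of comments in the month.
    Totals are computed once on the unfiltered data (a single pass).
    """
    per_author = Counter()
    per_subreddit = Counter()
    for (author, subreddit), n in counts.items():
        per_author[author] += n
        per_subreddit[subreddit] += n

    return {
        (author, subreddit): n
        for (author, subreddit), n in counts.items()
        if per_author[author] >= min_comments
        and per_subreddit[subreddit] >= min_comments
    }

# Hypothetical toy month: one busy author, one occasional author.
toy_counts = {
    ("alice", "nfl"): 120,
    ("alice", "baseball"): 40,
    ("bob", "nfl"): 3,
    ("bob", "tinysub"): 2,
}
active = filter_active(toy_counts, min_comments=100)

# Matrix sizes from the slides (subreddits x authors).
raw_elements = 60_000 * 3_000_000    # 180 billion, roughly 200 billion
active_elements = 6_000 * 30_000     # 180 million, roughly 200 million
shrink_factor = raw_elements // active_elements
```

With the slide's numbers the matrix shrinks by a factor of 1000, which is where the "1000 times faster" estimate comes from if clustering cost scales with the number of matrix elements.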
![Page 9: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/9.jpg)
Challenge 2:
Every month on Reddit:
● Too many individual authors
● Need to cluster by topic, not author
30k active authors
6k active Subreddits
![Page 10: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/10.jpg)
Solution 2: PCA
Every month on Reddit:
● PCA transforms author space to topic space by finding correlations
● PCA shrinks dimensionality by another 100 times
300 shared topics
6k active Subreddits
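To illustrate how PCA turns correlated author columns into a "topic" axis, here is a small sketch that finds only the first principal component by power iteration. It is a simplification under stated assumptions: the toy matrix is made up, only one component is computed (the slides use about 300), and a real pipeline would use a library PCA or SVD.

```python
import math

def top_component(rows, iterations=200):
    """Return (component, scores) for the first principal component.

    rows: one list per subreddit, one column per author.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    # Center each author column so the component captures co-variation.
    means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]

    v = [1.0 / (j + 1) for j in range(n_cols)]  # fixed, arbitrary start vector
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix.
        xv = [sum(c[j] * v[j] for j in range(n_cols)) for c in centered]
        w = [sum(centered[i][j] * xv[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project every subreddit onto the component: one "topic" coordinate each.
    scores = [sum(c[j] * v[j] for j in range(n_cols)) for c in centered]
    return v, scores

# Hypothetical counts: authors 1-2 post about sports, authors 3-4 about fiction.
subreddits = ["Football", "Baseball", "TV", "Movies"]
counts = [
    [9, 8, 0, 1],
    [8, 9, 1, 0],
    [0, 1, 9, 8],
    [1, 0, 8, 9],
]
component, scores = top_component(counts)
```

Four author columns collapse into one coordinate in which Football and Baseball land on one side and TV and Movies on the other, which is the author-space to topic-space idea in miniature.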
![Page 11: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/11.jpg)
Challenge 3: Slow PCA
Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers.
PCA scales as O(MT)
M is the number of matrix elements
T is the number of topics after PCA
Over 80% of total time!
![Page 12: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/12.jpg)
Solution 3: Random PCA
Use Facebook Research Random PCA (2014) on a single node
FBPCA is O(M ln(T))
For 250 topics, FBPCA is 45 times faster!
One FBPCA worker is 5x faster than 9 full PCA workers.
5x faster for an average sized month
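The 45x figure follows from the two complexity estimates if constant factors are ignored, which is an assumption: the ratio of O(MT) to O(M ln(T)) is T / ln(T), and M cancels. A quick check:

```python
import math

def predicted_speedup(topics):
    """Ratio of the O(M*T) cost to the O(M*ln(T)) cost; M cancels out."""
    return topics / math.log(topics)

speedup_250 = predicted_speedup(250)  # about 45
```

This is an asymptotic estimate only; measured speedups depend on the implementation and hardware.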
![Page 13: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/13.jpg)
David Lyon
PhD Physics from the University of Illinois
I love hiking, table tennis, and Astrophysics
![Page 14: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/14.jpg)
Solution 3: Silhouette Analysis
● Silhouette Analysis reveals clustering at small k
● Also reveals a second clustering scale of around 400 clusters in this case
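For reference, a minimal sketch of the silhouette coefficient itself, assuming Euclidean distance and hypothetical toy points. Comparing cluster counts would mean running the clustering once per candidate k and comparing the mean scores; that loop is not shown here.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Requires at least two clusters.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        total += (b - a) / max(a, b)
    return total / len(points)

# Two obvious groups, scored under a good and a bad labeling.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
```

Scores near 1 indicate tight, well-separated clusters; scores near or below 0 indicate points sitting in the wrong cluster.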
![Page 15: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/15.jpg)
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
![Page 16: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/16.jpg)
Next Steps - Popular Topics by Cluster
● Find the popular topics within each cluster using Term-Frequency Inverse-Document-Frequency (TF-IDF)
● Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of reddit for that month.
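A rough sketch of the TF-IDF step under simplifying assumptions: each cluster is treated as one document, and the set of all clusters stands in for "all of reddit for that month". Tokenization is a plain whitespace split, and the cluster names and comments are made up.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Yield lowercase 1-grams and 2-grams from whitespace-split text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def tfidf_by_cluster(cluster_texts):
    """cluster_texts maps cluster name -> list of comment strings."""
    term_counts = {
        name: Counter(t for text in texts for t in ngrams(text))
        for name, texts in cluster_texts.items()
    }
    # Document frequency: in how many clusters does each term appear?
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    n_docs = len(term_counts)

    scores = {}
    for name, counts in term_counts.items():
        total = sum(counts.values())
        scores[name] = {
            term: (c / total) * math.log(n_docs / doc_freq[term])
            for term, c in counts.items()
        }
    return scores

# Hypothetical comments for two clusters.
toy_clusters = {
    "sports": ["great touchdown pass", "the touchdown was great"],
    "film": ["great movie", "the movie plot"],
}
tfidf = tfidf_by_cluster(toy_clusters)
top_sports = max(tfidf["sports"], key=tfidf["sports"].get)
top_film = max(tfidf["film"], key=tfidf["film"].get)
```

Terms that appear in every cluster, such as "great" here, score zero, so the ranking surfaces what is distinctive about each cluster rather than what is merely frequent.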
![Page 17: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/17.jpg)
Challenge 3: Finding K for K-Means
● Number of clusters is not the same as number of PCA topics
● Clustering can happen on more than one scale
Football
Baseball
TV
Movies
![Page 18: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/18.jpg)
K-Means Clustering
Nearby subreddits in feature space…
Become clustered!
Football
Baseball
TV
Movies
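A minimal sketch of K-means (Lloyd's algorithm) on the four example subreddits, using the topic coordinates shown on the "After PCA" slide. Seeding the centers from the first and third points is an arbitrary, hypothetical choice made here to keep the run deterministic; real runs would use random or k-means++ initialization.

```python
import math

def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

names = ["Football", "Baseball", "TV", "Movies"]
# (Sporting, Fictional, Political) coordinates from the "After PCA" slide.
features = [(80, 2, 1), (90, 3, 2), (6, 80, 77), (2, 80, 20)]
labels, centers = kmeans(features, centers=[features[0], features[2]])
```

Football and Baseball end up in one cluster and TV and Movies in the other, matching the picture of nearby subreddits in feature space becoming clustered.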
![Page 19: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/19.jpg)
Random PCA
● Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
● Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions (Halko, Martinsson, and Tropp, 2009)
● Fast Randomized SVD (Facebook Research, 2014)
● Complexity of Random PCA is O(mn ln(k))
● For k=100, Random PCA is more than 20x faster!
![Page 20: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/20.jpg)
Before PCA
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 ... Auth999,999 Auth1,000,000
![Page 21: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/21.jpg)
After PCA
| Sub | Sporting | Fictional | Political |
|---|---|---|---|
| Football | 80 | 2 | 1 |
| Baseball | 90 | 3 | 2 |
| TV | 6 | 80 | 77 |
| Movies | 2 | 80 | 20 |
![Page 22: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/22.jpg)
Anatomy of a Reddit Comment
Body, Author, Date, Subreddit
Group by Month
Group by Subreddit
Count #comments by author per subreddit
Normalize authors so each author has mean=0 and variance = 1
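The steps above can be sketched in plain Python. The field names (`author`, `subreddit`, `date`, `body`) and the `YYYY-MM-DD` date format are assumptions for illustration, not the dataset's actual schema, and a real run over billions of comments would use Spark rather than in-memory dictionaries.

```python
import math
from collections import Counter, defaultdict

def author_counts_by_month(comments):
    """Count comments per (subreddit, author) pair within each month.

    comments: iterable of dicts with 'author', 'subreddit', 'date' keys.
    Returns {month: Counter({(subreddit, author): n_comments})}.
    """
    by_month = defaultdict(Counter)
    for c in comments:
        month = c["date"][:7]  # 'YYYY-MM' prefix of the date string
        by_month[month][(c["subreddit"], c["author"])] += 1
    return by_month

def normalize_author(column):
    """Scale one author's per-subreddit counts to mean 0 and variance 1."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    if var == 0:
        return [0.0 for _ in column]  # a constant column carries no signal
    std = math.sqrt(var)
    return [(x - mean) / std for x in column]

# Hypothetical comments.
toy_comments = [
    {"author": "alice", "subreddit": "nfl", "date": "2017-01-03", "body": "go team"},
    {"author": "alice", "subreddit": "nfl", "date": "2017-01-09", "body": "nice catch"},
    {"author": "alice", "subreddit": "movies", "date": "2017-01-20", "body": "good plot"},
    {"author": "bob", "subreddit": "movies", "date": "2017-02-01", "body": "bad ending"},
]
monthly = author_counts_by_month(toy_comments)
normalized = normalize_author([2, 1, 0])  # one author's counts in 3 subreddits
```

Normalizing each author column keeps very prolific authors from dominating the later PCA step.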
![Page 23: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/23.jpg)
Growth in Number of Subreddits
40 subreddits
1 million subreddits
![Page 24: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/24.jpg)
Week 4 Challenges
● Spark for iterative machine learning because Spark can mapreduce in memory
● By reducing the dimension of data,
● No streaming - clustering requires lots of data & clusters change slowly, but time window reduced from monthly to daily
![Page 25: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/25.jpg)
Clustering is Universal
● Galaxies cluster into superclusters of ~100k members
● The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
![Page 26: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/26.jpg)
Subreddit Clustering
● Monthly graph from 10k subreddits X 2 million authors = 20 billion matrix entries
● Drastically reduce the size of data using Principal Component Analysis, normalized so that larger subreddits aren’t favored
● Cluster in reduced dimensional space using K-means
● Topics within Clusters based on relative frequency of 1-grams and 2-grams
![Page 27: David lyon week3](https://reader034.vdocuments.net/reader034/viewer/2022042618/58a26ed21a28ab94628b54a9/html5/thumbnails/27.jpg)
Social media brings us closer
● Continual contact with over 1 billion people
● We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend or ban from community!
● Online communities become bubbles isolated from each other