Final Project Analyzing Reddit Data to Determine Popularity
Project Background: The Problem
Problem: Predict post popularity where the target/label is based on a transformed score metric
Algorithms / Models Applied:
• SVC
• Random Forests
• Logistic Regression
Project Background: The Data
Data: The top 1,000 posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count. The data was pulled during August 2013 and broken out into 2,500 .csv files, one per subreddit.
Data Structure (22 Columns):
• created_utc - Float
• score - Integer
• domain - Text
• id - Integer
• title - Text
• author - Text
• ups - Integer
• downs - Integer
• num_comments - Integer
• permalink (aka the reddit link) - Text
• self_text (aka body copy) - Text
• link_flair_text - Text
• over_18 - Boolean
• thumbnail - Text
• subreddit_id - Integer
• edited - Boolean
• link_flair_css_class - Text
• author_flair_css_class - Text
• is_self - Boolean
• name - Text
• url - Text
• distinguished - Text
Project Background: The Data - Removed
[Slide: repeats the data description and the 22-column list from the previous slide; the marking of which columns were removed is not recoverable from this transcript]
Reviewing the Data: Subreddit Topics
AnimalsWithoutNecks
BirdsBeingDicks
CemeteryPorn
CoffeeWithJesus
datasets
dataisbeautiful
FortPorn
learnpython
MachineLearning
misleadingthumbnails
Otters
PenmanshipPorn
PowerWashingPorn
ShowerBeer
StonerPhilosophy
talesfromtechsupport
TreesSuckingAtThings
Reviewing the Data: Top Domains
[Bar chart: Domain Count. Domains shown: imgur.com, youtube.com, reddit.com, flickr.com, soundcloud.com, quickmeme.com, i.minus.com, twitter.com, amazon.com, qkme.com, vimeo.com, wikipedia.org, nytimes.com, guardian.co.uk, bbc.co.uk]
• Imgur: 773,969
• YouTube: 188,526
• Reddit: 25,445
• Flickr: 17,854
• Soundcloud: 10,397
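Counts like the ones above come from a frequency count on the `domain` column. A minimal sketch, assuming the posts are loaded into a pandas DataFrame; the rows below are made-up stand-ins, not the project's data.

```python
import pandas as pd

# Toy stand-in for the posts table; only the `domain` column matters here.
posts = pd.DataFrame({
    "domain": ["imgur.com", "youtube.com", "imgur.com", "reddit.com",
               "imgur.com", "youtube.com", "flickr.com"],
})

# Count posts per domain, most frequent first.
domain_counts = posts["domain"].value_counts()

# Keep the top N domains (the slide shows 15; 3 is used for the toy data).
top_domains = domain_counts.head(3)
print(top_domains)
```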
Reviewing the Data: Most Have No Body Text
Posts rely primarily on the title and some related media content from the aforementioned domains: a link, GIF, image, video, etc.
Over 1.6 million posts, approximately 74% of all posts, had no body copy/text (the field contained a NaN value)
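The missing-body share can be measured directly on the `self_text` column. This is a small sketch on invented rows, assuming missing body text is stored as NaN as the slide describes.

```python
import numpy as np
import pandas as pd

# Toy rows; NaN marks a post with no body text.
posts = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "self_text": [np.nan, "some body copy", np.nan, np.nan],
})

missing = posts["self_text"].isna()   # True where the body is NaN
n_missing = int(missing.sum())        # number of posts without body text
share_missing = missing.mean()        # fraction of posts without body text
print(f"{n_missing} posts without body text ({share_missing:.0%})")
```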
Reviewing the Data: Time Based Data
[Bar chart: number of posts by month, January through December; y-axis 0 to 300,000]
Winter Months Saw a Dip, Fall Could be Underrepresented Given Data Pulled in August
Reviewing the Data: Time Based Data
Tuesday is Slightly the Favorite Day to Post, While the Weekend Sees a Dip
[Bar chart: number of posts by day of week, Monday through Sunday; y-axis 0 to 400,000]
Reviewing the Data: Time Based Data
Reddit While You Work: Post Volume Picks Up Around 9/10am, Peaking at 12pm and Then Dropping Off Throughout the Afternoon
[Bar chart: number of posts by hour of day, 12am through 11pm; y-axis 0 to 160,000]
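The month, weekday, and hour breakdowns on these three slides can all be derived from the `created_utc` column. A sketch under the assumption that `created_utc` holds Unix timestamps in seconds; the three timestamps below are illustrative. The hours come out in UTC here, and the deck does not say which time zone its hourly chart uses.

```python
import pandas as pd

# Three illustrative Unix timestamps (seconds) from August 2013.
posts = pd.DataFrame({"created_utc": [1375315200.0, 1375747200.0, 1376136000.0]})

# Convert epoch seconds to timezone-aware datetimes.
ts = pd.to_datetime(posts["created_utc"], unit="s", utc=True)

posts["month"] = ts.dt.month_name()   # e.g. "August"
posts["weekday"] = ts.dt.day_name()   # e.g. "Tuesday"
posts["hour"] = ts.dt.hour            # 0-23, in UTC

# Post volume per weekday; the same pattern works for month and hour.
posts_per_weekday = posts["weekday"].value_counts()
print(posts[["month", "weekday", "hour"]])
```

To report hours in a local zone instead, `ts.dt.tz_convert(...)` can be applied before extracting the hour.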
Reviewing the Data: Determining Popularity
[Histogram: Score Counts by score bin (50-99, 100-199, 200-299, 300-399, 400-499, 500-999, 1000-4999, 5000-9999, 10000+); y-axis 0 to 200,000]
~15% of posts
Note: The histogram covers only about half the data. IPython was unable to run a histogram, so the data had to be exported and charted in Excel.
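The score bins in the histogram, and a binary popularity label built from the score, could be produced in pandas roughly as follows. The bin edges mirror the chart's axis labels. The cutoff of 1000 is a hypothetical value chosen for illustration, since the deck does not state the threshold or the exact transformation it used, and the scores are invented.

```python
import numpy as np
import pandas as pd

# Invented scores, for illustration only.
scores = pd.Series([55, 120, 250, 480, 760, 1500, 5200, 12000, 90, 310])

# Left-closed bins matching the chart labels (50-99, 100-199, ...).
edges = [50, 100, 200, 300, 400, 500, 1000, 5000, 10000, np.inf]
labels = ["50-99", "100-199", "200-299", "300-399", "400-499",
          "500-999", "1000-4999", "5000-9999", "10000+"]
score_bins = pd.cut(scores, bins=edges, labels=labels, right=False)
bin_counts = score_bins.value_counts(sort=False)

# Hypothetical cutoff: posts at or above it are labelled popular (1).
threshold = 1000
is_popular = (scores >= threshold).astype(int)
print(bin_counts)
print("share popular:", is_popular.mean())
```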
Analyzing the Data: Issues
Issue: The initial data set is large (2.5 million rows), and transforming it with CountVectorizer and TF-IDF expanded it to almost 100,000 columns, which made it difficult to process locally on my machine. In the end I was only able to run about 1% of the data through the algorithms.
• Even with this smaller subset, processes could take anywhere from 30 minutes to several hours, which made experimenting with the data extremely hard
Future: Explore platforms that are better at handling large data sets, such as PySpark. I tried to process the data with PySpark but ran into technical issues that I couldn’t address in time.
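The sampling and vectorizing step described above might look like the sketch below. The titles are synthetic, the 1% sample uses pandas' `sample`, and `max_features` is an added assumption (a cap on vocabulary size that the deck does not mention) shown as one way to bound the column count. scikit-learn returns a SciPy sparse matrix here, which keeps memory use far below a dense array of the same shape.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic titles standing in for the real post titles.
words = ["cat", "dog", "otter", "python", "data", "coffee", "bird", "tree"]
titles = pd.Series(
    [f"{words[i % 8]} {words[(i * 3 + 1) % 8]} post {i}" for i in range(1000)]
)

# Work on a reproducible 1% sample, as the slide describes.
sample = titles.sample(frac=0.01, random_state=0)

# max_features is an assumption: it caps the vocabulary (and column count).
vectorizer = TfidfVectorizer(max_features=50_000)
X = vectorizer.fit_transform(sample)

# Fraction of non-zero cells; text features are typically very sparse.
density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, f"density={density:.3f}")
```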
Analyzing the Data: SVC
[Bar chart: Accuracy by kernel (Linear, Poly, Sigmoid, RBF); y-axis 0.88 to 0.94]
[Chart: Accuracy w/ Linear Kernel at values 0.001, 0.01, 0.1; y-axis 0.920 to 0.938]
Linear = .9368
C Value of .1 = .9363
from sklearn import svm
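A kernel comparison like the one charted on this slide can be sketched with `sklearn.svm.SVC`. The data below is synthetic (`make_classification`), so the accuracies will not match the slide; the train/test split and `C=0.1` are illustrative choices rather than the project's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data in place of the vectorized posts.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one model per kernel and record held-out accuracy.
kernel_accuracy = {}
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    model = SVC(kernel=kernel, C=0.1)
    model.fit(X_train, y_train)
    kernel_accuracy[kernel] = model.score(X_test, y_test)

print(kernel_accuracy)
```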
Analyzing the Data: Random Forests
from sklearn import ensemble
[Chart: accuracy by N Estimators (5, 10, 20, 50, 100, 125, 150); y-axis 0.885 to 0.925]
N Estimators = 125: .922
[Chart: accuracy by Max Depth (5, 40, 100, 150, 200, 250, 300); y-axis 0.885 to 0.93]
Max Depth = 250: .924
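The two parameter sweeps on this slide can be sketched with `sklearn.ensemble.RandomForestClassifier`. The data is synthetic and the grids are shortened, so this only illustrates the procedure, not the reported numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data in place of the vectorized posts.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Sweep the number of trees.
estimator_accuracy = {}
for n in [5, 10, 20, 50]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    estimator_accuracy[n] = forest.score(X_test, y_test)

# Sweep the maximum tree depth with the tree count held fixed.
depth_accuracy = {}
for depth in [5, 40, 100]:
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    forest.fit(X_train, y_train)
    depth_accuracy[depth] = forest.score(X_test, y_test)

print(estimator_accuracy)
print(depth_accuracy)
```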
Analyzing the Data: Logistic
[Chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.925 to 0.95]
C Value of 1 = .9471
[Bar chart: accuracy by penalty (L1, L2); y-axis 0.9469 to 0.9474]
L1 = .947733
L2 = .947066
from sklearn import linear_model
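The C sweep and the L1 versus L2 comparison could be run along these lines. This is a sketch on synthetic data; the `liblinear` solver is an assumption, picked because it supports both penalties, and argument names may differ across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data in place of the vectorized posts.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Sweep the inverse regularization strength C with an L2 penalty.
c_accuracy = {}
for C in [0.001, 0.01, 0.1, 1, 10, 50]:
    clf = LogisticRegression(C=C, penalty="l2", solver="liblinear")
    clf.fit(X_train, y_train)
    c_accuracy[C] = clf.score(X_test, y_test)

# Compare L1 and L2 penalties at C = 1.
penalty_accuracy = {}
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(C=1, penalty=penalty, solver="liblinear")
    clf.fit(X_train, y_train)
    penalty_accuracy[penalty] = clf.score(X_test, y_test)

print(c_accuracy)
print(penalty_accuracy)
```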
Totally Crushing It!
Analyzing the Data: Classification Report
Random Forests
Logistic Regression
SVC
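The reports themselves did not survive in this transcript. As a general illustration of why a classification report can tell a different story than accuracy, consider a target where roughly 15% of posts are popular: a baseline that always predicts "not popular" already scores about 85% accuracy while finding no popular posts. The sketch below uses invented labels and scikit-learn's `DummyClassifier`; it does not reproduce the project's reports.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Invented labels: 85 unpopular (0) and 15 popular (1) posts.
y = np.array([0] * 85 + [1] * 15)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
popular_recall = recall_score(y, y_pred, pos_label=1, zero_division=0)

print("accuracy:", baseline_accuracy)
print("recall on popular posts:", popular_recall)
print(classification_report(y, y_pred, zero_division=0))
```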
Soooo Not Crushing It
Feature Reduction: Accuracy
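All Features (values match the accuracies on the earlier model slides):
• Random Forests - All Features: 92.4%
• Logistic Regression - All Features: 94.71%
• SVC - All Features: 93.63%
Reduced Features (Random Forests, Logistic Regression, SVC): 94.3%, 94.5%, 95.2% (the pairing of values to models is not recoverable from this transcript)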
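The deck does not say how the features were reduced. Purely as an assumption, one common approach for text features is univariate selection with a chi-squared test, sketched here with scikit-learn's `SelectKBest`; the titles, labels, and `k=5` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Invented titles and popularity labels, for illustration only.
titles = [
    "cute otter holding hands", "python tutorial for beginners",
    "otter eats breakfast", "machine learning with python",
    "bird steals sandwich", "dataset of coffee prices",
    "power washing a driveway", "learn data analysis fast",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Full TF-IDF feature matrix (one column per vocabulary term).
X_all = TfidfVectorizer().fit_transform(titles)

# Keep the k terms most associated with the label (chi-squared score).
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X_all, labels)

print("all features:", X_all.shape[1])
print("reduced features:", X_reduced.shape[1])
```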
Feature Reduction: Classification Report
Random Forests - All Features
Logistic Regression - All Features
SVC - All Features
Random Forests - Reduced Features
Logistic Regression - Reduced Features
SVC - Reduced Features
Next Steps
Dealing with the processing issues:
• Learn and try out PySpark
Answer some additional questions:
• Reevaluate how I handle the domains
  • I originally bucketed domains by their frequency in the data set. However, the originating domain of the content and the title make up most of a “post”, and the top 15 domains account for the vast majority of posts, so I want to focus on posts from those ~15 domains to get a better picture of how they affect popularity
• Run the data with varying n-gram levels
  • I tried them, but they expanded the columns to hundreds of thousands, which just seemed to freeze, so hopefully something like PySpark will help with the processing
• Predict subreddit/category questions:
  • Can I predict the category of a post?
  • Do certain subreddits produce more popular content overall than others? Bears With Beaks vs. ggggg (whatever the hell that is)
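The column blow-up from higher n-gram levels is easy to see on a tiny corpus. A sketch using `CountVectorizer`'s `ngram_range` parameter on invented titles; on millions of real titles the growth is far steeper.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented titles, for illustration only.
titles = [
    "cute otter holding hands",
    "python tutorial for beginners",
    "bird steals sandwich from tourist",
    "power washing a dirty driveway",
]

# Vocabulary size when using unigrams only, up to bigrams, up to trigrams.
vocab_sizes = {}
for max_n in [1, 2, 3]:
    vectorizer = CountVectorizer(ngram_range=(1, max_n))
    vectorizer.fit(titles)
    vocab_sizes[max_n] = len(vectorizer.vocabulary_)

print(vocab_sizes)
```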
APPENDIX
Reviewing the Data: Reevaluate Popularity
[Histogram: Score Counts by score bin (50-99, 100-199, 200-299, 300-399, 400-499, 500-999, 1000-4999, 5000-9999, 10000+); y-axis 0 to 200,000]
~12% of posts
~8% of posts
Note: The histogram covers only about half the data. IPython was unable to run a histogram, so the data had to be exported and charted in Excel.
Analyzing the Data: SVC
[Chart: Accuracy Score by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.688 to 0.71]
C Value of .1 = 0.7077
[Figure: Confusion Matrix]
Analyzing the Data: Random Forest
[Chart: Accuracy Score by N Estimators (5, 10, 20, 50, 100, 125); y-axis 0.77 to 0.83]
N Estimators of 100 = 0.8218
[Chart: Accuracy Score by Max Depth (40, 100, 150, 200, 250); y-axis 0.775 to 0.83]
Max Depth of 200 = 0.8247
[Figure: Confusion Matrix]
Analyzing the Data: Logistic
C = 1, Penalty = L2
[Chart: accuracy by C value (0.001, 0.01, 0.1, 1, 10, 50); y-axis 0.81 to 0.85]
C of 1 = .8453
[Figure: Confusion Matrix]
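The confusion matrices on these appendix slides were images and are not reproduced here. For reference, this is how a confusion matrix and the matching accuracy score are computed with scikit-learn, using invented labels and predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true labels and predictions (1 = popular, 0 = not popular).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
print(cm)
print("accuracy:", accuracy)
```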