Popularity of Online News Article
![Page 1: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/1.jpg)
Online News Popularity Dataset
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel
![Page 2: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/2.jpg)
01
Introduction
![Page 3: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/3.jpg)
Introduction
• Created to analyze how the number of shares depends on an article's attributes and to predict whether an article will be popular online.
• 39,644 observations
• 61 attributes
• Source: Mashable website, collected over a 2-year period from Jan 2013 to Jan 2015
• No missing values, but some topics were unclassified
• Target: number of shares
![Page 4: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/4.jpg)
02
Data Set Introduction
![Page 5: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/5.jpg)
Data Set Introduction
Data accuracy
Data Set
Website
843,330 shares
12 videos
128 videos
792 shares
0 videos
12 videos
![Page 6: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/6.jpg)
Attributes
![Page 7: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/7.jpg)
LDA
The Latent Dirichlet Allocation (LDA) algorithm was applied to all Mashable texts (known before publication) to first identify the five most relevant topics and then measure the closeness of each article to those topics.
• The topics were named LDA_00 … LDA_04 (undefined topics)
• The LDA values add up to one per observation
• Maximum LDA impurity → overall low shares
• Mean: 1,660 vs. 3,395 (overall)
• Median: 1,100 vs. 1,400 (overall)
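The slides do not say how "LDA impurity" was computed. As a minimal sketch, assuming a Gini-style impurity over the five topic weights (one plausible choice, not necessarily the presenters' formula), the idea can be illustrated as follows. The function name and the two example rows are hypothetical.

```python
from math import isclose


def gini_impurity(topic_weights):
    """Gini impurity of one article's topic mix: 1 - sum(p_i ** 2).

    Assumption: the slides do not define "LDA impurity"; Gini impurity
    is used here only as an illustration. A value of 0 means the article
    belongs to a single topic; a perfectly even mix over five topics
    gives the maximum, 1 - 5 * 0.2 ** 2.
    """
    total = sum(topic_weights)
    # The slide states that the LDA values add up to one per observation.
    if not isclose(total, 1.0, abs_tol=1e-3):
        raise ValueError(f"topic weights should sum to 1, got {total}")
    return 1.0 - sum(p * p for p in topic_weights)


# Hypothetical LDA_00 .. LDA_04 values for two made-up articles.
focused = [0.92, 0.02, 0.02, 0.02, 0.02]
mixed = [0.2, 0.2, 0.2, 0.2, 0.2]

focused_impurity = gini_impurity(focused)
mixed_impurity = gini_impurity(mixed)
print(focused_impurity < mixed_impurity)
```

Under this assumed definition, an article spread evenly across all five topics scores highest, which corresponds to the "maximum impurity" case on the slide.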
![Page 8: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/8.jpg)
03
Data Modification And Models
![Page 9: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/9.jpg)
Data Modification: Recoding
Data channel
0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World
Date of publication
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
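The recoding above can be sketched as collapsing the one-hot indicator columns into a single code per article. This is an illustration only: the indicator column names are assumed from the public UCI release of the dataset, the output names `summary_channel_value` and `summary_weekday` are assumed to be the recoded columns listed later in the deck, and treating "no channel flag set" as code 0 (Viral) is also an assumption.

```python
# Code tables copied from the slide. The indicator column names are an
# assumption based on the public UCI release of the dataset.
CHANNEL_CODES = {
    "data_channel_is_lifestyle": 1,
    "data_channel_is_entertainment": 2,
    "data_channel_is_bus": 3,
    "data_channel_is_socmed": 4,
    "data_channel_is_tech": 5,
    "data_channel_is_world": 6,
}
WEEKDAY_CODES = {
    "weekday_is_monday": 1,
    "weekday_is_tuesday": 2,
    "weekday_is_wednesday": 3,
    "weekday_is_thursday": 4,
    "weekday_is_friday": 5,
    "weekday_is_saturday": 6,
    "weekday_is_sunday": 7,
}


def recode(row, codes, default):
    """Return the code of the first indicator equal to 1, else `default`."""
    for column, code in codes.items():
        if row.get(column, 0) == 1:
            return code
    return default


def summarize(row):
    """Collapse one article's indicator columns into two coded values."""
    return {
        # Assumption: articles with no channel flag set are coded 0 (Viral).
        "summary_channel_value": recode(row, CHANNEL_CODES, default=0),
        "summary_weekday": recode(row, WEEKDAY_CODES, default=None),
    }


# Hypothetical rows, not taken from the dataset.
tech_friday = summarize({"data_channel_is_tech": 1, "weekday_is_friday": 1})
no_channel_sunday = summarize({"weekday_is_sunday": 1})
print(tech_friday, no_channel_sunday)
```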
![Page 10: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/10.jpg)
Conference Paper
• Shares: max 843,300, mean 3,395.380, standard deviation 11,626.951, median 1,400
• Attribute popularity: shares <= 1,400 unpopular; shares > 1,400 popular
• Avoided dealing with a class imbalance problem
• Made it into a binary problem: Popular or Unpopular
• AUC = 0.73
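A minimal sketch of the binary setup described on this slide: label articles with the 1,400-share threshold, then score a classifier with AUC. The AUC here uses the pairwise-ranking definition. The share counts and classifier scores are made up for illustration and are not from the dataset or the paper.

```python
def label_popular(shares, threshold=1400):
    """1 = popular (shares > threshold), 0 = unpopular (shares <= threshold)."""
    return [1 if s > threshold else 0 for s in shares]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen popular article gets a
    higher score than a randomly chosen unpopular one; ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up share counts and hypothetical classifier scores.
shares = [593, 1400, 1401, 2500, 711, 18000]
scores = [0.2, 0.6, 0.5, 0.8, 0.3, 0.9]

labels = label_popular(shares)
model_auc = auc(labels, scores)
print(labels, round(model_auc, 3))
```

Splitting at the median puts roughly half of the articles in each class, which is why the slide notes that class imbalance was avoided.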
![Page 11: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/11.jpg)
Model 1
• 1500 trees
• All attributes
![Page 12: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/12.jpg)
Models - Chosen Attributes
Subjective Opinion / Random Forest Importance / Highly Correlated (w/ shares)
• n_tokens_title
• n_tokens_content
• average_token_length
• summary_channel_value
• summary_weekday
• LDA_00
• LDA_01
• LDA_02
• LDA_03
• LDA_04
• global_subjectivity
• global_sentiment_polarity
• global_rate_positive_words
• global_rate_negative_words
• title_subjectivity
• title_sentiment_polarity

• LDA_03
• LDA_02
• kw_max_avg
• kw_avg_avg
• summary_channel_value
• self_reference_min_shares
• self_reference_avg_shares
![Page 13: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/13.jpg)
Models - Chosen Attributes
Random Forest Importance
R2: -1.376
Highly Correlated (w/ shares)
R2: 0.01434
R2: 0.0148
Subjective Opinion
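For readers puzzled by a negative R2: R2 compares a model's squared error with that of always predicting the mean, so it drops below zero whenever the model does worse than that baseline. A minimal sketch with made-up numbers (not the presenters' data or models):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Made-up share counts with one heavy outlier, as is typical for shares.
shares = [800, 1100, 1400, 2000, 9700]
mean_shares = sum(shares) / len(shares)

# Baseline: always predict the mean. By construction R2 is 0.
baseline_r2 = r_squared(shares, [mean_shares] * len(shares))

# Hypothetical predictions that are worse than the mean give R2 < 0.
poor_r2 = r_squared(shares, [5000, 100, 9000, 200, 1000])

print(baseline_r2, poor_r2 < 0)
```

Values near zero, like those reported on this slide, mean the chosen attributes explain almost none of the variance in shares.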
![Page 14: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/14.jpg)
04
Data Insights
![Page 15: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/15.jpg)
Data Insights
Publication Day: Most articles are published on Tuesday, Wednesday, and Thursday. The fewest articles are published on weekends.
Channel: The most popular topic is Viral, followed by Tech and Business. The least popular topic is Social Media.
No. of keywords: Generally between 5 and 10.
![Page 16: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/16.jpg)
Challenges
![Page 17: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/17.jpg)
Challenges
• Understanding the variables: what is LDA topic #, sentiment, polarity, keywords
• Finding relations among attributes and determining which attributes are important for modelling.
• Numbers in dataset vs. numbers on Mashable: shares, videos, images
• Can’t do boosting because we don’t have a binary outcome
![Page 18: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/18.jpg)
Recommendations
![Page 19: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/19.jpg)
Recommendations
For Mashable
• Publish during the week rather than on the weekend
• Publish about world, technology, and business, and avoid social media articles
• Publish articles closer to the topic (minimize impurity)
For Researchers
• Always identify your attributes
• Collect data ethically and accurately
• To get more accurate results, gather data on the number of likes and comments, the number of tweets or hashtags, and the number of URL mentions, and understand the source of shares
![Page 20: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/20.jpg)
Conclusion
![Page 21: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/21.jpg)
Conclusion
● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing the number of trees, and changing the classifier do not improve the R2 value
![Page 22: Popularity of Online News Article](https://reader035.vdocuments.net/reader035/viewer/2022062522/588129af1a28ab00438b5219/html5/thumbnails/22.jpg)
THANK YOU!
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel