Empowering Businesses using Yelp Reviews Mining
Vipul Munot, Pralhad Sapre, Nishant Salvi,
Neelam Tikone, Rutuja Kulkarni
Fall 2016
Advisor: Prof. Xiaozhong Liu
Z534 ILS Search
Yelp Dataset Mining
Agenda
• Task 1 - Predict the categories of a business (multi-class and multi-label)
• Task 2 - Predict the pros and cons of a business (topic modelling)
Technologies
• MongoDB
• Python
• Gensim
• NLTK
• Scikit-learn
• TextBlob
• R
Exploratory Data Analysis
• Total number of reviews: 2,685,066
• Total number of businesses: 85,901
Ratings for Businesses
Top Categories
• Restaurants: 26729
• Shopping: 12444
• Food: 10143
• Beauty: 7490
• Health & Medical: 6106
• Home Services: 5866
• Nightlife: 5507
Data Preprocessing
• Merge the reviews and businesses using business ids.
• Merge all the reviews for a business id into a single passage.
• Remove stop words from the reviews.
• Use TF-IDF to create the word vectors.
• Class labels: all categories for that business id.
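The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the field names `business_id` and `text`, the tiny `STOP_WORDS` set, the helper names, and the unsmoothed `idf = log(N / df)` weighting are all assumptions made for the example (libraries such as scikit-learn use a smoothed variant by default).

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption); NLTK ships a much fuller one.
STOP_WORDS = {"the", "a", "and", "is", "was", "of", "to"}

def merge_reviews(reviews):
    """Concatenate all review texts per business id into one passage."""
    passages = defaultdict(list)
    for review in reviews:
        passages[review["business_id"]].append(review["text"])
    return {bid: " ".join(texts) for bid, texts in passages.items()}

def tokenize(passage):
    """Split on whitespace, strip punctuation, lowercase, drop stop words."""
    tokens = (tok.strip(".,!?;:\"'()").lower() for tok in passage.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def tfidf(passages):
    """Return {business_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {bid: tokenize(text) for bid, text in passages.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per passage
    vectors = {}
    for bid, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[bid] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

# Hypothetical toy reviews for two businesses.
reviews = [
    {"business_id": "b1", "text": "The coffee was great."},
    {"business_id": "b1", "text": "Great cake and coffee!"},
    {"business_id": "b2", "text": "The tires and the oil change was great."},
]
passages = merge_reviews(reviews)
vectors = tfidf(passages)
```

In this toy data the word "great" appears in both passages, so its weight is zero, while "coffee" is specific to one business and receives a positive weight.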
Data
Multi-Class: Hotel Chocolate - [Coffee & Tea, Food, Cafes, Chocolatiers & Shops, Specialty Food, Event Planning & Services, Hotels & Travel, Hotels, Restaurants]
Multi-Label: Predict at least 45% of the distinct labels that “Hotel Chocolate” has.
Task 1
Prediction of Business categories using reviews
• Naive Bayes
• Logistic Regression
• Random Forest
We built the Naive Bayes classifier from the ground up. For the rest we used scikit-learn.
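For illustration, a from-scratch Naive Bayes for this kind of task might look roughly like the sketch below. It is not the project's implementation: the class name `TinyMultiLabelNB`, its methods, the add-one smoothing, and the toy data are assumptions. It also simplifies the setup by ranking categories with a multinomial Naive Bayes score rather than training separate one-vs-all binary classifiers.

```python
import math
from collections import Counter, defaultdict

class TinyMultiLabelNB:
    """Multinomial Naive Bayes that scores each category independently."""

    def fit(self, docs, label_sets):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()
        self.n_docs = len(docs)
        for tokens, labels in zip(docs, label_sets):
            self.vocab.update(tokens)
            # A multi-label document contributes to every one of its categories.
            for label in labels:
                self.doc_counts[label] += 1
                self.word_counts[label].update(tokens)
        return self

    def log_score(self, tokens, label):
        """log P(label) + sum of log P(word | label), with add-one smoothing."""
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)
        score = math.log(self.doc_counts[label] / self.n_docs)
        for tok in tokens:
            score += math.log((counts[tok] + 1) / total)
        return score

    def predict_top_k(self, tokens, k=7):
        """Return the k highest-scoring categories for a tokenized passage."""
        ranked = sorted(self.doc_counts,
                        key=lambda label: self.log_score(tokens, label),
                        reverse=True)
        return ranked[:k]

# Hypothetical toy training data: tokenized passages and their category sets.
docs = [["oil", "tires", "engine"], ["coffee", "cake", "latte"], ["tires", "brakes"]]
label_sets = [{"Automotive", "Tires"}, {"Food", "Cafes"}, {"Automotive"}]

model = TinyMultiLabelNB().fit(docs, label_sets)
top_two = model.predict_top_k(["tires", "engine"], k=2)
```

Because every category gets its own score, the same model can return several labels per business, which is what the multi-label setting needs.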
Challenges Faced
• Multi-class classification
• Adapting existing classifiers (one vs all)
• Preprocessing the data (engineering problem)
• Defining our own accuracy function based on the nature of the problem (partial accuracy - 45% in our case)
• Labels assigned are not mutually exclusive
• There is an inherent class hierarchy - could be learned by association rule mining
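The last point, learning a label hierarchy through association rules, could be explored along the lines of the sketch below. It is only a sketch under assumptions: the function name `rule_confidences`, the `min_support` cutoff, and the toy label sets are hypothetical, and it computes only the confidence of pairwise rules A → B, i.e. count(A and B) / count(A).

```python
from collections import Counter
from itertools import permutations

def rule_confidences(label_sets, min_support=2):
    """Return {(a, b): confidence of rule a -> b} from label co-occurrence."""
    single = Counter()  # label -> number of businesses carrying it
    pair = Counter()    # ordered (a, b) -> number of businesses carrying both
    for labels in label_sets:
        labels = set(labels)
        single.update(labels)
        pair.update(permutations(sorted(labels), 2))
    return {
        (a, b): pair[(a, b)] / single[a]
        for (a, b) in pair
        if single[a] >= min_support  # skip rules with a rarely seen antecedent
    }

# Hypothetical toy label sets, one per business.
label_sets = [
    {"Automotive", "Tires"},
    {"Automotive", "Auto Repair"},
    {"Automotive", "Tires", "Auto Repair"},
    {"Restaurants", "Mexican"},
]
confidences = rule_confidences(label_sets)
```

A rule such as Tires → Automotive with confidence 1.0, paired with a lower confidence in the reverse direction, would hint that "Tires" sits below "Automotive" in the hierarchy.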
Adapting classifiers to multi-class, multi-label problems
• Make a probabilistic prediction
• Take the top 7 categories
• Accurate += 1 if |prediction ∩ truth| > len(truth) * 0.45
• This is the idea of a partial match
e.g.
"predicted_labels" : [ "Automotive", "Oil Change Stations", "Auto Repair", "Tires", "Shopping", "Auto Parts & Supplies", "Gas & Service Stations" ]
"labels" : [ "Automotive", "Auto Parts & Supplies" ]
Both true labels appear among the 7 predictions, so 2 > 2 * 0.45 and this business counts as accurate.
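The partial-match rule described above can be written out as a short sketch. The function names and the `probabilities` dictionary shape are assumptions for illustration; only the top-7 cutoff and the 0.45 threshold come from the slides.

```python
def top_k_labels(probabilities, k=7):
    """Return the k categories with the highest predicted probability."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]

def is_partial_match(predicted, truth, threshold=0.45):
    """True if the overlap exceeds threshold * number of true labels."""
    overlap = len(set(predicted) & set(truth))
    return overlap > len(truth) * threshold

def partial_accuracy(predictions, truths, threshold=0.45):
    """Fraction of businesses whose prediction is a partial match."""
    hits = sum(is_partial_match(p, t, threshold)
               for p, t in zip(predictions, truths))
    return hits / len(truths)

# The example from the slide: 2 of 2 true labels are among the 7 predictions.
predicted = ["Automotive", "Oil Change Stations", "Auto Repair", "Tires",
             "Shopping", "Auto Parts & Supplies", "Gas & Service Stations"]
truth = ["Automotive", "Auto Parts & Supplies"]
slide_example_matches = is_partial_match(predicted, truth)

# Hypothetical second business: only 1 of 4 true labels is predicted.
other_truth = ["Food", "Cafes", "Shopping", "Bakeries"]
accuracy = partial_accuracy([predicted, predicted], [truth, other_truth])
```

For the second business, 1 overlapping label does not exceed 4 * 0.45 = 1.8, so it is not counted and the toy accuracy is 0.5.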
Evaluation Metrics
• Hamming Loss: fraction of wrong labels to the total number of labels
• Hamming Score: number of correct labels divided by the size of the union of predicted and true labels
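As a rough sketch of the two definitions above for a single business, assuming labels are handled as Python sets: the function names and the toy `all_labels` universe are assumptions, and "wrong labels" is read here as labels that appear in exactly one of the predicted and true sets.

```python
def hamming_loss(predicted, truth, all_labels):
    """Wrong labels (missed or spurious) divided by the total number of labels."""
    predicted, truth = set(predicted), set(truth)
    wrong = len(predicted ^ truth)  # symmetric difference
    return wrong / len(all_labels)

def hamming_score(predicted, truth):
    """Correct labels divided by the size of the union of predicted and true labels."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        return 1.0  # nothing predicted and nothing expected
    return len(predicted & truth) / len(union)

# Hypothetical example: 7 predicted labels, 2 true labels, 10 possible labels.
predicted = ["Automotive", "Oil Change Stations", "Auto Repair", "Tires",
             "Shopping", "Auto Parts & Supplies", "Gas & Service Stations"]
truth = ["Automotive", "Auto Parts & Supplies"]
all_labels = predicted + ["Food", "Restaurants", "Nightlife"]

loss = hamming_loss(predicted, truth, all_labels)   # 5 wrong out of 10 labels
score = hamming_score(predicted, truth)             # 2 correct out of a union of 7
```

Here the five spurious predictions give a loss of 0.5 and a score of 2/7, even though every true label was found.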
Evaluation metrics (contd.)
• Precision: the fraction of retrieved instances that are relevant.
• Recall: the fraction of relevant instances that are retrieved.
Naive Bayes
• Avg precision - 0.33
• Avg recall - 0.80
• Avg hamming score - 0.31
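Per-business precision and recall over label sets can be sketched as below. The function names are hypothetical, and averaging the per-business values is an assumption about how the "Avg" figures were obtained; the slides do not say.

```python
def precision_recall(predicted, truth):
    """Precision and recall of one predicted label set against the true set."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

def mean_precision_recall(predictions, truths):
    """Average the per-business precision and recall values."""
    pairs = [precision_recall(p, t) for p, t in zip(predictions, truths)]
    mean_p = sum(p for p, _ in pairs) / len(pairs)
    mean_r = sum(r for _, r in pairs) / len(pairs)
    return mean_p, mean_r

# Hypothetical example: 7 predicted labels, both of the 2 true labels included.
predicted = ["Automotive", "Oil Change Stations", "Auto Repair", "Tires",
             "Shopping", "Auto Parts & Supplies", "Gas & Service Stations"]
truth = ["Automotive", "Auto Parts & Supplies"]
precision, recall = precision_recall(predicted, truth)
```

Always returning 7 labels tends to favor recall over precision when a business has only a few true categories, which is one plausible reading of the Naive Bayes numbers.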
Performance of classifiers (Partial Label Match - 45%)

| Businesses | Classifier (One vs All) | Accuracy in % (75%-25% split) |
|---|---|---|
| 0 - 80000 (full set) | Naive Bayes | 90.94 |
| 0 - 20000 (537K reviews) | Random Forest | 76.18 |
| 20000 - 40000 (897K reviews) | Random Forest | 75.97 |
| 20000 - 40000 (897K reviews) | Logistic Regression | 90.64 |
| 40000 - 60000 (574K reviews) | Random Forest | 68.62 |
| 40000 - 60000 (574K reviews) | Logistic Regression | 89.61 |
Task 2: Objectives
• The prediction goal was to identify the words, phrases, ratings, and patterns that predict the pros and cons of a business.
• We also extract the good and bad features of every restaurant, which can help in providing suggestions to Yelp users.
Task 2: Tools and Techniques
• Gensim: used for applying the LDA algorithm
• TextBlob: used for assigning POS tags
• NLTK: used for removing stop words, extracting nouns and creating the bag of words
• LDA (Latent Dirichlet Allocation): used for grouping similar terms from negative and positive reviews together and associating a name with that grouping
Task 2: Build Model
Task 2: Utilize Model
Task 2: Analysis
Top 10 Good Topics
1. Customer Service
2. Food
3. Bar & Liquor
4. Overall Quality
5. Mexican Food
6. Breakfast
7. Ambiance & Hospitality
8. Expensive
9. Location
10. Entertainment
Task 2: Analysis
Top 10 Bad Topics
1. Staff and Service
2. Coffee and Cake
3. Ambiance and Hospitality
4. Bad Service
5. Pet Friendliness
6. Delivery Services
7. Entertainment
8. Parking and Utilities
9. Food
10. Mexican Food
Task 2: Results
• Business Id - 1vQLTKwmcmZXtNzfKEvMmA
• Good points - Food, Mexican Food, Overall Quality
• Bad points - Delivery Services, Staff and Service
Future Scope
• Association rules to define the hierarchy of labels
• Devise a formula to convert good and bad topics into a rating
• Human feedback to evaluate Task 2
Questions?
Thank You