Posted on 23-May-2020
Summary of datamining on yelp review data
0 introduction
1. data
1.1 json format raw data
1.2 convert review text from json to tab-separated value (tsv) file
1.3 clean review text and convert to binary input
2. analysis
2.1 algorithm on text mining
2.2 experiment
3. future work
Yelp Dataset Challenge Yelp connects people to great local businesses. To help people find great local
businesses, Yelp engineers have developed an excellent search engine to sift through over 89 million reviews and
help people find the most relevant businesses for their everyday needs. Yelp is proud to introduce a deep dataset
for research minded academics from our wealth of data. If you’ve been looking for a rich set of data to train your
models on and use in publications, this is it. Tired of using the same standard datasets? Want some real world
relevance in your research project? This data is for you!
• How well can you guess a review's rating from its text alone?
• Can you take all of the reviews of a business and predict when it will be the most busy, or when the business
is open?
• Can you predict if a business is good for kids? Has WiFi? Has Parking?
• What makes a review useful, funny, or cool?
• Can you figure out which business a user is likely to review next?
• How much of a business's success is really just location, location, location?
• What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just “Chinese restaurants”),
and can you learn this from the review text?
• What are the differences between the cities in the dataset?
[ref: https://www.yelp.com/html/pdf/Yelp_Dataset_Challenge_Terms_round_7.pdf]
This study uses the yelp review dataset to address the first question: how well can you
guess a review's rating from its text alone? The text mining experiment follows
Pang et al.'s 2002 paper.
[ref: Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment
Classification using Machine Learning Techniques". Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.]
1.1 json format raw data
yelp_academic_dataset_review.json
contains 2,225,213 reviews:
$ wc -l yelp_academic_dataset_review.json
2225213 yelp_academic_dataset_review.json
Review format:
{
'type': 'review',
'business_id': (encrypted business id),
'user_id': (encrypted user id),
'stars': (star rating, rounded to half-stars),
'text': (review text),
'date': (date, formatted like '2012-03-14'),
'votes': {(vote type): (count)},
}
Example:
{
"votes": {
"funny": 0,
"useful": 0,
"cool": 0
},
"user_id": "PUFPaY9KxDAcGqfsorJp3Q",
"review_id": "Ya85v4eqdd6k9Od8HbQjyA",
"stars": 4,
"date": "2012-08-01",
"text": "Mr Hoagie is an institution. Walking in, it does
seem like a throwback to 30 years ago, old fashioned menu board,
booths out of the 70s, and a large selection of food. Their speciality
is the Italian Hoagie, and it is voted the best in the area year after
year. I usually order the burger, while the patties are obviously
cooked from frozen, all of the other ingredients are very fresh.
Overall, its a good alternative to Subway, which is down the road.",
"type": "review",
"business_id": "5UmKMjUEUNdYWqANhGckJw"}
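Each line of yelp_academic_dataset_review.json holds one such JSON object, so the file can be streamed line by line. A minimal Python sketch (the helper name iter_reviews is made up for illustration):

```python
import json

def iter_reviews(path):
    """Yield (stars, text) pairs, one per JSON line of the review file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            review = json.loads(line)
            yield review["stars"], review["text"]
```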
1. data
1.2 convert review text from json to tab-separated value (tsv) file
• Process the json file with Hive. Import yelp_academic_dataset_review.json into Hadoop and output a null-separated (\000) file as follows:
hive> CREATE TABLE IF NOT EXISTS tall (str string);
hive> LOAD DATA LOCAL INPATH '/home/gpyu/HotelBigdata/yelp/yelpDatasets/yelp_academic_dataset_review.json'
      OVERWRITE INTO TABLE tall;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/gpyu/hadoopOUT/'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\000'
      SELECT GET_JSON_OBJECT(tall.str, '$.stars'),
             GET_JSON_OBJECT(tall.str, '$.user_id'),
             GET_JSON_OBJECT(tall.str, '$.review_id'),
             GET_JSON_OBJECT(tall.str, '$.date'),
             GET_JSON_OBJECT(tall.str, '$.type'),
             GET_JSON_OBJECT(tall.str, '$.business_id'),
             GET_JSON_OBJECT(tall.str, '$.text')
      FROM tall;
• Convert the null-separated file to a tab-separated value (tsv) file with the program totsv.c.
totsv.c
* GOAL: preprocess the raw data from Hive into a standard tsv (tab-separated value) file.
* The raw data is an n-column, null-separated (0x00) file. The first column must be one character,
* and the other columns should be more than one character.
*
* Step 1: replace newline (0x0A), carriage return (0x0D), and tab (0x09) with space (0x20).
* Step 2: row delimiter: keep a newline (0x0A) between tuples, detected by the pattern
  [... 0x0A xx 0x00 ...], where xx is a decimal 1-5.
* Step 3: column delimiter: replace null (0x00) with tab (0x09).
Input: null-separated file → output: tab-separated value (tsv) file review_all_col7.tsv
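The three steps of totsv.c can be sketched in Python (an assumed re-implementation based only on the comments above; the real C program may differ in details such as how it detects tuple boundaries):

```python
def null_to_tsv(raw: bytes) -> bytes:
    """Turn Hive's null-separated dump into one TSV record per line.

    A record boundary looks like [... 0x0A xx 0x00 ...], where xx is the
    one-character stars column (decimal 1-5) starting the next record.
    """
    out = bytearray()
    i, n = 0, len(raw)
    while i < n:
        b = raw[i]
        if b == 0x0A and i + 2 < n and raw[i + 1] in b"12345" and raw[i + 2] == 0x00:
            out += b"\n"   # step 2: keep the newline that separates tuples
        elif b in (0x0A, 0x0D, 0x09):
            out += b" "    # step 1: newlines/CR/tabs inside fields become spaces
        elif b == 0x00:
            out += b"\t"   # step 3: column delimiter null -> tab
        else:
            out.append(b)
        i += 1
    return bytes(out)
```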
1.3 clean review text and convert to binary input
To implement these machine learning algorithms on our document data, we used the following
standard bag-of-features framework. Let {f1 , . . . , fm } be a predefined set of m features that
can appear in a document; examples include the word “still” or the bigram “really stinks”. Let
ni(d) be the number of times fi occurs in document d. Then, each document d is represented by
the document vector d := (n1(d), n2(d), . . . , nm(d)).
[ref: Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up? Sentiment
Classification using Machine Learning Techniques". Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.]
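A minimal sketch of this document-vector construction (the feature list and document here are invented for the example; real features come from the selection step in ToBinary.java):

```python
from collections import Counter

def doc_vector(features, document):
    """Represent document d as (n1(d), ..., nm(d)) for a fixed feature set f1..fm."""
    counts = Counter(document.lower().split())
    return [counts[f] for f in features]
```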
• ToBinary.java converts review text to binary input (frequency and presence).
/** ToBinary.java
 * AUTHOR: GUANPING YU
 * DATE: 2016-08-05
 * VERSION: 0.5
 * Features whose number of occurrences in the whole corpus exceeds 1% of the
 *   number of observations can be selected.
 * GOAL: clean the file, count the words, add the negation not_ tag, and
 *   count the words again.
 * Select features by intersecting the word counts with the dictionary.
 * Transform text to binary input using the frequency of the features.
 * Tag negation: add not_ to every word following a negation word
 *   ("not", "isn't", "didn't", etc.).
 */
Dictionary: dic_pos_neg
A list of positive and negative opinion words or sentiment words for
English (6789 words).
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
* Clean file: replace punctuation with space (excluding the single quote ').
Dec Hex Binary Char Description
33 21 00100001 ! exclamation mark
34 22 00100010 " double quote
35 23 00100011 # number
36 24 00100100 $ dollar
37 25 00100101 % percent
38 26 00100110 & ampersand
40 28 00101000 ( left parenthesis
41 29 00101001 ) right parenthesis
42 2A 00101010 * asterisk
43 2B 00101011 + plus
44 2C 00101100 , comma
45 2D 00101101 - minus
46 2E 00101110 . period
47 2F 00101111 / slash
58 3A 00111010 : colon
59 3B 00111011 ; semicolon
60 3C 00111100 < less than
61 3D 00111101 = equality sign
62 3E 00111110 > greater than
63 3F 00111111 ? question mark
64 40 01000000 @ at sign
91 5B 01011011 [ left square bracket
92 5C 01011100 \ backslash
93 5D 01011101 ] right square bracket
94 5E 01011110 ^ caret / circumflex
95 5F 01011111 _ underscore
96 60 01100000 ` grave / accent*
Review text → binary input (frequency)
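The cleaning, not_ negation tagging, and frequency/presence encoding can be sketched in Python as a simplified analogue of ToBinary.java (the negation list is abbreviated, and this version negates to the end of the text rather than to the next punctuation mark):

```python
import string

NEGATIONS = {"not", "no", "isn't", "didn't", "never"}   # abbreviated list

def clean(text: str) -> str:
    """Replace punctuation with space, keeping the single quote."""
    drop = string.punctuation.replace("'", "")
    return text.lower().translate(str.maketrans(drop, " " * len(drop)))

def tag_negation(words):
    """Prefix not_ to every word after a negation word."""
    out, negate = [], False
    for w in words:
        if w in NEGATIONS:
            out.append(w)
            negate = True
        else:
            out.append("not_" + w if negate else w)
    return out

def to_binary(words, features, presence=False):
    """Frequency vector, or 0/1 presence vector, over the selected features."""
    vec = [words.count(f) for f in features]
    return [1 if v > 0 else 0 for v in vec] if presence else vec
```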
2. analysis
2.1 algorithm on text mining
The classification algorithms are taken from the R caret package.
[ref: https://cran.r-project.org/web/packages/caret/]
• Naive Bayes (method = 'nb'). For classification using package klaR
• Learning Vector Quantization (method = 'lvq'). For classification using package class
• Neural Network (method = 'nnet'). For classification and regression using package nnet
• Neural Networks with Feature Extraction (method = 'pcaNNet')
For classification and regression using package nnet
• Support Vector Machines with Linear Kernel (method = 'svmLinear')
For classification and regression using package kernlab
• Support Vector Machines with Linear Kernel (method = 'svmLinear2')
For classification and regression using package e1071
• Linear Support Vector Machines with Class Weights (method = 'svmLinearWeights')
For classification using package e1071
• k-Nearest Neighbors (method = 'knn')
For classification and regression
• Support Vector Machines with Radial Basis Function Kernel (method = 'svmRadial')
For classification and regression using package kernlab
• Stochastic Gradient Boosting (method = 'gbm')
For classification and regression using packages gbm and plyr
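Most of these models come from external packages, but k-nearest neighbors is simple enough to sketch directly on binary feature vectors. This pure-Python toy is only an illustration of the method, not the caret implementation used in the study:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training vectors."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```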
2.2 experiment and results
• 2k dataset containing only star1 and star5 reviews, 1k each. Feature frequency = the number of times fi occurs in the whole dataset;
thresholds: 01 (>= 1), 04 (>= 4), 20 (>= 20, i.e., 1% of the number of observations). Discuss the number of features versus prediction accuracy.
Model: svmLinear
File: review_2k_binfreq_svm.RData
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for freq01, freq04, freq20; both axes span 0.70-0.90.]
Discussion:
The best set is freq01, with 1,767 features; freq20 has only 150 features, yet the accuracy of the two sets is almost the same (89.3% vs
87.7%). freq04 has 641 features and an accuracy of 86.4%. To save computation time, subsequent experiments keep only features whose
frequency in the whole dataset is >= 1% of the number of observations.
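The cut-off adopted here (keep a feature only when its total count is at least 1% of the number of observations) can be sketched as follows (the helper name and fraction parameter are illustrative):

```python
from collections import Counter

def select_features(documents, min_fraction=0.01):
    """Keep words whose corpus-wide count >= min_fraction * number of documents."""
    counts = Counter(w for doc in documents for w in doc.split())
    threshold = min_fraction * len(documents)
    return sorted(w for w, c in counts.items() if c >= threshold)
```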
• 2k dataset containing only star1 and star5, 1k each. Frequency of features vs. presence of features (freq vs pres).
Model: svmLinear, lvq and gbm
File: review_2k_star2_bin_svm_gbm_lvq.RData
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for lvqFreq, lvqPres, svmPres, gbmFreq, svmFreq, gbmPres; both axes span 0.65-0.85.]
Discussion: frequency and presence of features give similar accuracy for the gbm, svmLinear, and lvq models. The gbm and
svmLinear models' accuracy (88%) is higher than the lvq model's (84%).
• 2k dataset containing only star1 and star5, 1k each. Two further classification models.
Model: nnet and nb
File: review_2k_star2_bin_nb_nnt.RData
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for nbFreq, nbPres, PcaNNetFreq, PcaNNetPres, nnetFreq, nnetPres; both axes span 0.0-0.8.]
Discussion: for the naïve Bayes (nb) model, both presence and frequency of features give lower accuracy (freq 51.7%, pres
68.4%). The neural network (nnet) and neural networks with feature extraction (PcaNNet) give higher accuracy (87.3-88.1%),
similar to the svm and gbm models (88%).
• 2k datasets (a-d, four different 2k samples) containing only star1 and star5, 1k each. To evaluate model stability, we
randomly selected four groups of 2k reviews from the yelp review data.
File: review_2k_binfreq_abcd_svm.RData
Model: svmLinear
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for samples a_2k, b_2k, c_2k, d_2k; both axes span 0.75-0.90.]
Discussion: compared with the previous 2k dataset, the prediction accuracy is essentially the same for the same model no
matter which sample is selected.
• 5k dataset containing star1 to star5, 1k each. To evaluate the model with 5 levels (1-5).
File: review_5k_bin_svm_gbm_lvq.RData
Model: svmLinear, gbm and lvq
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for lvqFreq, lvqPres, gbmPres, gbmFreq, svmFreq, svmPres; both axes span 0.20-0.45.]
Discussion: as the number of levels increases from 2 to 5, the model prediction accuracy decreases from 88% to 44%.
• 5k dataset containing star1 to star5, 1k each. Assign star2 to star1 and star4 to star5. To evaluate the model with 3 levels
(1, 3, 5).
File: review_5k_star3_bin_svm_gbm_lvq.RData
Model: svmLinear, gbm and lvq
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for lvqFreqL3, lvqPresL3, svmPresL3, svmFreqL3, gbmFreqL3, gbmPresL3; both axes span 0.4-0.7.]
Discussion: reducing 5 levels to 3 increases the prediction accuracy; the svmLinear model improves from 44% to 66%.
• 5k dataset containing star1 to star5, 1k each. Assign star2 and star3 to star1, and star4 to star5. To evaluate the model
with 2 levels (1, 5).
File: review_5k_star2a_bin_svm_gbm_lvq.RData
Model: svmLinear, gbm and lvq
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for lvqPres2a, lvqFreq2a, gbmFreq2a, gbmPres2a, svmFreq2a, svmPres2a; both axes span 0.4-0.8.]
Discussion: as the number of levels decreases, accuracy increases; going from 3 levels to 2, accuracy rises from 66% to 79%.
• 5k dataset containing star1 to star5, 1k each. Assign star2 to star1, and star3 and star4 to star5. To evaluate assigning
star3 to star1 versus star5.
File: review_5k_star2b_bin_svm_gbm_lvq.RData
Model: svmLinear, gbm and lvq
Confidence Level: 0.95
[Figure: dot plot of Accuracy and Kappa (95% CI) for lvqFreq2b, lvqPres2b, gbmPres2b, gbmFreq2b, svmFreq2b, svmPres2b; both axes span 0.4-0.8.]
Discussion: for the svmLinear model there is no significant difference between assigning star3 to star1 (2a, 79%) and assigning it to star5 (2b, 78%).
3. Future work
• Read around 200 to 500 reviews to extract further information from the details: patterns,
differences, feature selection.
• Read recent literature on new models for text mining.
• Investigate how to use the text mining results in a recommendation system.