ankit presentation

Author: ankit-agrawal

Post on 20-Feb-2017

  • Introduction Related work Methodology Proposed Approach Conclusion & Future Work Reference References

    Classification of Sentimental Reviews Using Machine Learning Techniques

    ICRTC-2015: 3rd International Conference on Recent Trends in Computing

    Presented at SRM University, Delhi-NCR Campus, Ghaziabad (U.P.)

    Ankit Agrawal

    Department of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela - 769008, Odisha, India

    March 9, 2015

    Sentiment Analysis

    Sentiment mainly refers to feelings, emotions, opinions, or attitudes (Argamon et al., 2009).

    With the rapid growth of the World Wide Web, people often express their sentiments on the internet through social media, blogs, ratings, and reviews.

    Business owners and advertising companies often employ sentiment analysis to discover new business strategies and advertising campaigns.

    Machine Learning Techniques

    Machine learning algorithms are used to classify and predict whether a document represents positive or negative sentiment. The different types of machine learning algorithms are:

    Supervised algorithms use a labeled dataset, where each document of the training set is labeled with the appropriate sentiment.

    Unsupervised algorithms work on an unlabeled dataset (Singh et al., 2007).

    This study is mainly concerned with supervised learning techniques on a labeled dataset.

    Types of Sentiment Analysis

    The different types of sentiment analysis are as follows:

    Document level: Document-level sentiment classification aims at classifying the entire document or review as either positive or negative.

    Sentence level: Sentence-level sentiment classification considers the polarity of each individual sentence of a document.

    Aspect level: Aspect-level sentiment classification first identifies the different aspects of a corpus and then, for each document, calculates the polarity with respect to the obtained aspects.

    Document-level sentiment analysis is considered in this study.

    Review of Related work

    Author(s): Sentiment Analysis Approach

    Pang et al. (2002): They considered sentiment classification based on the categorization aspect, with positive and negative sentiments. They applied three machine learning algorithms, namely Naive Bayes, SVM, and Maximum Entropy classification, over n-gram features.

    Turney (2002): He presented an unsupervised algorithm to classify a review as either recommended (positive) or not recommended (negative). He used a part-of-speech (POS) tagger to identify phrases that contain adjectives or adverbs.

    Read (2005): He used emoticons for labeling the dataset, because they are independent of time, topic, and domain. He applied machine learning classifiers to this labeled dataset.

    Review of Related Work (continued)

    Author(s): Sentiment Analysis Approach

    Dave et al. (2003): They used structured reviews for testing and training, identifying features and scoring methods to determine whether the reviews are positive or negative. They used the classifier to classify sentences obtained from web searches, using the product name as the search condition.

    Whitelaw et al. (2005): They presented a sentiment classification technique based on the analysis and extraction of appraisal groups. An appraisal group represents a set of attribute values in task-independent semantic taxonomies.

    Li et al. (2011): They proposed various semi-supervised techniques to address the shortage of labeled data for sentiment classification. They used an under-sampling technique to deal with the imbalance problem in sentiment classification.

    Types of Classification

    Binary sentiment classification: Each document or review of the corpus is classified into one of two classes, positive or negative.

    Multi-class sentiment classification: Each review can be classified into more than two classes (strong positive, positive, neutral, negative, strong negative).

    Generally, binary classification is useful when two products need to be compared. In this study, the implementation uses binary sentiment classification.

    Preparation of Data

    The unstructured textual review data needs to be converted to meaningful data in order to apply machine learning algorithms. The following methods have been used to transform textual data into numerical vectors.

    CountVectorizer: Based on the number of occurrences of a feature in the review, a sparse matrix is created (Garreta and Moncecchi, 2013).

    Term Frequency - Inverse Document Frequency (TF-IDF): The TF-IDF score helps balance the weight between the most frequent or general words and less commonly used words. Term frequency calculates the frequency of each token in the review, but this frequency is offset by the frequency of that token in the whole corpus (Garreta and Moncecchi, 2013). The TF-IDF value shows the importance of a token to a document in the corpus.
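
    The counting idea behind CountVectorizer can be sketched in plain Python. This is only an illustration of the technique, not the scikit-learn implementation; the function name `count_vectorize`, the whitespace tokenizer, and the two toy reviews are assumptions made for the example.

```python
from collections import Counter


def count_vectorize(reviews):
    """Build a sorted vocabulary and one sparse count row per review."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [review.lower().split() for review in reviews]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sparse row: only columns with a non-zero count are stored.
        rows.append({index[token]: n for token, n in counts.items()})
    return vocabulary, rows


# Hypothetical toy corpus.
reviews = ["good movie good plot", "bad movie"]
vocabulary, rows = count_vectorize(reviews)
```

    Each row maps a column index in `vocabulary` to the number of times that token occurs in the review, which is the sparse token-count matrix described above.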

    Machine Learning Algorithms Used

    Naive Bayes Algorithm: It is a probabilistic classifier that applies Bayes' theorem under a strong independence assumption between the features (McCallum et al., 1998). For a given textual review d and a class c (positive or negative), the conditional probability of the class given the review is P(c | d). According to Bayes' theorem, this quantity can be computed using equation 1:

    $$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \tag{1}$$
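
    A minimal sketch of how equation 1 turns into a classifier is given below. It assumes a multinomial event model with add-one smoothing, which the slides do not specify, and it drops P(d) because that term is the same for every class. The function names and the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate log P(c) and log P(token | c) with add-one smoothing."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + 1) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the largest log P(c) + sum of log P(token | c)."""
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy reviews, already tokenized.
train_docs = [
    ["great", "movie", "loved", "it"],
    ["awesome", "acting", "great", "plot"],
    ["boring", "movie", "hated", "it"],
    ["terrible", "acting", "boring", "plot"],
]
train_labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels)
prediction = predict(["great", "acting"], log_prior, log_likelihood)
```

    Working in log space avoids numerical underflow when many small token probabilities are multiplied together.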

    Machine Learning Algorithms Used (continued)

    Support Vector Machine Algorithm: SVM is a non-probabilistic binary linear classifier (Turney, 2002). The SVM model represents each review in vectorized form as a data point in space. The method analyzes the complete vectorized data, and the key idea behind training the model is to find a separating hyperplane. The set of textual data vectors is said to be optimally separated by a hyperplane only when it is separated without error and the distance between the closest points of each class and the hyperplane is maximal.
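
    The margin criterion can be made concrete with a short sketch that compares two candidate hyperplanes on the same data. It does not train an SVM; it only measures the quantity an SVM maximizes. The points, labels, and both hyperplanes are hypothetical values chosen for illustration.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if no point is misclassified."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Hypothetical 2-D review vectors with labels +1 (positive) and -1 (negative).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes that both separate the data without error.
margin_a = margin((1.0, 1.0), -2.0, points, labels)  # hyperplane x1 + x2 = 2
margin_b = margin((1.0, 0.0), -1.5, points, labels)  # hyperplane x1 = 1.5
```

    Both hyperplanes classify every point correctly, but the first leaves a larger gap to the closest points, so it is the better of the two candidates under the maximum-margin criterion.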

    Confusion Matrix

    A confusion matrix is generated to tabulate the performance of a classifier.

                            Correct Labels
                            Positive               Negative
    Predicted Positive      True Positive (TP)     False Positive (FP)
    Predicted Negative      False Negative (FN)    True Negative (TN)

    Table: Confusion Matrix

    TP (True Positive) is the number of positive reviews that are correctly predicted, and FP (False Positive) is the number of negative reviews predicted as positive.

    TN (True Negative) is the number of negative reviews correctly predicted, and FN (False Negative) is the number of positive reviews predicted as negative.
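
    As a small illustration, the four cells can be counted directly from paired true and predicted labels. The sketch uses the standard definitions (a false positive is a negative review predicted as positive, and a false negative is a positive review predicted as negative); the function name and the toy labels are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels, positive="positive"):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1  # positive review predicted as positive
        elif guess == positive:
            fp += 1  # negative review predicted as positive
        elif truth == positive:
            fn += 1  # positive review predicted as negative
        else:
            tn += 1  # negative review predicted as negative
    return tp, fp, fn, tn


# Hypothetical labels for six reviews.
truth = ["positive", "positive", "positive", "negative", "negative", "negative"]
guess = ["positive", "positive", "negative", "negative", "negative", "positive"]
tp, fp, fn, tn = confusion_counts(truth, guess)
```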

    Evaluation Parameters

    Precision: It gives the exactness of the classifier. It is the ratio of the number of correctly predicted positive reviews to the total number of reviews predicted as positive.

    $$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

    Recall: It measures the completeness of the classifier. It is the ratio of the number of correctly predicted positive reviews to the actual number of positive reviews present in the corpus.

    $$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

    Evaluation Parameters (continued)

    F-measure: It is the harmonic mean of precision and recall. Its best value is 1 and its worst value is 0. The formula for calculating the F-measure is given in equation 4:

    $$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

    Accuracy: It is one of the most common performance evaluation parameters. It is calculated as the ratio of the number of correctly predicted reviews to the total number of reviews present in the corpus. The formula for calculating accuracy is given in equation 5:

    $$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{5}$$
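
    Equations 2 to 5 translate directly into code. The sketch below takes the four confusion-matrix counts as input; the function name and the example counts are hypothetical and are not results from this study.

```python
def evaluation_parameters(tp, fp, fn, tn):
    """Compute precision, recall, F-measure, and accuracy (equations 2 to 5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_measure, accuracy


# Hypothetical counts chosen only to exercise the formulas.
precision, recall, f_measure, accuracy = evaluation_parameters(tp=8, fp=2, fn=4, tn=6)
```

    The sketch assumes at least one predicted positive and one actual positive; otherwise the divisions in precision and recall are undefined and would need a guard.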

    Proposed Approach

    Dataset -> Preprocessing (stopword, numerical, and special character removal) -> Vectorization -> Training using a machine learning algorithm -> Classification -> Result

    Figure: Steps to obtain the required output

    Dataset

    In this study, the labeled aclIMDb movie dataset (IMDb, 2006) is considered.

    The dataset contains 12,500 positive and 12,500 negative labeled reviews for training the model.

    It also contains 12,500 positive and 12,500 negative reviews for testing the model.

    Preprocessing

    The reviews contain a large amount of vague information that needs to be eliminated.

    In the first preprocessing step, all special characters and unnecessary blank spaces are removed.

    It is observed that reviewers often repeat a particular character of a word to give more emphasis to an expression or to make the review trendy (Amir et al., 2014).

    The second preprocessing step removes all English stopwords.
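
    The preprocessing steps can be sketched as follows. The stopword set here is a deliberately tiny, hypothetical stand-in for a full English list, and collapsing runs of three or more repeated letters down to two is an assumption: the slides note the repetition but do not state how it is handled.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a full English list.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(review):
    """Clean one review and return its remaining tokens."""
    text = review.lower()
    # Replace digits and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Assumption: shorten runs of 3+ identical letters ("sooooo" -> "soo").
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    # split() also discards the unnecessary blank spaces.
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("This movie was sooooo GOOD!!! 10/10 :)")
```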

    Vectorization

    CountVectorizer: It transforms the reviews into a token count matrix. First, it tokenizes each review; then, according to the number of occurrences of each token, a sparse matrix is created.

    TF-IDF: Its value represents the importance of a word to a document in a corpus. The TF-IDF value is proportional to the frequency of a word in a document, but it is offset by the frequency of the word in the corpus.

    Calculation of the TF-IDF value: Suppose a movie review contains 100 words, in which the word Awesome appears 5 times. The term frequency (TF) for Awesome is then 5 / 100 = 0.05. Suppose further that there are 1 million reviews in the corpus and the word Awesome appears in 1,000 of them. Then the inverse document frequency (IDF) is calculated as log10(1,000,000 / 1,000) = 3. Thus, the TF-IDF value is 0.05 * 3 = 0.15.
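
    The worked example can be reproduced in a few lines. The sketch assumes the plain textbook form with a base-10 logarithm and no smoothing, matching the arithmetic above; library implementations often use a different log base or smoothing terms, so their values will differ. The function name is hypothetical.

```python
import math


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Return (TF, IDF, TF-IDF) using an unsmoothed base-10 IDF."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / docs_with_term)
    return tf, idf, tf * idf


# Numbers from the example: 5 occurrences in a 100-word review,
# and 1,000 of 1,000,000 reviews contain the word.
tf, idf, score = tf_idf(term_count=5, doc_length=100, n_docs=1_000_000, docs_with_term=1_000)
```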

    Training and Classification

    Naive Bayes (NB) algorithm: Using probabilistic analysis, features are extracted from the numeric vectors. These features help in training the Naive Bayes classifier model (McCallum et al., 1998).

    Support Vector Machine (SVM) algorithm: SVM plots all the numeric vectors in space and defines decision boundaries by hyperplanes. The hyperplane separates the vectors into two categories such that the distance from the closest point of each category to the hyperplane is maximal (Turney, 2002).

    After training the model using the above-mentioned algorithms, the 12,500 positive and 12,500 negative reviews given for testing are classified by the trained model.

    Result NB

    The confusion matrix obtained after Naive Bayes classification is shown in table 2, and the evaluation parameters precision, recall, and F-measure are shown in table 3:

    Table: Confusion matrix for Naive Bayes classifier

                  Correct Labels
                  Positive    Negative
    Positive      11025       1475
    Negative      2612        9888

    Table: Evaluation parameters for Naive Bayes classifier

                  Precision   Recall   F-Measure
    Negative      0.81        0.88     0.84
    Positive      0.87        0.79     0.83

    The accuracy of the Naive Bayes classifier is 0.83652.

    Result SVM

    The confusion matrix obtained after Support Vector Machine classification is shown in table 4, and the evaluation parameters precision, recall, and F-measure are shown in table 5:

    Table: Confusion matrix for Support Vector Machine classifier

                  Correct Labels
                  Positive    Negative
    Positive      10993       1507
    Negative      1749        10751

    Table: Evaluation parameters for Support Vector Machine classifier

                  Precision   Recall   F-Measure
    Negative      0.86        0.88     0.87
    Positive      0.88        0.86     0.87

    The accuracy of the Support Vector Machine classifier for unigrams is 0.86976.
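
    As a quick arithmetic check, the two reported accuracies can be recomputed from the confusion matrices. Accuracy depends only on the diagonal and the grand total, so the sketch needs no assumption about which axis holds the correct labels. The function and variable names are hypothetical; the counts are those of the two confusion matrices.

```python
def accuracy(matrix):
    """Fraction of reviews on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts copied from the Naive Bayes and SVM confusion matrices.
naive_bayes_matrix = [[11025, 1475], [2612, 9888]]
svm_matrix = [[10993, 1507], [1749, 10751]]

nb_accuracy = accuracy(naive_bayes_matrix)
svm_accuracy = accuracy(svm_matrix)
```

    Both matrices sum to 25,000 test reviews, and the recomputed values agree with the reported 0.83652 and 0.86976.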

    Comparison of Work

    Table: Comparison of classification accuracy reported in the literature on the IMDb dataset

    Study                        Naive Bayes    Support Vector Machine
    Pang et al. (2002)           0.815          0.659
    Salvetti et al. (2004)       0.796          x
    Mullen and Collier (2004)    x              0.86
    Beineke et al. (2004)        0.659          x
    Matsumoto et al. (2005)      x              0.883
    Proposed approach            0.83           0.884

    An x indicates that the authors did not consider that algorithm in their paper.

    Conclusion

    In this study, an attempt has been made to classify sentimental movie reviews using machine learning techniques.

    Two different algorithms, namely Naive Bayes (NB) and Support Vector Machine (SVM), are implemented.

    It is observed that the SVM classifier outperforms the Naive Bayes classifier in predicting the sentiment of a review (accuracy of 0.86976 versus 0.83652).

    The result obtained in this study is comparatively better than other results in the literature on the same dataset.

    Future Work

    In this study, only two different classifiers have been implemented.

    In the future, other supervised classification strategies, such as the Maximum Entropy classifier, Stochastic Gradient Classifier, K Nearest Neighbor, and others, can be considered for implementation.

    Finally, their results can be compared with those of SVM, which is currently the best classifier for sentiment analysis in this study.

    Reference I

    Amir, S., Almeida, M., Martins, B., Filgueiras, J., and Silva, M. J. (2014). Tugas: Exploiting unlabelled data for Twitter sentiment analysis. SemEval 2014, page 673.

    Argamon, S., Bloom, K., Esuli, A., and Sebastiani, F. (2009). Automatically determining attitude type and force for sentiment analysis. In Human Language Technology. Challenges of the Information Society, pages 218-231. Springer.

    Beineke, P., Hastie, T., and Vaithyanathan, S. (2004). The sentimental factor: Improving review classification via human-provided information. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 263. Association for Computational Linguistics.

    Dave, K., Lawrence, S., and Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International Conference on World Wide Web, pages 519-528. ACM.

    Garreta, R. and Moncecchi, G. (2013). Learning scikit-learn: Machine Learning in Python. Packt Publishing Ltd.

    IMDb (2006). IMDb, Internet Movie Database sentiment analysis dataset.

    Li, S., Wang, Z., Zhou, G., and Lee, S. Y. M. (2011). Semi-supervised learning for imbalanced sentiment classification. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence, volume 22, page 1826.

    Matsumoto, S., Takamura, H., and Okumura, M. (2005). Sentiment classification using word sub-sequences and dependency sub-trees. In Advances in Knowledge Discovery and Data Mining, pages 301-311. Springer.

    McCallum, A., Nigam, K., et al. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41-48. Citeseer.

    Mullen, T. and Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In EMNLP, volume 4, pages 412-418.

    Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79-86. Association for Computational Linguistics.

    Reference II

    Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pages 43-48. Association for Computational Linguistics.

    Salvetti, F., Lewis, S., and Reichenbach, C. (2004). Automatic opinion polarity classification of movie. Colorado Research in Linguistics, 17:2.

    Singh, Y., Bhatia, P. K., and Sangwan, O. (2007). A review of studies on machine learning techniques. International Journal of Computer Science and Security, 1(1):70-84.

    Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 417-424. Association for Computational Linguistics.

    Whitelaw, C., Garg, N., and Argamon, S. (2005). Using appraisal groups for sentiment analysis. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 625-631. ACM.

    Thank You!
