comparative analysis of sentiment analysis techniques · comparative analysis of sentiment analysis...

5
________________________________________________________________________ ISSN (PRINT) : 2320 8945, Volume -2, Issue -1,2014 1 Comparative Analysis of Sentiment Analysis Techniques 1 ChetanKaushik, 2 AtulMishra 1,2 Computer Engg. Department, YMCA University of Science & Technology, Faridabad Email: 1 [email protected], 2 [email protected] AbstractWith the increase in the volume of sentiment rich social media websites, an increasing interest among researchers can be seen regarding Sentimental Analysis and opinion mining. The requirement is to develop a technique that can differentiate between the positive, negative or neutral sentiment underlying an electronic text. By devising an accurate method to identify the sentiments behind any text, one can predict the mood of the people regarding a particular product or service. However, there are various challenges involved in identifying the correct sentiment of a user which are discussed in our research. In this paper, we will discuss various techniques of sentiment analysis and the challenges associated with it. Index TermsMachine learning, Negation, NLP, Semantic orientation. I. INTRODUCTION Sentiment Analysis can be described as a type of Natural Language Processing which includes obtaining the feeling of a user or a group of users expressed in various comments, requests or questions posted by them on the internet. It involves building a system that can collect the user opinions and then examine and classify them according to the polarity of the post. In other words, sentiment analysis aims to determine the view of a speaker or writer on particular subject. Liu [1] defined a sentiment as a quintuple “<oj, fjk, soijkl, hi, tl>, where oj is a target object, fjk is a feature of the object oj, soijkl is the sentiment value of the opinion of the opinion holder hi on feature fjk of object oj at time tl, soijkl is +ve, -ve, or neutral, or a more granular rating, hi is an opinion holder, tl is the time when the opinion is expressed.” Sentiment Analysis has applications in various fields. For example, in marketing it helps in determining the success or failure of a new product launch or any new commercial campaign or determining that which version of a product is liked more in which part of the world. Various companies can use this data to determine their future strategies regarding a particular product or service. Sentimental Analysis can be based on a document, sentence or a phrase. In document based sentimental analysis, sentiment of the whole document is calculated as a whole and summarized according to the polarity. In sentence based sentiment analysis, individual sentences are classified as positive, negative or neutral whereas phrase based sentimental analysis assigns a polarity to the individual phrases contained in a sentence. The first requirement of sentimental Analysis is to find the subject towards which the opinion is expressed. After that the sentiment is classified as positive (which denotes satisfaction or happiness on behalf of user), negative (which shows rejection or disappointment) or neutral (which denotes no strong sentiment involved). Then the sentiment can be given a score which denotes the degree of positive or negative response from the user. There are various challenges involved with sentimental analysis. Subabrata [2] categorized these challenges as following: A. Implicit Sentiment Sometimes a sentence may carry a strong sentiment without containing any sentiment bearing word in it. For e.g. One has to be on a lot of medications to make such a documentary B. Domain Dependency Some words have different polarity when used in different domains. For e.g. The movie was inspired from a Hollywood movie. I got inspired by this book. C. Thwarted Expectations Sometimes the writer builds up a positive context and refute it in the end. For e.g. Excellent performances, very good music, stunning cinematography, all in vain because of lack of imagination of the writer/director. D. Pragmatics The pragmatics of the user needs to be identified. For e.g. It was good to see India destroy Australia in final. The match destroyed my interest in sports. E. World Knowledge Sometimes the knowledge of an entity which is used in the sentence is required to identify the sentiment. For e.g. He is just as good a person as Dracula

Upload: nguyentuyen

Post on 13-Jun-2018

259 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Comparative Analysis of Sentiment Analysis Techniques · Comparative Analysis of Sentiment Analysis Techniques 1ChetanKaushik, ... analysis, sentiment of the ... He is just as good

ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)

________________________________________________________________________

________________________________________________________________________ ISSN (PRINT) : 2320 – 8945, Volume -2, Issue -1,2014

1

Comparative Analysis of Sentiment Analysis Techniques

1ChetanKaushik,

2AtulMishra

1,2Computer Engg. Department, YMCA University of Science & Technology, Faridabad

Email: [email protected],

[email protected]

Abstract—With the increase in the volume of

sentiment rich social media websites, an increasing

interest among researchers can be seen regarding

Sentimental Analysis and opinion mining. The

requirement is to develop a technique that can

differentiate between the positive, negative or neutral

sentiment underlying an electronic text. By devising

an accurate method to identify the sentiments behind

any text, one can predict the mood of the people

regarding a particular product or service. However,

there are various challenges involved in identifying

the correct sentiment of a user which are discussed in

our research. In this paper, we will discuss various

techniques of sentiment analysis and the challenges

associated with it.

Index Terms—Machine learning, Negation, NLP,

Semantic orientation.

I. INTRODUCTION

Sentiment Analysis can be described as a type of Natural

Language Processing which includes obtaining the

feeling of a user or a group of users expressed in various

comments, requests or questions posted by them on the

internet. It involves building a system that can collect the

user opinions and then examine and classify them

according to the polarity of the post. In other words,

sentiment analysis aims to determine the view of a

speaker or writer on particular subject. Liu [1] defined a

sentiment as a quintuple “<oj, fjk, soijkl, hi, tl>, where oj

is a target object, fjk is a feature of the object oj, soijkl is

the sentiment value of the opinion of the opinion holder

hi on feature fjk of object oj at time tl, soijkl is +ve, -ve,

or neutral, or a more granular rating, hi is an opinion

holder, tl is the time when the opinion is expressed.”

Sentiment Analysis has applications in various fields. For

example, in marketing it helps in determining the success

or failure of a new product launch or any new commercial

campaign or determining that which version of a product

is liked more in which part of the world. Various

companies can use this data to determine their future

strategies regarding a particular product or service.

Sentimental Analysis can be based on a document,

sentence or a phrase. In document based sentimental

analysis, sentiment of the whole document is calculated

as a whole and summarized according to the polarity. In

sentence based sentiment analysis, individual sentences

are classified as positive, negative or neutral whereas

phrase based sentimental analysis assigns a polarity to the

individual phrases contained in a sentence.

The first requirement of sentimental Analysis is to find

the subject towards which the opinion is expressed. After

that the sentiment is classified as positive (which denotes

satisfaction or happiness on behalf of user), negative

(which shows rejection or disappointment) or neutral

(which denotes no strong sentiment involved). Then the

sentiment can be given a score which denotes the degree

of positive or negative response from the user.

There are various challenges involved with sentimental

analysis. Subabrata [2] categorized these challenges as

following:

A. Implicit Sentiment

Sometimes a sentence may carry a strong sentiment

without containing any sentiment bearing word in it. For

e.g. One has to be on a lot of medications to make such a

documentary

B. Domain Dependency

Some words have different polarity when used in

different domains. For e.g.

The movie was inspired from a Hollywood movie.

I got inspired by this book.

C. Thwarted Expectations

Sometimes the writer builds up a positive context and

refute it in the end. For e.g. Excellent performances, very

good music, stunning cinematography, all in vain

because of lack of imagination of the writer/director.

D. Pragmatics

The pragmatics of the user needs to be identified.

For e.g.

It was good to see India destroy Australia in final.

The match destroyed my interest in sports.

E. World Knowledge

Sometimes the knowledge of an entity which is used in

the sentence is required to identify the sentiment. For e.g.

He is just as good a person as Dracula

Page 2: Comparative Analysis of Sentiment Analysis Techniques · Comparative Analysis of Sentiment Analysis Techniques 1ChetanKaushik, ... analysis, sentiment of the ... He is just as good

ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)

________________________________________________________________________

________________________________________________________________________ ISSN (PRINT) : 2320 – 8945, Volume -2, Issue -1,2014

2

One has to know about Dracula to understand the correct

sentiment behind this sentence.

F. Subjectivity Detection

It is important to differentiate between sentiment rich and

neutral sentences from each other. For e.g.

I love Tokyo.

I hate the movie „love in Tokyo‟.

G. Entity Identification

There may be multiple entities in a sentence it is

important to identify that the sentiment is directed

towards which entity.

Chelsea is better than Man. Utd.

This statement is +ve for Chelsea and –ve for Man. Utd.

H. Negation

Handling negation is very difficult. One method is to

reverse the polarity of every word that comes after a

negative word (e.g. not). For e.g.

I do not like this movie.

However this method will fail for-

Not only was the food delicious, the service was

excellent.

II. FEATURES FOR SEMANTIC

ANALYSIS

Feature engineering is a basic task in performing

sentiment analysis. It includes converting a piece of text

into a feature vector. This section includes some

commonly used features for sentiment analysis.

A. Term frequency and term presence

Term frequency refers to the number of times a term is

repeated in a piece of text. It is considered to be very

important in conventional text classification tasks. But in

sentiment analysis it is observed that term presence bears

more importance then term frequency because sometimes

the presence of a single term can reverse the polarity of

the whole sentence.

B. Term position

Sometimes the terms appearing at one section of the text

contains more weight than the terms appearing at a

different section. For example a negation at the beginning

of the sentence can change the meaning of the entire

sentence. Generally terms at the first and last few

sentences in a text are given more weightage then the

terms which appear in the middle.

C. N-gram features

N-grams are used widely in natural language processing

tasks for identifying context. However it is not clear that

whether higher order N-grams perform better than the

lower order N-grams or not.

D. Subsequence kernels

Generally, word or sentence level modes are used for

sentiment analysis. Bickel [3] used a method in which

subsequence kernels were used to implicitly capture the

feature space.

The word subsequence kernel of order n are weighted

some of all word sequences of length n that occur in both

the strings which are being compared.

The mathematical formula is

Where i refers to a vector of length n that consists of the

indices of string s that correspond to the subsequence u. λ

is a kernel parameter similar to gap penalty and i[n] − i[1]

+ 1 is the total length of the span of s that constitutes a

particular occurrence of the subsequence u.

E. Adjectives only

Adjectives are the most commonly used features in

sentiment analysis. People generally use adjectives to

depict their sentiments and high accuracy is observed in

sentiment analysis techniques that focus on only

adjectives for sentiment analysis.

F. Adjective – Adverb Combination

Adverbs generally has no polarity, but when added to an

adjective they can contribute heavily in determining the

polarity of a sentence.

Benamara[4] showed how adverbs can alter the sentiment

value of a sentence and can be classified as

1. Adverbs of affirmation: certainly, totally

2. Adverbs of doubt: maybe, probably

3. Strongly intensifying adverbs: exceedingly,

immensely

4 .Weakly intensifying adverbs: barely, slightly

5. Negation and minimizers: never

Two types of AACs were defined by the work:

Unary AAC: contains one adjective and one adverb

Binary AAC: contains more than one adjective and

adverb

III. RELATED WORK

A lot of research has been done on Sentiment Analysis or

Opinion Mining of data. These studies focus on

determining the correct sentiment behind an electronic

text. Most of these approaches can be classified under

two types – machine learning and semantic orientation.

This section discusses the existing work on both of these

approaches.

Page 3: Comparative Analysis of Sentiment Analysis Techniques · Comparative Analysis of Sentiment Analysis Techniques 1ChetanKaushik, ... analysis, sentiment of the ... He is just as good

ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)

________________________________________________________________________

________________________________________________________________________ ISSN (PRINT) : 2320 – 8945, Volume -2, Issue -1,2014

3

A. Machine learning

A machine learning strategy involves two sets of

documents. A training set and a test set. The machine

learning algorithm first needs to be trained for both

Supervised learning tasks (like classification, prediction

etc.) and unsupervised learning tasks (clustering etc.).

In training phase the algorithm is trained with some

particular inputs so that later on it can be tested for

unknown inputs. The objective is to train our algorithm in

such a way that later it becomes able to classify new

unknown inputs. There are several machine learning

methods which are being used for Sentiment Analysis.

Some of them are discussed in this section.

Naïve Bayes is one of the most effective and simple

approach amongst them. It is widely used as an algorithm

for classification of text (Melville [5],Rui [6],Ziqiong

[7],Songho [8],Qiang [9] and Smeureanu[10]). In this

approach, first the prior probability of an entity being a

class is calculated and the final probability is calculated

by multiplying the prior probability with the likelihood.

The method is naïve in the sense that it assumes every

word in the text to be independent. This assumption

makes it easier to implement but less accurate.

Another approach is Support Vector Machines (SVM). It

is also used for text classification based on a

discriminative classifier (Rui [6],Ziqiong [7],Songho [8],

and Rudy [11]). The approach is based on the principle of

structural risk minimization. First the training data points

are separated into two different classes based on a

decided decision criteria or surface. The decision is based

on the support vectors selected in the training set. Several

different variants of SVM are used, one of them is a

multiclass SVM used for Sentiment Analysis [12].

The centroid classification algorithm [8] first calculates

the centroid vector for every training class. Then the

similarities between a document and all the centroids are

calculated and the document is assigned a class based on

these similarities values.

The K-Nearest Neighbor (KNN) approach [8] finds the K

nearest neighbors of a text document among the training

documents. The classification is done on the basis of the

similarity score of the class to the neighbor document.

Winnow is another commonly used approach. The

system first predicts a class for a particular document and

then receives feedback. If a mistake is detected then the

system updates its weight vectors accordingly. This

process is repeated over a collection of sufficiently large

set of training data.

Rudy [11] proposed a method based on a combined

approach which included rule based classification,

supervised learning and machine learning. A 10 fold

cross validation was carried out for each sample set. A

hybrid classification method is used in which several

classifiers work together. If the first classifier fails to

classify then it is passed on to the next classifier. The

process continues until the document is classified or there

is no other classifier left.

Ensemble technique [6] combines the output of several

classification methods into a single integrated output.

Zhu [13] proposed an approach based on artificial neural

networks to divide the document into positive, negative

and fuzzy tone. The approach was based on recursive

least squares back propagation training algorithm.

Long-Sheng [14] combined the advantages of machine

learning and information retrieval techniques using a

neural network based approach.

B. Semantic Orientation

The semantic orientation approach is based on

unsupervised learning. It doesn‟t require any training in

order to classify the sentiment data. It is used to measure

how much positive or negative is the word‟s polarity.

Kamps[15] made use of lexical relations to perform

sentiment analysis.

Andrea [16] proposed a method based on semi supervised

learning, which introduced a seed set and expanded it

later using Word Net. The assumption was that the words

with similar orientation have similar polarity.

Chunxu[17] proposed a method to perform sentiment

analysis on content whose contextual information is not

known in advance. In this method other related contents

were used to extract the required contextual information

and then used the information for determining the

orientation of the opinion.

Ting-Chun [18] proposed an unsupervised learning

algorithm based on part of speech (pos) pattern. They

used the sentiment phrase as a query for a search engine

and sentiments were predicted based on the search

results.

Gang [19] used TF-IDF (term frequency – inverse

document frequency) weighing for sentiment analysis.

They used K- means clustering on raw data, and then a

voting mechanism to further stabilize the clustering.

Multiple implementations of the process was applied to

classify the documents in to positive and negative groups.

Prabhu [20] used a simple lexicon based technique on

twitter data by identifying and extracting sentiments from

hashtags and emoticons.

IV. COMPARISON AND EVALUATION

The performance of various sentiment analysis

techniques was measured on the basis of accuracy. That

is, what percentage of text was accurately classified by

the sentiment analysis technique? The performance of

different studies discussed earlier are represented in fig 1

and a brief comparison different techniques used in them

is shown in table 1. The sources used for evaluation is

mostly movie reviews or product reviews.

It was observed that movie review is a more challenging

Page 4: Comparative Analysis of Sentiment Analysis Techniques · Comparative Analysis of Sentiment Analysis Techniques 1ChetanKaushik, ... analysis, sentiment of the ... He is just as good

ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)

________________________________________________________________________

________________________________________________________________________ ISSN (PRINT) : 2320 – 8945, Volume -2, Issue -1,2014

4

task as compared to product reviews because people use

more ironic terms while writing movie reviews hence

movie review sentiment analysis is a much more

challenging task.

From the performance evaluation, it is difficult to choose

one particular technique that stands out, since each

method used different sources for training and collection

of document with varying text granularity and feature

selection methods.

However it is observed that the machine learning

approaches show more accuracy then semantic

orientation approaches. But machine learning approaches

require more time for training. Semantic orientation on

the other hand is more useful for real time applications.

Fig.I - Performance of different studies

Table I - Summary and comparison of various sentiment analysis techniques

S.

No.

Technique Learning

Methodology

Advantages Disadvantages Study Accuracy

1 SVM

Supervised Very high accuracy

Lesser overfitting

Robust to noise

Incapable of multiclass

classification

Computationally expensive

Slow

KaiquanXu(2011) 61

Rui Xia (2011) 86.4

Ziqiong (2011) 93

Pang and Lee (2004) 86.4

2 Naïve

bayes

Supervised Faster training and

classification

Not sensitive to irrelevant

features

Handles streaming data well

Assumes independence of

feature

Less accurate than SVM

Rui Xia (2011) 85.8

XueBai (2011) 92

Gamon (2005) 86

Pang and Lee (2004) 86.4

3 Centroid

classifier

Supervised Low computation cost

High dimensional data set

Can combine multiple

features together

Term dependency within class

Too sensitive to the training

data

Large number of features in

feature vector

Songhotan (2008)

90

4 KNN

Supervised Very fast training

Simple and easy to

understand

Robust to noisy training data

Handles large data set well

Biased by value of K

High computation complexity

Gets easily fooled by

irrelevant attributes

5 Winnow

classifier

Supervised Mistake driven approach

More sensitive to

relationship among features

Weights of only active

features are updated

Less precise than SVM

Tuning not robust on different

training collections

6 K – means

clustering

Unsupervised Faster than supervised

learning methods

Easy to implement

Produces tight clusters

Less accurate than supervised

learning

Difficult to predict value of K

Doesn‟t work well for clusters

of different sizes and density

Gang li (2010)

78

0

10

20

30

40

50

60

70

80

90

100

Accuracy

Page 5: Comparative Analysis of Sentiment Analysis Techniques · Comparative Analysis of Sentiment Analysis Techniques 1ChetanKaushik, ... analysis, sentiment of the ... He is just as good

ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE)

________________________________________________________________________

________________________________________________________________________ ISSN (PRINT) : 2320 – 8945, Volume -2, Issue -1,2014

5

V. CONCLUSIONSentiment analysis is being used for different

applications and can be used for several others in future.

It is evident from the above discussion that no

classification method outperforms others consistently.

It is also observed that different techniques can be

combined to overcome each other‟s limitation and

provide a better classification all around. More work is

needed in order to further improve the classification

techniques. Several problems such as handling of implicit

product features and dealing with negation etc. are still

not completely resolved.

REFERENCES [1] B. Liu. Sentiment Analysis and Subjectivity.

Handbook of Natural Language Processing, Second

Edition, (editors: N. Indurkhya and F. J. Damerau),

2010.

[2] Subhabrata Mukherjee, Dr. Pushpak Bhattacharyya,

Indian Institute of Technology, Bombay,

Department of Computer Science and Engineering,

June 2012.

[3] Bickel, S.Bruckner, M.Scheffer,“Discriminative

learning for differing training and test

distributions”, International Conference on

Machine Learning, 2007.

[4] Benamara, Farah, Carmine Cesarano, Antonio

Picariello, Diego Reforgiatoand VS Subrahmanian,

“Sentiment analysis: Adjectives and adverbs are

better than adjectives alone” International

Conference on Weblogs and Social Media,

ICWSM, Boulder, CO. 2007.

[5] Melville, WojciechGryc, “Sentiment Analysis of

Blogs by Combining Lexical Knowledge with Text

Classification”, KDD‟09, June 28–July 1, 2009,

Paris, France.Copyright 2009 ACM

978-1-60558-495-9/09/06.

[6] Rui Xia, ChengqingZong, Shoushan Li, “Ensemble

of feature sets and classification algorithms for

sentiment classification”, Information Sciences 181

(2011) 1138–1152.

[7] Ziqiong Zhang, Qiang Ye, Zili Zhang, Yijun Li,

“Sentiment classification of Internet restaurant

reviews written in Cantonese”, Expert Systems with

Applications xxx (2011) xxx–xxx.

[8] Songbo Tan, Jin Zhang, “An empirical study of

sentiment analysis for chinese documents”, Expert

Systems with Applications 34 (2008) 2622–2629.

[9] Qiang Ye, Ziqiong Zhang, Rob Law, “Sentiment

classification of online reviews to travel

destinations by supervised machine learning

approaches”, Expert Systems with Applications 36

(2009) 6527–6535.

[10] Ion SMEUREANU, Cristian BUCUR, “Applying

Supervised Opinion Mining Techniques on Online

User Reviews”, InformaticaEconomică vol. 16, no.

2/2012.

[11] Rudy Prabowo, Mike Thelwall, “Sentiment

analysis: A combined approach.” Journal of

Informetrics 3 (2009) 143–157.

[12] KaiquanXu , Stephen Shaoyi Liao , Jiexun Li,

Yuxia Song, “Mining comparative opinions from

customer reviews for Competitive Intelligence”,

Decision Support Systems 50 (2011) 743–754.

[13] ZHU Jian , XU Chen, WANG Han-shi, “"

Sentiment classification using the theory of ANNs”,

The Journal of China Universities of Posts and

Telecommunications, July 2010, 17(Suppl.): 58–62

.[16] Ziqiong Zhang, Qiang Ye, Zili Zhang, Yijun

Li, “Sentiment classification of Internet restaurant

reviews written in Cantonese”, Expert Systems with

Applications xxx (2011)

[14] Long-Sheng Chen, Cheng-Hsiang Liu, Hui-Ju Chiu,

“A neural network based approach for sentiment

classification in the blogosphere”, Journal of

Informetrics 5 (2011) 313–322.

[15] Kamps, Maarten Marx, Robert J. Mokken and

Maarten De Rijke, “Using wordnet to measure

semantic orientation of adjectives”, Proceedings of

4th International Conference on Language

Resources and Evaluation, pp. 1115-1118, Lisbon,

Portugal, 2004.

[16] Andrea Esuli and FabrizioSebastiani, “Determining

the semantic orientation of terms through gloss

classification”, Proceedings of 14th ACM

International Conference on Information and

Knowledge Management,pp. 617-624, Bremen,

Germany, 2005.

[17] Chunxu Wu, LingfengShen, “A New Method of

Using Contextual Information to Infer the Semantic

Orientations of Context Dependent Opinions”, 2009

International Conference on Artificial Intelligence

and Computational Intelligence.

[18] Ting-Chun Peng and Chia-Chun Shih , “An

Unsupervised Snippet-based Sentiment

Classification Method for Chinese Unknown

Phrases without using Reference Word Pairs”, 2010

IEEE/WIC/ACM International Conference on Web

Intelligence and intelligent Agent Technology

JOURNAL OF COMPUTING, VOLUME 2,

ISSUE 8, AUGUST 2010, ISSN 2151-9617 .

[19] Gang Li, Fei Liu, “A Clustering-based Approach on

Sentiment Analysis”, 2010, 978-1-4244-6793-8/10

©2010 IEEE.

[20] PrabuPalanisamy, VineetYadav, HarshaElchuri,

“Serendio: Simple and Practical lexicon based

approach to Sentiment Analysis”, Serendio

Software Pvt Ltd, 2013.