
Twitter sentiment analysis on immigration

Author: Radu Bogdan Pertescu

Supervisor: Dr. Sophia Ananiadou

A project report submitted to the University of Manchester for the Bachelor of

Science with Industrial Experience as part of the Third Year Project

(COMP30040).


Abstract

Social media plays an important role in how people communicate today. Many users of social media platforms send messages, discuss ideas, make complaints and review products, all of which represent an important source of information for companies, politicians and researchers who want to understand how people react to different topics. This report proposes the development of a tool which automates the process of data extraction, data pre-processing and sentiment analysis.

Sentiment is determined using two independent classifiers: a lexicon-based classifier and a machine learning classifier. This report also demonstrates how the lexicon classifier provides the training corpus required for the machine learning algorithm. In addition, an interface is implemented to show how the classification tools can be easily accessed by the user to create an analysis of immigration on social media. In the experiments carried out, the best results were obtained using a machine learning classifier based on the Naïve Bayes algorithm, extracting only 20% of the features and resulting in approximately 80% accuracy.


Acknowledgements

First of all, I would like to thank my supervisor Dr. Sophia Ananiadou who provided me with

great advice and materials along with answers to my unlimited number of questions for the

entire period of the project.

I also want to thank my family for financial and moral support during the journey through

university allowing me to be exposed to such a great learning environment and grasp a vast

amount of skills.

Lastly, I want to thank all my friends who took the time to test the application and provided me with feedback and suggestions.


Table of contents

Chapter 1 Context
1.1 Introduction
1.2 Sentiment Analysis in microblogging
1.3 Motivation
1.4 Project Outline
1.4 Related work
1.5 Report Structure
Chapter 2 Design
2.1 System Diagram
2.2 Project Lifecycle
2.3 Data Extraction
2.3.1 Motivation
2.3.2 Twitter connectivity
2.4 Data pre-processing
2.5 Sentiment Evaluation Methods
Chapter 3 Implementation
3.1 Data Extraction
3.2 Text Normalization
3.2.1 JSON removal
3.2.2 Lower case transformation and Tokenization
3.2.3 Removing Usernames
3.2.5 Removing URLs
3.2.6 Removing duplicates and white spaces
3.2.7 Possible implementations and approaches
3.3 Sentiment Analysis
3.3.1 Motivation
3.3.2 Lexicon Based classifier
3.3.3 Machine Learning classifier
3.3.4 Feature Extraction
3.3.5 Multinomial Naïve Bayes
3.3.6 Classification Process
3.3.7 Combining two different classifiers
3.3.8 Previous approaches and possible implementations
3.4 User Interface
3.4.1 Motivation
3.4.2 Django
3.4.3 User Interface Structure
Chapter 4 Evaluation and Testing
4.1 Beta Testing
4.2 Unit Testing
4.3 Further Testing
4.4 Evaluation
Chapter 5 Reflection and Conclusion
5.1 Requirements
5.2 Learning process
5.3 Challenges
5.4 Future Enhancements
5.4.1 Different classifier implementations
References
Appendix


Table of Figures

Figure (1) Diagram of the system
Figure (2) The process of extracting the tweets
Figure (3) The tweets collected in a file
Figure (4) A tweet in JSON format
Figure (5) One of the tests performed with the spell checker
Figure (6) The formula of Naïve Bayes algorithm
Figure (7) The first part of the interface
Figure (8) The file upload
Figure (9) The ability of saving the chart
Figure (10) The second part of the interface
Figure (11) Result of the classification
Figure (12) The textbox
Figure (13) The TerMine service in highlighted words


Acronyms

TM – Text Mining

NLP – Natural Language Processing

ML – Machine Learning

POS – Part Of Speech

CSV – Comma-Separated Values

Regex – Regular expression

NLTK – Natural Language Toolkit

MVC – Model View Controller

SKLearn – scikit-learn

IE – Information Extraction

IR – Information Retrieval

DM – Data Mining

SVM – Support Vector Machines


Chapter 1 – Context

This chapter presents a general description of text mining, including the aims and objectives of the project and the applicability of a sentiment analysis tool for Twitter.

1.1 Introduction

Text mining can be characterized as the process of analysing natural language text in order to

extract information that is useful for specific purposes. [1]

Text mining is achieved using three major sub-components:

Information Retrieval (IR) – Represents the process of gathering the data required

for analysis.

Information Extraction (IE) – Represents the process of structuring the data using

different Natural Language Processing techniques.

Data Mining (DM) – Represents the process of discovering hidden associations and

creating new knowledge.

Sentiment Analysis (also known as Opinion Mining) is a common topic among TM researchers and represents the process of categorising a text as positive, negative or neutral. [61]

This project uses TM techniques in order to extract and analyse people’s opinions on Twitter, resulting in an overall analysis of the sentiment.

Twitter has become a very important component in Natural Language Processing, as it provides a large amount of real-time information which can be processed and used in different ways in Text Mining and Natural Language Processing.

Microblogging websites are today an extensive source of information on various topics. This is a result of people using social media to post real-time messages to present their opinions, make complaints, discuss different topics and post reviews of products they use in everyday life. As a result of this ongoing trend, manufacturers are using social media to investigate customers’ reactions and opinions in order to improve their products as well as maintain a high standard of customer service. One particular challenge is building software which automates the process of identifying a general opinion or sentiment. [2]

Microblogging websites have become extremely popular, providing users with unlimited real-time information and the possibility of expressing their opinions about specific products they use or topics of interest, along with making complaints and discussing recent events. [1]


1.3 Project Outline

The aim of the project is to analyse textual data extracted from social media; more specifically, to observe trends related to the views on global immigration expressed on Twitter and, furthermore, to determine the opinion of individuals on this particular topic.

In recent years, immigration has become a controversial topic, discussed around the world. Over time, an increasing number of politicians have touched on this subject.

In order to understand trends and opinions about immigration in social media, we need automated methods such as text mining and opinion mining. Such tools are discussed further in this report.

We shall demonstrate how the application integrates Text Mining (TM) tools with visual analytics, offering the user an extensive analysis through the interface: the collective opinions can be visualised for a specific period of time, and each tweet can also be analysed individually.

Furthermore, to extend the capabilities of the application, we shall demonstrate how two different sentiment classifiers are implemented. The system is also designed to allow users to choose the desired classifier, allowing them to compare the results.

The application provides the user with a text box in which different sentences and reviews can be added, in order to determine the sentiment expressed by the authors. Further to this, the system can also make a call to a web service named TerMine, provided by NaCTeM [33]. Through this service, the inputted text is analysed and, as a result, the most important words in the text are highlighted. By using this, the user can easily determine the content of the text and decide whether the text is of interest for their needs.

1.4 Related work

Previous approaches include hand-coded rules (Neviarouskaya et al., 2010), the winnow

algorithm (Alm et al., 2005), random k-label sets (Bhowmick et al., 2009), Support Vector

Machines (SVM) (Koppel and Schler, 2006), and Naive Bayes (Mihalcea and Liu, 2006).[4]

One project which aims to analyse product reviews is (Fang and Zhan, 2015). They propose a tool able to classify at sentence level, studying individual sentences and extracting their polarity, and at review level, generating the overall sentiment for the entire review. The system uses three different machine learning classifiers: a Naïve Bayes classifier, a Random Forest classifier and a Support Vector Machine (SVM) classifier.

In addition to social media, sentiment analysis is performed on numerous forums and domain-specific blogs. (Zhao et al., 2014) propose a tool which analyses Online Health Communities (OHC) and identifies the most influential users, so that the results can assist the community and the OHC users. They use a specially built classifier able to handle words which may appear in both positive and negative content, such as “cancer”, and to categorise them correctly, as the term “cancer” usually relates to a negative sentiment.

As described in section 1.2, Twitter can be a valuable source of information regarding different domains of interest. In addition, Twitter hosts various discussions while a political campaign is under way, which makes it necessary to perform sentiment analysis in order to better understand users’ opinions. (Bakliwal et al., 2013) describe various experiments carried out on the Irish General Election of February 2011. They use a naïve lexicon-based classifier which uses a sentiment lexicon to determine the sentiment orientation of political tweets. After excluding the tweets containing sarcasm, the highest accuracy obtained was 61.6%.

One of the papers focuses on researching social media with regard to medicine. This is performed in order to get a better understanding of how people react to some medications, treatments and diagnoses. Further to this, the relation between specific treatments and possible symptoms can be studied in greater detail, allowing the drug

1.5 Report Structure

This chapter presented the reader with an introduction to Sentiment Analysis, along with its general purpose and related work carried out by different practitioners.

Chapter 2 Design:

The chapter will present the architecture of the project. In addition, different approaches

tested in order to achieve my results will also be included.

Chapter 3 Implementation:

This chapter presents a more detailed description of the steps followed alongside the tools

used for the project implementation.

Chapter 4 Testing and Evaluation:

This chapter contains the evaluation and tests carried out.

Chapter 5 Reflection and Conclusion:

This chapter presents the overall conclusion of the project along with any further

enhancements that can be carried out.

Chapter 2 – Design

The project was designed to allow an extensive analysis of a specific topic, based on information extracted from social media. The system is able to extract information from Twitter, normalise this data, determine the sentiment of the text and categorise it into three groups: positive, negative and neutral. Furthermore, this chapter presents a high-level view of the subcomponents of the system, along with their functionality and purpose.

2.1 System diagram

The diagram below presents a high level design of the system, showcasing how the

components interact with each other.


Figure 1 – Diagram of the system.

2.2 Project lifecycle

In order to begin the project, the first step was to gain an understanding of Sentiment Analysis by evaluating the tools and resources needed for development. Some of the essential resources I used to gain this understanding were “Opinion Mining and Sentiment Analysis” (Pang and Lee) [4] and “Natural Language Processing with Python” [5]. Both offer examples of different approaches to Sentiment Analysis, including code examples. Using these resources, a couple of the examples mentioned by the authors were tried out, in order to understand how the components are interrelated.

At the start of the project, various software development approaches were also evaluated, in order to maximise the results whilst ensuring that the best possible tool is delivered. Therefore, the project adopted a few Agile practices, which proved to be a great decision for the development, considering the numerous approaches to sentiment classifiers encountered and the need to easily change the code to comply with each of them. Moreover, if the system had not been developed in separate independent modules, rewriting the code from scratch would have been unavoidable. As part of these practices, an online task board was used, which mainly assisted in dividing the requirements into small parts and observing the progress. As a result of using this practice, better time management and better task management were obtained. Furthermore, the use of retrospectives provided the option of looking back and checking how the development progressed after every iteration, which enhanced the development in future iterations.

2.3 Data Extraction

2.3.1 Motivation

A major part of this analysis is the data extraction, which provides the necessary information regarding the project’s topic, immigration. It was also necessary to identify which social media platform is well suited to data extraction. My previous experience with the Twitter API, the reliability of the Twitter Streaming services and the amount of information available on Twitter played an important role in choosing the platform.

2.3.2 Twitter connectivity


In order to extract the data we need to connect to Twitter. Twitter provides a Streaming API which offers developers “low latency access to Twitter’s global stream of data” [6].

The Twitter Streaming API offers three different services:

Public streams - “Streams of the public data flowing through Twitter”

User streams - “Single-user streams, containing roughly all of the data corresponding

with a single user’s view of Twitter”

Site streams - “The multi-user version of user streams”[7]

In the system, the public stream is used in order to retrieve all public data based on a specific keyword. As a result of the call to the Twitter Streaming API, all the results containing the specified keywords are retrieved and saved in a designated file.

2.4 Data pre-processing

Extracting the sentiment from a tweet is not a trivial matter, as the data found on microblogging websites contains slang, abbreviations and Twitter-specific symbols. The tweet needs to be cleared of URLs, @ mentions and other Twitter-specific symbols such as ‘#’, whilst keeping the text of the hashtag, as it can contain an important reference to the sentiment of the tweet.

The pre-processing is done using different regexes, which scan through the tweets and remove the undesired data, leaving the tweet clean and ready for analysis.
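The cleaning step described above can be illustrated with a minimal sketch. The report does not list the exact regexes used by the system, so the three patterns and the function name clean_tweet below are assumptions chosen only to show the idea: URLs and @ mentions are removed, while the ‘#’ symbol is dropped and the hashtag text is kept.

```python
import re

# Hypothetical patterns; the exact regexes used by the system are not given in the report.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text):
    """Remove URLs and @ mentions, drop '#' but keep the hashtag text."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.replace("#", "")
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(clean_tweet("@bob Great debate on #immigration today http://t.co/abc123"))
```

URLs are removed first so that an “@” or “#” appearing inside a link is not processed twice.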

2.5 Sentiment evaluation methods

There are different methods and techniques used for performing sentiment analysis; they divide into two major categories:

Machine Learning approaches

Lexicon Approaches

This project uses two distinct approaches in order to compare their results; in addition, one can be used as a training data provider for the other.

The first approach is a lexicon approach, which is based on a lexicon containing words along with their pre-labelled sentiments.

The second approach is a machine learning approach, which uses probabilistic methods in order to determine the sentiment of a tweet. This process requires a pre-trained classifier.
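To make the relation between the two approaches more concrete, the sketch below shows how a lexicon-based score could be turned into a label, and how those labels could form a training corpus for a machine learning classifier. It is only an illustration: the toy lexicon, the scoring rule (a plain sum of word scores) and the function names are assumptions, and the lexicon actually used by the system is far larger.

```python
# Toy lexicon for illustration only; a real sentiment lexicon is far larger.
TOY_LEXICON = {
    "good": 1, "great": 1, "welcome": 1,
    "bad": -1, "terrible": -1, "fear": -1,
}


def lexicon_label(text, lexicon=TOY_LEXICON):
    """Sum the pre-labelled scores of the words in the text and map the sum to a label."""
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def build_training_corpus(tweets):
    """Pair each tweet with its lexicon label so that a learner can be trained on it."""
    return [(tweet, lexicon_label(tweet)) for tweet in tweets]
```

For example, build_training_corpus(["a great idea", "terrible policy"]) pairs the first tweet with “positive” and the second with “negative”.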

Chapter 3 - Implementation



This chapter presents all the components of the system including a detailed step by step

description of approaches and tools used in the development.

3.1 Data Extraction

The first step in extracting the necessary data for the analysis is to decide on the source of the extraction. This project uses Twitter as a source of information, as it provides the user with a very large amount of data.

In order to obtain a comprehensive analysis, we need to extract a large number of tweets, which makes automating the data extraction absolutely necessary. As mentioned in section 2.3.2, Twitter provides an API which allows the user to extract the data. API stands for Application Programming Interface, which allows the user to connect with the API provider in order to access certain services or tools [8]. This project makes use of the Twitter Streaming API. In order to connect to the API we require four different API keys from Twitter:

Consumer Key

Consumer Secret

Access Token

Access Token Secret

These four keys are used to issue authorised requests to the Twitter Streaming API. [9]

These API keys can be obtained by accessing “apps.twitter.com/” and creating a Twitter app

where Consumer Key and Consumer Secret will be available. In order to obtain the Access

Token and Access Token Secret, Twitter provides a function to generate these which can be

accessed from the Twitter app created.

Furthermore, the system uses a Python library called Tweepy [10], which reduces the complexity of creating the connection to the Twitter API by handling the creation of the session, the connection to the Twitter Streaming API, authentication and exiting the session. Tweepy also reads the incoming messages received from Twitter, making it easy to connect to the Twitter Streaming API. [10]

Tweepy’s stream object, which is created from a Stream Listener, provides a function called filter which allows the user to define specific keywords used to filter the retrieved tweets.

The project focuses specifically on immigration, so the keyword provided to the stream is “immigration”. In this way only tweets containing this keyword or related to the topic are extracted. (Figure 2)
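The extraction steps above can be sketched as follows. This is an illustration rather than the code of the system: it assumes the Tweepy 3.x interface (StreamListener, OAuthHandler and Stream(auth, listener)), which later Tweepy versions changed, and the function names, the keys dictionary and the output path are hypothetical.

```python
def append_raw_tweet(raw_data, out_path):
    """Append one raw JSON tweet, as received from the stream, to a file."""
    with open(out_path, "a", encoding="utf-8") as out_file:
        out_file.write(raw_data.strip() + "\n")


def stream_tweets(keys, keyword, out_path):
    """Open a filtered public stream and save every matching tweet (Tweepy 3.x assumed)."""
    import tweepy  # imported here so that the helper above works without Tweepy

    class FileListener(tweepy.StreamListener):
        def on_data(self, raw_data):
            append_raw_tweet(raw_data, out_path)
            return True  # keep the stream open

        def on_error(self, status_code):
            return False  # disconnect on errors such as rate limiting

    # The four API keys listed above are passed in through a hypothetical dictionary.
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    tweepy.Stream(auth, FileListener()).filter(track=[keyword])
```

With valid keys, a call such as stream_tweets(keys, "immigration", "tweets.json") would run until the connection is closed, writing one raw tweet per line.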


Figure 2 - The process of extracting the tweets.


Once the session is active the tweets are saved in a file from where they will be processed

later. (Figure 3)

Figure 3 - The tweets collected in a file.

3.2 Text normalisation

Twitter has become a very important component in Natural Language Processing, as it provides a large amount of real-time information which can be processed and used in different ways in Text Mining and Natural Language Processing. The information present on Twitter can range from formal reports to meaningless messages, mostly written in a Twitter-specific language. [11] From my observation, most of these tweets contain typos, abbreviations, emoticons, URLs and other Twitter-specific symbols and expressions. In order to be suitable for Sentiment Analysis, the tweets have to be normalised, which involves different “cleaning techniques” that are presented in this section.

3.2.1 JSON Removal

Returning to the data we extracted from Twitter, it needs to be mentioned that the Twitter Streaming API returns the tweets in JSON format, which stands for JavaScript Object Notation. [12]

The extracted tweet contains various information about the user [13] (Figure 4):

“created_at” – date and time when the tweet was created

“id” – a unique value assigned to every user

“text” – the content of the tweet

“source” – the application from where the tweet was sent

“screen_name” – the name of the author

“location” – the location of the author

“description” – the biography of the author

“friends_count” – the number of friends of the specific user

“retweet_count” – how many times the tweet was retweeted

“lang” – the language of the tweet

Figure 4 - A tweet in JSON format

Although the raw tweet contains a lot of information, we only need the text of the tweet. We therefore load the JSON, parse through the tweets file and write only the text of each raw tweet to a separate file.
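A minimal sketch of this step is given below. It assumes that the tweets file holds one JSON object per line, as written during extraction; the function name extract_texts and the handling of blank or malformed lines are illustrative choices, not a description of the system’s code.

```python
import json


def extract_texts(raw_path, text_path):
    """Write the "text" field of every raw tweet to a separate file."""
    written = 0
    with open(raw_path, encoding="utf-8") as raw_file, \
            open(text_path, "w", encoding="utf-8") as text_file:
        for line in raw_file:
            line = line.strip()
            if not line:
                continue  # the stream may contain blank keep-alive lines
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip truncated or malformed records
            if isinstance(tweet, dict) and "text" in tweet:
                # Keep one tweet per line by flattening embedded newlines.
                text_file.write(tweet["text"].replace("\n", " ") + "\n")
                written += 1
    return written
```

Records without a “text” field, such as deletion notices sent on the same stream, are skipped rather than treated as errors.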

In a different scenario, the other information contained in the raw tweet could also be handled in different ways, according to the requirements of the project. In this way the tweets could be filtered by location, selecting different regions or languages.

3.2.2 Tokenization

Lower case transformation and tokenization are processes which transform all the text to lower case and divide the character stream into separate tokens. [14] These processes are used for the lexical analysis carried out by the lexicon classifier and are performed using the SentLex library. [15]
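The effect of these two processes can be shown with a small sketch. It does not reproduce the tokenizer of the SentLex library; the pattern below, which keeps runs of letters, digits and apostrophes, is an assumption made only for illustration.

```python
import re

# Hypothetical token pattern: runs of lower-case letters, digits and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Transform the text to lower case and divide it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Immigration is NOT a crime!"))
```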

3.2.3 Removing usernames

One of the ways information spreads over Twitter is by users tagging other users in their tweets; in response, most users reply to the initial tweet using the same tagging principle. Tagging is done by adding “@” followed by the name of the user. As a result, a tweet can contain numerous user tags that can affect the sentiment detection. The system parses each tweet in a file, looking for “@” symbols and removing the user tags as part of the normalisation process.

3.2.4 Removing URLs

Similar to the user tags mentioned above, tweets usually contain links to different attachments or websites. These expressions do not carry any sentiment, and their removal enhances the precision of the lexicon classifier and increases the accuracy of the machine learning classifier, as will be presented in the next chapter. As a result, the system uses regexes to parse the tweets, identify the patterns corresponding to a URL and remove them. Regex stands for “Regular Expression”; regexes are text strings used for describing patterns.


3.2.5 Removing duplicates and white spaces

The Twitter Streaming API returns all the tweets containing the designated keyword. Amongst these, retweets are also included, which can lead to less efficient training data for the ML classifier. In order to avoid this, the system parses the file and saves each line in a set, then writes the tweets to a separate file. A set is used because it is a data structure which contains only unique elements; as a result, the duplicates are removed.
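The set-based approach can be sketched as below. The function name and the file handling are illustrative assumptions; unlike writing out the set directly, this variant uses the set only to remember which lines have been seen, so that the first-seen order of the tweets is preserved. Only exact duplicates are removed.

```python
def remove_duplicates(in_path, out_path):
    """Copy the unique, non-empty lines of in_path to out_path in first-seen order."""
    seen = set()
    with open(in_path, encoding="utf-8") as in_file, \
            open(out_path, "w", encoding="utf-8") as out_file:
        for line in in_file:
            tweet = " ".join(line.split())  # trim and collapse white space
            if tweet and tweet not in seen:
                seen.add(tweet)
                out_file.write(tweet + "\n")
    return len(seen)
```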

3.2.6 Possible implementations and approaches

One text normalisation practice is stop-word removal. Stop words are words which appear commonly in sentences and do not carry any sentiment. [16] Usually these are linking words such as “to”, “a”, “and”, “how” and “when”. Removing them from the sentence allows the system to focus on the most important words, which are linked to a sentiment.

Stop-word removal has a wide range of applications, including:

Supervised Machine Learning

Clustering

Information Retrieval

Text summarization [17]

My attempt to implement stop-word removal used the NLTK stop-word list, but it resulted in a

lower accuracy than before removing the stop words. Most pre-compiled stop-word lists have a

negative impact on classification performance, and Naïve Bayes classifiers are more sensitive

to stop-word removal than Maximum Entropy classifiers. [18] This system combined a Naïve Bayes

classifier with a pre-compiled stop-word list, which resulted in low classification

performance. For this reason stop-word removal was excluded from the normalization.
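
For illustration only, the filtering step that was tried and then dropped could look like the sketch below. The short `STOP_WORDS` set is a hypothetical stand-in for the NLTK list, and the function name is an assumption.

```python
# Hypothetical miniature stop-word list; the attempt described above used the NLTK list.
STOP_WORDS = {"to", "a", "and", "how", "when", "the", "is"}


def remove_stop_words(tweet):
    """Drop stop words from a tweet, comparing case-insensitively."""
    kept = [word for word in tweet.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)
```

With this list, "How to welcome a refugee and when" is reduced to "welcome refugee".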

As the introduction of this chapter mentions, Twitter contains slang, abbreviations and Twitter

specific expressions. In order to handle this I attempted to use the spell checker provided

by TextBlob [19]. The tests showed that the spell checker corrected words with extra letters,

such as “Gooooood” into “Good”, but in some cases it did not perform as expected. (Figure 5)

Such a sentence could consequently be classified wrongly, so the spell checker was not

used in the project. The spelling corrector performs with an accuracy of about 70% and

is implemented using the pattern library. [19] As a result of the limited time for development

a different spelling corrector was not implemented.

Figure 5 - One of the tests performed with the spell checker

3.3 Sentiment Analysis

In order to perform the desired classification this system uses two different approaches:

Lexicon based approach

Machine Learning approach

3.3.1 Motivation

The motivation for using two different classifiers was to gain an understanding of how

sentiment analysis can be performed using both lexicon-based and machine learning

approaches, and to compare their results in order to analyse when each of the classifiers

obtains the better performance.

3.3.2 Lexicon Based Approach

According to Kaushik and Mishra (2014), lexicon-based approaches for sentiment analysis

are based on the assumption that the overall sentiment of a piece of text is determined by the

sum of the sentiment of its individual words or phrases. [20]

In order to determine the sentiment for the tweets the approach makes use of lexicons.

Lexicons are dictionaries containing words along with their POS tag and polarity represented

by positive and negative scores.

POS tagging stands for Part-Of-Speech tagging. A POS tagger is described by the Stanford

Natural Language Processing Group as a piece of software which takes a text as input and

assigns part-of-speech tags such as noun, adverb or adjective to each word.

In order to determine the sentiment in a piece of text the system performs POS tagging

on the tweet, followed by the application of different algorithms to calculate the overall

polarity of the text. The system uses SentLex [15], a Python library which performs

lexicon-based analysis.

One of the algorithms used is negation detection, which determines whether a word is

negated and results in a better performance of the classifier.

In the context of this project the system was required to return a three-way classification,

defined as positive, negative and neutral. Since SentLex [15] provided only positive and

negative tags for the input tweets, the library had to be modified to also return the

neutrality of the sentence. Consequently, an algorithm was implemented which extracts the

positive and negative score for each tweet, calculates the absolute value of the difference

between the two and compares it with two pre-defined boundaries which represent the values

for neutrality.

Lower bound < absolute value (positive_score – negative_score) < Upper bound

In this way the neutrality could be determined.
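
The neutrality test can be sketched as below. The function name and the boundary values are hypothetical, since the report does not state the boundaries it used, and treating the lower edge as inclusive (so that an exact tie counts as neutral) is also an assumption.

```python
def three_way_label(positive_score, negative_score, lower_bound=0.0, upper_bound=0.1):
    """Label a tweet from its lexicon scores.

    The boundary values are placeholders, not the ones used by the system.
    """
    difference = abs(positive_score - negative_score)
    # Assumption: the lower edge is inclusive so that equal scores are neutral.
    if lower_bound <= difference < upper_bound:
        return "neutral"
    return "positive" if positive_score > negative_score else "negative"
```

For example, scores of 0.50 and 0.45 differ by less than the placeholder upper bound and would be labelled neutral, whereas scores of 0.8 and 0.1 would be labelled positive.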


3.3.3 Machine learning classifier

Sentiment analysis approaches can be divided into two categories:

Supervised Learning models

Unsupervised Learning models

Supervised learning refers to those models where the training data contains the input data

along with the wanted outcome. In addition, they have to achieve acceptable results for data

not observed during the training. [22]

Unsupervised learning describes a model which is not provided with pre-labelled inputs, so

the model has to find relations between the data provided or search for patterns. Peter

Dayan states that: “Unsupervised learning studies how systems can learn to represent

particular input patterns in a way that reflects the statistical structure of the overall collection

of input patterns”. [23]

The motivation for using a supervised approach for the machine learning classifier is that, for

sentiment analysis, supervised approaches tend to have a higher performance provided that

the test data is similar to the training data. [24]

In the paper “Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog

Corpus”, Schrauwen (2010) defines supervised machine learning as those techniques which use a

pre-labelled training corpus in order to learn a certain property or learn from examples [25].

In the case of this project the system is provided with a pre-labelled corpus of 200000

positive tweets and 200000 negative tweets. This training corpus was taken from

Sentiment140 [26]. In order for the training corpus to be used by the system, the data had to

be pre-processed into a format such as Tweet, Label. As the data was saved in a CSV file,

Microsoft Word was used to change the formatting.

Machine learning for sentiment analysis can be developed using different algorithms. The

most popular are [25]:

Naïve Bayes

Maximum Entropy

Support Vector Machines

Decision Trees

According to Go et al. (2009) [26], it is not certain which of the algorithms mentioned above

performs better. The algorithm selected for this project is Naïve Bayes, as it is simple to use

[27] and offers an accurate classification (Manning et al., 2008) [28].

The algorithm is based on Bayes' theorem and is efficient on large amounts of input data. The

Naïve Bayes model uses the maximum likelihood method and most of the time performs well in

difficult real situations in spite of its over-simplified assumptions. [29]

3.3.4 Feature Extraction

The feature selection process is the automatic selection of the most important attributes from

the training corpus [30].

Research has shown that large feature representations can be effective for tasks in NLP. On

the other hand, making use of an excessive number of features does not always increase the

performance of the classifiers. The features can contain unnecessary information, which can

result in noisy feature representations. [31]

In sentiment analysis and opinion mining, aspect extraction aims to extract entity aspects or

features on which opinions have been expressed (Hu and Liu, 2004; Liu, 2012) [5].

In this project SelectPercentile [32] and CountVectorizer [33], provided by the scikit-learn

[34] library, act as a feature selector used to extract the 20% most important features of the

training data. Numerous tests were performed with 10%, 20%, 30% and 40% of the features, but

the highest accuracy was obtained when extracting only 20% of the features.

3.3.5 Multinomial Naïve Bayes (NB)

This system uses a multinomial Naïve Bayes classifier for sentiment analysis.

Figure 6 - The formula of Naïve Bayes algorithm [100]

The Naïve Bayes algorithm is based on maximum likelihood. In the formula presented above

𝒇 describes a feature while 𝒏𝒊(d) represents the count of feature 𝒇𝒊 in the tweet d. P refers

to the probability that an event takes place, and P(c) and P(𝒇 |c) are obtained using maximum

likelihood estimates. [100]
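
To make the terms of the formula concrete, the sketch below implements a small multinomial Naïve Bayes classifier in plain Python. It assumes the usual form in which the score of a class c for a tweet d is P(c) multiplied by P(𝒇𝒊|c) raised to the power 𝒏𝒊(d) for every feature, with the constant P(d) left out. Whitespace tokenisation, add-one smoothing and the names `train` and `predict` are assumptions of this sketch; the system itself relies on the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict


def train(labelled_tweets):
    """Collect class counts and per-class feature counts from (text, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_tweets:
        tokens = text.lower().split()
        class_counts[label] += 1
        feature_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, feature_counts, vocabulary


def predict(text, model):
    """Return the class c maximising log P(c) + sum of n_i(d) * log P(f_i | c)."""
    class_counts, feature_counts, vocabulary = model
    total_tweets = sum(class_counts.values())
    # n_i(d): how often each known feature occurs in this tweet
    tweet_counts = Counter(t for t in text.lower().split() if t in vocabulary)
    best_label, best_score = None, -math.inf
    for label in class_counts:
        class_total = sum(feature_counts[label].values())
        score = math.log(class_counts[label] / total_tweets)  # log P(c)
        for feature, n_i in tweet_counts.items():
            # Add-one smoothed estimate of P(f_i | c) (smoothing is an assumption)
            p = (feature_counts[label][feature] + 1) / (class_total + len(vocabulary))
            score += n_i * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working with logarithms turns the product in the formula into a sum, which avoids numerical underflow on longer tweets.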

3.3.6 Classification Process

The process of determining the sentiment of a tweet involves a few steps. As presented

in subsection 2.5, for the machine learning classifier to be able to categorise the tweets it

first needs to be trained, as presented in section 3.3.

After the classifier is trained it can be saved as a pickle object so that it can be reused

for later classifications.

For a better handling of the training process, the Python library used for development,

scikit-learn, provides a pipeline which is capable of combining the vectoriser and percentile

selector used for feature selection with the desired algorithm for classification, in this

case Multinomial Naïve Bayes.

After the pipeline is created, training is performed by providing the pipeline with the

training corpus.

After the training is done and the classifier is saved as a pickle object, it can be

instantiated in a separate class and used for classification.

As a result of this implementation, the classifier was capable of binary classification of

tweets into two categories, positive and negative.

As mentioned in section 3.3.2, the requirements for the project called for a three-way

classifier capable of detecting neutrality in addition to the positivity and negativity of

tweets. In this case the system uses a function provided by the scikit-learn library which

returns the probability of a specific piece of text being negative or positive. These

probabilities are displayed in the interface, which allows the user to determine the

neutrality of the text.
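
The steps of this subsection can be sketched with scikit-learn as follows. The toy corpus is a hypothetical stand-in for the Sentiment140 data, `chi2` is an assumed scoring function (the report does not name the one it used), and `predict_proba` is assumed to be the probability function referred to above.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Sentiment140 training data.
tweets = [
    "great welcome for refugees", "love this open policy",
    "happy to help newcomers", "wonderful community support",
    "awful chaos at the border", "hate this broken policy",
    "terrible handling of claims", "angry about the delays",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("vectoriser", CountVectorizer()),
    # chi2 is an assumed scoring function; 20 keeps the top 20% of the features.
    ("selector", SelectPercentile(chi2, percentile=20)),
    ("classifier", MultinomialNB()),
])
pipeline.fit(tweets, labels)

# Serialise the trained pipeline and load it again, as a separate class would.
restored = pickle.loads(pickle.dumps(pipeline))

# Probability of each class for a new tweet, keyed by class label.
probabilities = restored.predict_proba(["awful chaos and delays"])[0]
by_class = dict(zip(restored.named_steps["classifier"].classes_, probabilities))
```

In the real system the pickled object would be written to a file (for example with `pickle.dump`) rather than kept in memory, and two probabilities close to each other would suggest a neutral tweet.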

3.3.7 Combining two different classifiers

A different approach to sentiment analysis integrated in this system was influenced by the

project developed by HP [35]. It makes use of the lexicon-based classifier described in

section 3.3.2 as a provider of labelled data, necessary for training the classifier presented

in section 3.3.3.

In order to do so, the algorithm used for determining neutrality in the lexicon-based

classifier, presented in section 3.3.2, was used to identify the tweets with highly negative

and highly positive sentiment. A set of 2000 pre-labelled tweets was created and added to the

training corpus provided by Sentiment140. This approach is extremely useful in situations

where a specific training corpus does not exist, as in non-English speaking countries, or

in a specific domain such as medicine, where a training corpus based on movie reviews would

not produce a well-performing classifier.

This approach also enhances the performance of the machine learning classifier by providing

training data specific to immigration in the case of this project, resulting in a better

classification of immigration-related tweets. On the other hand, this approach can also

decrease the performance of the machine learning classifier: the accuracy of a good

lexicon-based classifier needed to produce the labelled data is usually around 70 – 80%,

so 20 – 30% of the data is labelled incorrectly, which in turn leads to misclassification

by the machine learning classifier. In order to avoid this issue I decided to use only the

tweets which have a strong sentiment, leading to a more accurate labelling. To extract only

these tweets I used the algorithm presented in section 3.3.2 and set its boundary to a high

value.

3.3.8 Previous approaches and possible implementations

Initially I decided to use the Naïve Bayes algorithm for developing the classifier, which led

the later approaches to use this algorithm as well. The motivation for choosing this algorithm

is presented in section 3.3.3. Previous approaches to the machine learning classifier made use

of different Python libraries, namely the Natural Language Toolkit (NLTK) and TextBlob.

The first attempt used the TextBlob library, which is based on NLTK. The library used the

Naïve Bayes algorithm for classification, trained on a corpus of 50000 tweets. This approach

was not selected for this system because of the long time required to train the classifier.

The second attempt used the NLTK library. The motivation behind this approach was to

eliminate the extra time taken by the TextBlob library for training the classifier.

Furthermore, this library provided a lot of the functionality needed for the classification.

Using this approach, the training time for a corpus containing 400000 tweets was over 15

hours without applying feature selection. This made me realise the crucial importance of

feature selection, not only for reducing the time required to train the classifier but also

for improving its accuracy. This result led me to use the scikit-learn library, which makes it

easy to combine the feature selection functions and the desired classification algorithm

under one pipeline.

Apart from the approaches mentioned, other methods were tried but were not included in the

system due to the lack of support for the specific implementations or to poor results.

3.4 User Interface

3.4.1 Motivation

The motivation behind developing the user interface was the need to display the results in a

professional manner, together with the idea of creating the specific analysis and

visualisation tools required for an extensive analysis.

3.4.2 Django

In order to build the interface I decided to use Django, an open source framework, instead

of a classic Python GUI. This approach allowed me to build a more reliable web interface with

more design possibilities.

Django is a high-level web framework for Python applications, based on the Model View

Controller (MVC) design pattern, which provides a powerful and friendly template system for

designers. [31]

3.4.3 UI Structure

The user interface is structured in two parts. The first part of the interface provides a

long-term analysis of the immigration tweets, displaying a pie chart which is updated

automatically from a database. (Figure 7)

Figure 7 – The first part of the interface.

The database is provided by Django and by default is SQLite. The database can be updated

from the user interface by the user. (Figure 8) This action also triggers another jQuery

request which performs sentiment analysis on the tweets added to the database and updates the

pie chart.

Figure 8 – The file upload.

The pie chart template is provided by Chartit [32] and offers the user the possibility to save

a PDF copy of the pie chart or print it. (Figure 9) This functionality is an important part of

the analysis, as the user can save records of the data added for future evaluations.

Figure 9 – The ability of saving the chart.

The second part of the user interface allows the user to explore the sentiment analysis

tools more deeply, by offering the possibility of examining each tweet on its own and

performing sentiment analysis on it using the methods presented in sections 3.3.2 and

3.3.3. (Figure 10)

Figure 10 – The second part of the interface.

The user interface also provides a column chart displaying the polarity of the tweet in the

case of the lexicon-based approach, and the probability of the tweet being positive or

negative in the case of the machine learning approach. In this way the user can determine the

neutrality or the intensity of the sentiment, as presented in section 3.3.6. (Figure 11)

The column chart is provided by Chartit [32] and, similarly to the pie chart, allows the user

to save the chart in PDF format or print it for further analysis.

Figure 11 – Result of the classification.

The user interface also allows the user to input text for analysis, by displaying a text box

where the user can provide reviews or text extracted from different sources to be analysed

using both sentiment analysis methods. (Figure 12)

Figure 12 – The textbox.

Figure 13 – The TerMine service in highlighted words.

Chapter 4 - Evaluation and testing

This chapter presents the different attempts to verify the results of the classifiers, along

with the different testing methods used to make sure the system behaves as required.

The system was exposed to different approaches regarding text normalization and

sentiment analysis. Furthermore, the system was developed in independent modules able to

run separately from the other parts of the system. As a result, one of the main concerns was

whether all the components could be integrated under one system without any clashes or errors,

which would lead to a delay in providing the final result or even to a system failure that

prevented the requirements from being met. The system proved to be a success, not only

implementing all the requirements but also adding extra functionality. In addition, being

developed in independent modules, the system allowed components which produced poor results

to be quickly replaced and new ones integrated.

4.1 Beta testing

Testing and verifying the entire system was an important part of the development.

Developers who want to believe that the software they implemented has a good User

Experience often will, and are then not capable of seeing the real nature of their system.

[34] As a result I decided to let people who were not involved in the project use and test

the system in order to provide me with feedback and ideas to improve the quality of the

software and enhance the User Experience. This method proved to be extremely useful, as I

received feedback on different components of the system and ideas for creating a more

professional User Interface.

4.2 Unit Testing

In addition to the high-level testing carried out for the system, low-level testing was

performed, which involved the creation of individual unit tests for each component of the

system.

The tests were implemented for the normalization and the classifiers using PyUnit.

PyUnit is a Python language version of the unit testing framework JUnit. Following the unit

tests I was assured that the methods used by the system behave as required.
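
A unit test of this kind might look like the sketch below, written with the standard `unittest` module (PyUnit). The helper `remove_user_tags` is a hypothetical stand-in for one of the system's normalization methods, not the actual implementation.

```python
import unittest


def remove_user_tags(tweet):
    """Hypothetical normalization helper standing in for the system's own method."""
    return " ".join(word for word in tweet.split() if not word.startswith("@"))


class RemoveUserTagsTest(unittest.TestCase):
    def test_tags_are_removed(self):
        self.assertEqual(remove_user_tags("@bob welcome @alice"), "welcome")

    def test_plain_tweet_is_unchanged(self):
        self.assertEqual(remove_user_tags("welcome everyone"), "welcome everyone")
```

Saved in a module, such a test case can be run from the terminal with `python -m unittest`.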

4.3 Further Testing

Once the interface was implemented, continuous testing was carried out to evaluate whether

the classifiers categorise tweets correctly. In this way the interface was also tested in

more detail to make sure everything behaves as required. In addition, this allowed me to

observe whether my classifiers are consistent across different categories of tweets. From my

observation the tweets on immigration vary from a political view to a social view, but the

classification is not affected by this change. Furthermore, I observed that the lexicon

classifier tends to perform less accurately when tweets containing sarcasm are processed.

This was expected, as the lexicon classifier is based on a dictionary look-up and does not

contain any sarcasm detection. The machine learning classifier was able to classify some of

the tweets containing sarcasm appropriately, as it uses a probabilistic algorithm based on a

pre-labelled corpus of data.

4.4 Evaluation

In order to evaluate the performance of the classifiers, a data set of 1000 pre-labelled

tweets was used. After several tests, the highest accuracy achieved with the lexicon

classifier was 63%.

Different tests were performed in order to examine the performance of the machine

learning classifier. Different training corpus sizes were tried, along with different

percentages in feature extraction as presented in section 3.3.4. The highest accuracy was

obtained using a training corpus of approximately 400000 tweets along with 20% of the

features extracted. The machine learning classifier obtained a maximum accuracy of 80%

based on a data set of 1000 pre-labelled tweets. Various studies have demonstrated that

humans usually agree on the sentiment of a text between 70% and 80% of the time. [99]
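
Accuracy here is taken to mean the fraction of the pre-labelled tweets for which the classifier returns the same label. A minimal sketch of that calculation, with an assumed function name, is:

```python
def accuracy(predicted_labels, gold_labels):
    """Fraction of tweets whose predicted label matches the pre-assigned label."""
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("both label lists must have the same length")
    if not gold_labels:
        raise ValueError("at least one labelled tweet is required")
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)
```

On a data set of 1000 tweets, 800 matching labels would give an accuracy of 0.8, that is 80%.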

Chapter 5 - Reflection and Conclusion

5.1 Requirements

The application was required to be capable of:

Extracting data from Twitter

Performing pre-processing and normalization

Performing sentiment analysis

The requirements were implemented successfully and in addition a separate approach for

sentiment analysis was developed along with an interface to display the results.

5.2 Learning process

This project was a great opportunity for me to apply in practice the knowledge gained during

the first years of university, but most importantly it was a great learning process. During

the development I was exposed to programming languages I had never used before, such as “R”,

as part of my attempts to implement a classifier. I also had the chance to enhance my

knowledge by learning to call web services and use advanced functionality of the Python

language.

This project also introduced me to the areas of text mining, natural language processing and

machine learning, which helped me build on the previous experience gained during

university.

By developing the project in an Agile manner, using short iterations and an online

task board, I had the opportunity to apply the material taught at university in a real

project. This led to better time management and a more modular project, allowing

me to change my code often without affecting the entire system.

I also decided to create a user interface in order to display my results, where the user

can easily make use of the resources of the project without being exposed to the terminal,

a non-user-friendly interface which requires specific computing knowledge. In order to do so

I had to learn Django, a fast web framework written in Python [6] which connects the Python

code of the system to a web browser acting as the user interface.

To conclude, the development of the project was a great learning experience. It allowed me

to enhance my technical skills by learning new programming languages and advanced uses of

previously known tools, and it also increased my soft skills by improving my time management

and my awareness of what developing a large project involves.

5.3 Challenges

Having no previous experience with Sentiment Analysis, one of the first challenges I

encountered was understanding the topic and the existing work.

The data extracted from social media was a challenge as well, as it is not written in

a formal manner respecting all the rules of grammar but in a Twitter-specific way, with

people making use of abbreviations, slang and Twitter-specific features such as

hashtags, @ mentions and URLs. It was particularly challenging to adapt this data and

modify it so that it could be used by my classifiers.

Another major challenge was the implementation of the classifiers. Most of the related papers

were written using technical content specific to the topic and included mathematical

formulas and topic-specific expressions, making them difficult to understand for a reader

without prior experience. Having carried out intensive research and reading on different

forums and specialist websites, where an introduction to the domain was presented in a less

formal context, I managed to grasp the basics and understand the acronyms and topic-specific

expressions, which later allowed me to discover a wider variety of papers and projects related

to this topic.

The different attempts at implementing the classifiers using different libraries required

getting used to each of the libraries and adapting their tools to my project. By

carrying out intensive research on specialist forums, and having a background in computing, I

managed to understand the way these tools function, allowing me to obtain the desired

results.

Another major challenge was implementing the web framework in order to construct the web

interface for my project, as it required me to understand how a “model-view-controller”

architecture works and how it can be deployed in my system. Constructing the interface also

required a combination of different tools and languages such as jQuery, HTML and CSS, with

which I had little previous experience, followed by integrating all of them in order to

work with my system.

In conclusion, developing this project involved facing a lot of challenges, but determination

and a lot of research helped me to obtain the desired results.

5.4 Future enhancements

As mentioned in the previous chapter, this project has managed to achieve all the desired

results and to implement all the requirements, including additional features for a better

comparison of results and a better visualisation and handling of data. The area of sentiment

analysis is still in its early stages, with newer implementations appearing very often.

According to my research there is no implementation which guarantees better results

than another, as in this area there are many components to take into consideration. The final

results can be influenced by the source from which the data is extracted, the topic on which

the sentiment analysis is performed and many other factors. In order to obtain the best

result, many approaches need to be implemented and compared to verify which of them provides

the best results. This is what this project tried to achieve by implementing two different

approaches for the classification methods. However, being limited by time, I was not able

to test every available approach. In this subchapter I will present some of the possible

enhancements that can be added to the project in order to provide more functionality for the

user, to enhance the accuracy of the classifiers and to bring the results closer to the

state of the art.

5.4.1 Different classifier implementations

As mentioned in section 3.3.3, different implementations of machine learning

classifiers exist, such as Support Vector Classification and Maximum Entropy. The authors of

“Using Maximum Entropy for Text Classification” mention that Maximum Entropy, compared to the

Naive Bayes classifier used in this project, sometimes provides better results in text

classification but sometimes worse. [10] As a result, the implementation of

different machine learning classifiers could have resulted in a higher or lower accuracy than

that of the Naïve Bayes classifier used by this project.

References:

[1] Agarwal A., Xie B., Vovsha I., Rambow O. and Passonneau R. (2011) 'Sentiment Analysis of Twitter

Data', Proceedings of the Workshop on Languages in Social Media, pp. 30-38.

http://www.cs.columbia.edu/~julia/papers/Agarwaletal11.pdf

[2] Streaming APIs, Available at: https://dev.twitter.com/streaming/overview (Accessed: 5th April 2016).

[4] http://cs229.stanford.edu/proj2010/HsuSeeWu-

[10] Nigam K., Lafferty J. and McCallum A. (n.d.) 'Using Maximum Entropy for Text Classification'.

[7] Kerstin Denecke (2015) Health Web Science: Social Media Data for Healthcare, : Springer

International Publishing.

[8] Bhatti S. (n.d.) 'Multiclass Sentiment Analysis on Movie Reviews', pp. 1- 6.

[12] Pang B. and Lee L. (2008) 'Opinion mining and sentiment analysis', pp. 1 -90.

[13] Edward Loper (2009) Natural Language Processing With Python: Analyzing Text with the Natural

Language Toolkit, Sebastopol, California: O'Reilly.

[14] An Introduction to Text Mining using Twitter Streaming API and Python, Available

at: http://adilmoujahid.com/posts/2014/07/twitter-analytics/ (Accessed: 10th April 2016).

[15] Obtaining access tokens, Available at: https://dev.twitter.com/oauth/overview (Accessed: 10th April

2016).

[16] Streaming With Tweepy, Available

at: http://tweepy.readthedocs.org/en/v3.5.0/streaming_how_to.html (Accessed: 10th April 2016).

[17] Han B. and Baldwin T. (2011) 'Lexical Normalisation of Short Text Messages: Makn Sens a

#twitter', pp. 368-378.

[18] JSON Tutorial, Available at: http://www.w3schools.com/json/ (Accessed: 11th April 2016).

[19] A beginners guide to streamed data from Twitter, Available

at: http://mike.teczno.com/notes/streaming-data-from-twitter.html (Accessed: 11th April 2016).

[20] All About Stop Words for Text Mining and Information Retrieval, Available at: http://www.text-

analytics101.com/2014/10/all-about-stop-words-for-text-mining.html (Accessed: 11th April 2016).

[21] Saif H., Fernandez M., He Y. and Alani H. (2014) 'On Stopwords, Filtering and Data Sparsity for

Sentiment Analysis of Twitter', pp. 810 - 816.

[22] Kaushik C. and Mishra A. (2014) 'A SCALABLE, LEXICON BASED TECHNIQUE FOR SENTIMENT

ANALYSIS', International Journal in Foundations of Computer Science & Technology (IJFCST), 4(5), pp. 35

- 43.

[23] Stanford Log-linear Part-Of-Speech Tagger, Available at:

http://nlp.stanford.edu/software/tagger.shtml (Accessed: 14th April 2016).

[24] Tools and Libraries for Lexicon-Based Sentiment Analysis, Available

at: https://github.com/bohana/sentlex (Accessed: 15th April 2016).

[25] Schrauwen S. (2010) 'MACHINE LEARNING APPROACHES TO SENTIMENT ANALYSIS USING

THE DUTCH NETLOG CORPUS'.

[26] For Academics, Available at: http://help.sentiment140.com/for-students (Accessed: 16th April 2016).

[28] Mihai M. (2010) 'Naive-Bayes Classification Algorithm'.

[29] http://machinelearningmastery.com/an-introduction-to-feature-

[31] Zhang Z. (2012) 'Django Web Framework '.

[32] Django Chartit, Available at: http://chartit.shutupandship.com/ (Accessed: 18th April 2016).

[33] Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word

terms. International Journal of Digital Libraries 3(2), pp.117-132.

[34] Simon Harper (n.d.) UX from 30,000ft, : Leanpub.

[35] Zhang L., Ghosh R., Dekhil M., Hsu M. and Liu B. (2011) 'Combining Lexicon-based and Learning-

based Methods for Twitter Sentiment Analysis'.

[40] Tutorial: Quickstart, Available at: https://textblob.readthedocs.org/en/dev/quickstart.html#spelling-

correction (Accessed: 16th April 2016).

[50] Mu T., Miwa M., Tsujii J. and Ananiadou S. (2014) 'Discovering Robust Embeddings In (Dis)similarity

Space For High-Dimensional Linguistic Features', Computational Intelligence, 30(2), pp. 285-315.

[51] sklearn.feature_selection.SelectPercentile, Available at: http://scikit-

learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html (Accessed: 17th April

2016).

[52] sklearn.feature_extraction.text.CountVectorizer, Available at: http://scikit-

learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html (Accessed: 17th

April 2016).

[53] scikit-learn, Available at: http://scikit-learn.org/stable/index.html (Accessed: 17th April 2016).

[55] Huang C., Simon P., Hsieh S. and Prevot L. (2007) 'Rethinking Chinese Word Segmentation:

Tokenization, Character Classification, or Wordbreak Identification', pp. 69-72.

[60] Witten I. 'Text mining', pp. 1-21.

[61] Sentiment Analysis, Available at: https://www.lexalytics.com/technology/sentiment (Accessed: 31st April 2016).

[70] Gamallo P. and Garcia M. (2014) 'Citius: A Naive-Bayes Strategy for Sentiment Analysis on English

Tweets', Proceedings of the 8th International Workshop on Semantic Evaluation, pp. 171-175.

[100] Go A., Bhayani R. and Huang L. (n.d.) 'Twitter Sentiment Classification using Distant Supervision'.

[105] Donalek C. (2011) 'Supervised and Unsupervised Learning'.

[106] Dayan P. (n.d.) 'Unsupervised Learning', The MIT Encyclopedia of the Cognitive Sciences.

[99] SENTIMENT ANALYSIS: WHY IT’S NEVER 100% ACCURATE, Available at: http://brnrd.me/sentiment-analysis-never-accurate/ (Accessed: 31st April 2016).

[107] Rothfels J. and Tibshirani J. (2010) 'Unsupervised sentiment classification of English movie reviews

using automatic selection of positive and negative sentiment items'.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant

supervision. In CS224N Technical report. Stanford.

Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information

Retrieval. Cambridge University Press, Cambridge, MA, USA.


Zhao K, Yen J, Greer G, et al. J Am Med Inform Assoc 2014;21: 212–218.

Bakliwal A., Foster J., van der Puil J., O’Brien R., Tounsi L. and Hughes M. (2013) 'Sentiment Analysis of

Political Tweets: Towards an Accurate Classifier', Proceedings of the Workshop on Language Analysis in

Social Media, pp. 49-58.

APPENDIX

The University of Manchester logo is the property of The University of Manchester (UoM) and has

been taken from: http://www.manchester.ac.uk/

All external tools used in the development are released under open source licences, and the

author does not claim ownership of any of them.


The TerMine service used by the project and described in section … is the property of NaCTeM. For

more information please contact: [email protected].

The report, screencast and the implementation of the project are the property of The University

of Manchester.

For any enquiries regarding the use of these resources please contact:

School of Computer Science, The University of Manchester.

[email protected]
