
Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Engineering

2020 | LIU-IDA/LITH-EX-A--20/053--SE

Multilingual identification of offensive content in social media
Marc Pàmies Massip

Supervisors: Emily Öhman and Jörg Tiedemann
Examiner: Marco Kuhlmann

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Marc Pàmies Massip

Abstract

In today's society a large number of social media users are free to express their opinion on shared platforms. The socio-cultural differences between the people behind those accounts (in terms of ethnicity, gender, sexual orientation, religion, politics, ...) give rise to an important percentage of online discussions that make use of offensive language, which often negatively affects the psychological well-being of the victims. The endless stream of user-generated content therefore creates a need for an accurate and scalable solution to detect offensive language using automated methods. This thesis explores different approaches to the offensiveness detection task, focusing on five different languages: Arabic, Danish, English, Greek and Turkish. The results obtained using Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and the Bidirectional Encoder Representations from Transformers (BERT) are compared, achieving state-of-the-art results with some of the methods tested. The effect of the embeddings used, the dataset size, the class imbalance percentage and the addition of sentiment features are studied and analysed, as well as the cross-lingual capabilities of pre-trained multilingual models.

Acknowledgements

First of all, I want to express my sincere gratitude to Jörg Tiedemann for giving me the opportunity to do this master's thesis as a visiting student at the University of Helsinki. Moreover, the weekly feedback and advice received in the seminars of the Language Technology Group were very helpful.

I am also especially grateful to my supervisor Emily Öhman for her unconditional support throughout the entire project. I really appreciate all the counsel and kindness received during my stay in Helsinki, as well as her willingness to help with any aspect related to the thesis.

I would also like to thank my examiner from Linköping University, associate professor Marco Kuhlmann, for all the guidance received regarding the formalities of the project as well as the constructive feedback on the thesis drafts.

Lastly, I would like to dedicate a few words to Timo Honkela, who sadly passed away during the development of this work. Timo is the reason why I came to Helsinki in the first place, since he happily agreed to collaborate with me despite his illness. The unfortunate circumstances only allowed me to meet him briefly, but it was enough to realize that Timo was an extremely charming and truly inspiring person. May his soul rest in peace.


Contents

1 Introduction
   1.1 Problem definition and motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations
   1.5 Structure of the work

2 Theory
   2.1 Natural Language Processing
       2.1.1 Text classification
       2.1.2 Sentiment analysis
   2.2 Text representation
       2.2.1 Word count vectors
       2.2.2 TF-IDF vectors
       2.2.3 Neural embeddings
   2.3 Models for text classification
       2.3.1 Support Vector Machine
       2.3.2 Convolutional Neural Network
       2.3.3 BERT
   2.4 Evaluation Measures
       2.4.1 Accuracy, precision and recall
       2.4.2 F1 score
       2.4.3 Area Under the Curve

3 Related work

4 Method
   4.1 Hardware and software
   4.2 Data
       4.2.1 OLID
       4.2.2 Class imbalance
       4.2.3 Pre-processing steps
   4.3 Traditional machine learning approach
   4.4 Deep learning approach
   4.5 Transfer learning approach
   4.6 Evaluation

5 Results
   5.1 Traditional machine learning approach
   5.2 Deep learning approach
   5.3 Transfer learning approach
   5.4 Additional experiments

6 Analysis and discussion
   6.1 Results
       6.1.1 Traditional machine learning approach
       6.1.2 Deep learning approach
       6.1.3 Transfer learning approach
       6.1.4 Comparison to state-of-the-art
   6.2 Method
       6.2.1 Self-critical stance
       6.2.2 Replicability, reliability and validity
       6.2.3 Source criticism
   6.3 The work in a wider context

7 Conclusion
   7.1 Summary and critical reflection
   7.2 Future work

Bibliography

1 Introduction

1.1 Problem definition and motivation

This thesis work addresses the problem of offensive language identification (e.g. detrimental, obscene and demeaning messages) in the microblogging sphere, with special focus on five different languages: Arabic, Danish, English, Greek and Turkish.

The number of social media users has reached 3.5 billion in 2020, and growth is not expected to diminish in the following years1. In such an interconnected world, where an average of 6,000 tweets are generated every second2, it seems inevitable that some users promote offensive language, taking advantage of the anonymity provided by social media sites. According to a study from 2014, 67% of social media users had been exposed to online hate and 21% acknowledged having been its target [48]. Consequences of repeated exposure to this material include desensitization to verbal violence and an increase in outgroup prejudice [69].

The proliferation of hateful speech on the internet has not gone unnoticed by those offering social networking services (SNS) [60]. Nowadays, any company hosting user-generated content has the arduous task of penalizing the use of offensive language without compromising the users' right to freedom of speech [70]. The usual approach is to forbid any form of hate speech in their terms of service and censor inappropriate posts that have been reported by users, but companies like Facebook or Twitter have still been criticized for not doing enough3.

The criticism received in recent years has forced SNS providers to take a more active role as moderators, but the vast amount of data generated by online communities forces them to automate the task. The employment of human moderators is no longer an option since it is costly and highly time-consuming. Moreover, the final outcome is inevitably subject to the moderator's notion of offensiveness, even if they have received proper guidelines and training beforehand [78]. In addition, the fact that hate speech spreads faster than regular speech in online channels [41] implies that solutions that respond in a timely fashion are required.

1 https://www.statista.com/topics/1164/social-networks/
2 https://www.internetlivestats.com/twitter-statistics/
3 www.spokesman.com/stories/2016/feb/26/zuckerberg-in-germany-no-place-for-hate-speech-on-/


All this generates a need for an accurate and scalable solution to the problem of offensive language detection using automated methods. As a result, in recent years the Natural Language Processing community has become increasingly interested in this field.

1.2 Aim

The purpose of this thesis project is to investigate and evaluate different solutions to the problem of offensive language detection and categorization on the microblogging site Twitter. The task will be framed as a supervised learning problem, and several configurations of traditional machine learning (Support Vector Machine), deep learning (Convolutional Neural Networks) and transfer learning (BERT) techniques will be tested. The goal is to evaluate the importance of certain features, explore different ways of vectorizing text and experiment with some of the most promising models from the literature.

The final output should be a stand-alone system able to identify offensive tweets with both high precision and recall. The final implementation serves as a proof of concept.

1.3 Research questions

The experiments developed throughout this project aim to answer the following questions:

1. Can sentiment analysis boost the performance of an offensive language classifier?

It is safe to assume that there is a relation between emotion and offensive language in social media, since offensive posts tend to present a negative polarity. A way to study this relation would be to analyse the impact of features that carry some type of sentiment information, or alternatively to incorporate an additional step for polarity classification.

2. Are subword-level approaches better than word-level approaches?

Modern NLP approaches use pre-trained word embeddings to capture the semantics of text in a machine-friendly representation, but the choice of embeddings differs from problem to problem. The unorthodox writing style in online communities gives rise to many out-of-vocabulary words when the noisy text is tokenized into word units. This suggests that subword-level embeddings might perform better than word-level embeddings, since they are capable of handling spelling variations of words. A deep learning model will be fed with different types of embeddings to gain some insight into this topic.

3. How good are multilingual neural language models at cross-lingual model transfer?

Nowadays there are several publicly available pre-trained models that claim to obtain good results in a long list of languages. The multilingual corpus used for training gives rise to a single shared vocabulary that makes these models well suited for zero-shot learning. By fine-tuning a multilingual model in a language other than the one used for testing, it should be possible to gain some understanding of its capacity to generalize information across languages.

4. Is the task of offensive language detection equally challenging in all languages?

The different structural properties of languages at the phonological, grammatical and lexical level may make offensive language easier to detect in some of them. Besides, the wide variety of tools at the disposal of high-resource languages can put less popular ones at a disadvantage. This work will compare the results obtained in five different languages: Arabic, Danish, English, Greek and Turkish.


1.4 Delimitations

A significant delimitation will be the fact that this work is exclusively focused on textual data. Social media posts are quite often accompanied by multimodal information (e.g. images, GIFs, videos, URL links...) which can be crucial to fully understand the underlying message, and therefore ignoring it might deteriorate the final performance of the classifier. Unfortunately, the processing of this type of data is beyond the scope of this project. Information about users (e.g. age, gender, demographics...) will not be considered as it is often unreliable, even though some meta-information has been proven to be predictive in previous work [13].

Moreover, the subjective biases of human annotators might introduce some noisy labels in the training data, since the same post might be considered offensive by some and non-offensive by others. The lack of a standard and universal definition of the term 'offense' adds ambiguity to the labelling task, and even with a clear definition some background information is sometimes required to correctly interpret the message (e.g. the usage of a word can be offensive or not depending on the interlocutors' relationship [34]). A study from 2016 highlighted the differences between amateur and expert annotators when labelling a hate speech dataset, and found that the different labelling criteria are reflected in the final classification results [78]. However, we have no choice but to regard the labels from publicly available datasets as absolute truth, since it is beyond the scope of this thesis to annotate datasets. During the evaluation process it might be interesting to pay special attention to those tweets that are often misclassified, to better understand the limitations of the system, which might be influenced by social biases in the form of noisy labels.

1.5 Structure of the work

The remainder of this work is organized as follows:

Chapter 2 introduces the reader to the topic of offensive language detection, providing all the theoretical background required to fully understand the explanations in later chapters.

Chapter 3 presents a literature review of related work in order to show what has been attempted so far by others and the state of maturity of the field at the time of writing.

Then, the method is described in Chapter 4. All the data, pre-processing steps, feature extraction techniques and classifiers used along the way are explained. This detailed description of the work should allow the reader to replicate the experiments and obtain similar results.

The obtained results are reported in Chapter 5 and later analysed in Chapter 6. The latter also includes a critical discussion of the methodology used and a final section discussing the ethical and societal aspects related to the work.

The thesis is concluded by a summary and an outlook for possible future work in Chapter 7.


2 Theory

This chapter contains the theory relevant to the intended study, introducing the reader to the text classification task and covering the theory behind its main steps.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that deals with the interaction between computers and humans using natural language. Its ultimate goal is to make computers understand human language, which is not an easy task due to the imprecise nature of our languages. It requires not only mastering the syntax of text but also its semantics. Since linguistic rules are hard to hand-code, older rule-based approaches have been replaced by machine learning (ML) algorithms that are able to extract such rules from large amounts of data and derive meaning from them.

Among the applications of NLP are machine translators, spell checkers, virtual assistants, chatbots and interactive voice response applications. This project will focus on the sub-field of text classification, which is defined below.

2.1.1 Text classification

Text classification is a supervised machine learning task that amounts to automatically assigning text documents to pre-defined categories based on their content [2]. Classification can be done into two categories (e.g. spam detection) or more (e.g. categorization of customer queries by type) depending on the application. A text classifier requires a labelled dataset for training, so that the underlying algorithm can learn from the labelled examples before making accurate predictions. However, before feeding examples to the system, it is necessary to transform the raw text into a form that can be interpreted by the machine. After this preliminary step, called feature extraction, the model can start the learning process, aiming to recognize patterns that will later be used to make the classification decisions.

The problem of offensive language detection is nothing more than a specific application of text classification, with some peculiarities that make it especially challenging [64].


2.1.2 Sentiment analysis

Sentiment analysis, also known as opinion mining in academia, is an application of text classification and one of the most active research areas in the fields of computational linguistics and NLP [39]. It is basically the automated process of analysing the sentiment that lies underneath a series of words in order to classify a text as either positive, negative or neutral, although a more fine-grained classification is also possible [57]. It is important to notice the difference between sentiments, which are limited to a single dimension (polarity), and emotions, which also capture intensity and thus offer a more detailed level of analysis [19].

This field of study has become very active in the last 20 years, mainly because the emotional content of text is, in general, an important part of language. In the case of social media, emojis are widely used to express feelings, which is why they are considered a meaningful feature for sentiment analysis tasks [31]. When detecting offensive language on Twitter, taking the sentiment of tweets into consideration is expected to improve the classification results because of the relation between offensive language and sentiment. Common sense suggests that offensive posts should carry a more negative sentiment, as they tend to contain strong emotions like anger or frustration.
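
To make this concrete, the polarity of a tweet can be estimated with an off-the-shelf lexicon-based tool. The following is a minimal sketch using NLTK's VADER analyser, shown purely as an illustration of sentiment scoring; it is not necessarily the sentiment component used in the experiments of this thesis.

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")            # one-time download of the VADER lexicon
    analyzer = SentimentIntensityAnalyzer()

    # The compound score ranges from -1 (most negative) to +1 (most positive).
    print(analyzer.polarity_scores("I hate you, you are pathetic"))
    print(analyzer.polarity_scores("Have a wonderful day!"))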

2.2 Text representation

Since machines cannot understand words as humans do, it is necessary to convert our natural language to a machine-friendly representation. This section describes some of the most common methods that can be used to convert plain text to machine-readable information. They all map symbolic representations to a vectorized form that can be fed to a neural network.

2.2.1 Word count vectors

Bag-of-words (BoW) is arguably the most basic method when it comes to converting textual data to numeric representations, but it can still lead to statistically significant results. It consists of generating a vector of word counts for each document (i.e. tweet) and storing them in a matrix where each column is a word and each row represents a document. The word counts stored in its cells are normalized before being fed to a neural network as features.

Among BoW's limitations are the data sparsity problems in short texts, the high dimensionality of the encoded vectors and the fact that similar entities are not placed closer to each other in the embedding space. Moreover, the ordering of words is ignored. Nonetheless, BoW is still a good option to build an inexpensive baseline model. It is also suited for very specific cases, such as small datasets of highly domain-specific data where most of the words would not appear in the dictionary of a pre-trained word embedding model.
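
As an illustration, the sketch below builds such a document-term matrix with scikit-learn's CountVectorizer (scikit-learn is among the libraries listed in the method chapter); the three toy tweets are invented.

    from sklearn.feature_extraction.text import CountVectorizer

    tweets = ["you are an idiot", "have a nice day", "what an idiot"]   # toy documents
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(tweets)          # sparse matrix: rows = tweets, columns = words

    print(vectorizer.get_feature_names_out())     # vocabulary (get_feature_names() on older versions)
    print(X.toarray())                            # raw word counts per tweet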

2.2.2 TF-IDF vectors

TF-IDF, which stands for Term Frequency - Inverse Document Frequency, is another way of converting text documents to matrix representations [63]. The TF-IDF model relies on a sparse vector representation which is quite powerful despite its simplicity.

Term frequency (TF) represents the number of occurrences of a word in a document, and is simply computed as the number of times a term t appears in a document d (n_{t,d}) divided by the total number of terms in the document (N). So, the term frequency of a term t in a document d (tf_{t,d}) is calculated as:

tf_{t,d} = \frac{n_{t,d}}{N}


On the other hand, inverse document frequency (IDF) represents the importance of each term and is computed as follows:

idf_t = \log \frac{N}{n_t}

In this case, the numerator consists of the total number of documents (N) while the denominator is the number of documents that contain the term t (n_t). The final TF-IDF value of each term and document is the result of combining the aforementioned formulas as follows:

tfidf_{t,d} = tf_{t,d} \cdot idf_t

So, instead of simply measuring the frequency of words as the BoW model does, in TF-IDF each word is given a weight that represents its importance to the document it belongs to. This is done by counting the number of occurrences of the word not only in a single document but also in the entire corpus. In the formula above, the term tf_{t,d} assigns high weights to terms that appear repeatedly in a document because they are supposed to be good representatives of it. In a similar way, idf_t assigns low weights to words that are present in many documents and high weights to words that appear in fewer documents. This is based on the intuition that distinctive words that appear repeatedly in a limited number of documents are the ones that best represent those documents, and therefore they should be given more importance.
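
A minimal sketch of the same idea with scikit-learn's TfidfVectorizer is shown below; note that scikit-learn applies a smoothed IDF and L2 row normalisation by default, so the values differ slightly from the plain formulas above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["you are an idiot", "have a nice day", "what an idiot"]   # toy documents
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(tweets)

    # Distinctive words receive higher weights than words spread across many documents.
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))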

2.2.3 Neural embeddings

Neural embeddings are one of the most popular NLP techniques nowadays. They provide a way to capture the semantics of text in low-dimensional distributed representations, considering not only the words themselves but also their context and the relationships between them. This makes it possible to reflect a word's meaning in its embedding.

By mapping these numerical representations of words into a vector space it is possible to visualize how words with similar meaning (and thus used in similar contexts) occupy close spatial positions and vice versa. One of the main benefits of this is that a model can react naturally to a previously unseen word if it has seen semantically similar words during training.

Despite being more memory-intensive, on many occasions it is worth using word embeddings as they are more informative than a simple BoW or TF-IDF matrix. Moreover, the fact that they have already been pre-trained on a large corpus allows practitioners to directly use the publicly available dictionaries, saving the time and resources that it would take to train such a model from scratch.

Word2Vec [43] is a predictive embedding model that was published by Google researchers in 2013 but is still in wide use to this day. The algorithm outputs a vector space given a corpus of textual data, capturing the semantic relationships in n-dimensional vectors. It is available in two different architectures: Continuous Bag-of-words (CBOW) and Continuous Skip-gram. Neither of them takes into account the order of context words, which is one of Word2Vec's greatest weaknesses in comparison to models that were published later.

GloVe [53] takes a similar approach to Word2Vec to generate dense vector representations, but instead of extracting meaning using skip-gram or CBOW it is trained on global co-occurrence counts of words. This is based on the idea that some words are more likely to occur alongside certain other words, and thus it makes sense to consider the entire corpus when generating the embedding of a word. This allows GloVe to take global context into account, unlike Word2Vec, which is exclusively focused on local context.
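
As a small illustration of these dense representations, the sketch below queries a set of pre-trained GloVe vectors through gensim's downloader API; the model name "glove-twitter-25" is one of the publicly distributed vector sets and is chosen here only as an example.

    import gensim.downloader as api

    # Downloads a small set of GloVe vectors trained on Twitter data on first use.
    vectors = api.load("glove-twitter-25")

    print(vectors["awful"][:5])                     # first dimensions of one embedding
    print(vectors.most_similar("awful", topn=3))    # semantically close words lie nearby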


Both word embeddings (Word2Vec and GloVe) achieve similar empirical results, and they suffer from the same problems: the inability to deal with unknown words and the fact that multi-sense words are always encoded in the same way regardless of their context. The first of those problems can be solved by splitting words into a bag of character n-grams to be encoded, as subword-level embeddings like fastText do.

fastText [28] is a model developed by the Facebook AI Research group that goes one step further by taking into account the morphology of words, which enables the embeddings to encode subword information. Unlike word-level embeddings, in this case a word's vector is constructed from its character n-grams, allowing fastText to handle previously unseen words reasonably well. It is important to note that, even if it considers the internal structure of words by splitting them, in the end it still generates a single vector per word. It was used to train 300-dimensional word vectors in 157 languages, which are publicly available on the official website (https://fasttext.cc/docs/en/crawl-vectors.html).
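
The sketch below illustrates this subword behaviour with the official fasttext Python package, assuming the English vectors (cc.en.300.bin) have been downloaded from the website referenced above; the misspelled word is invented.

    import fasttext

    # Assumes cc.en.300.bin has been downloaded locally from fasttext.cc.
    model = fasttext.load_model("cc.en.300.bin")

    # Because vectors are built from character n-grams, even an obfuscated or
    # misspelled word that never occurred in the training corpus gets an embedding.
    vector = model.get_word_vector("idiiooot")
    print(vector.shape)                             # (300,)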

The common problem with all the aforementioned embeddings is that they are context-insensitive, which is a clear deficiency since words can have different connotations depending on their surrounding words. In 2018, AllenAI introduced contextual word embeddings to the world with ELMo [54]. The context-sensitive representations learned from language models overcame the main limitations of the static embeddings presented before. With ELMo, every vector assigned to a token is sentence-dependent, meaning that the local context is reflected in the instance embeddings by taking the entire input sentence into consideration for every word in it. As a result, even polysemous words (which represent over 40% of the English dictionary [18] [36]) are represented by different vectors according to their context. This translates into a performance boost for downstream tasks, which is why contextual embeddings have become so popular in recent years. As a matter of fact, popular models like Google's BERT [15], OpenAI's GPT-2 [59] and FacebookAI's XLM [35] use contextualized word embeddings to obtain more accurate representations.

It is important to notice that all the embeddings discussed in this section have different characteristics that make them more or less suited to a specific problem. Choosing the best embedding requires a trial-and-error approach, as there is no single model that always works best.

2.3 Models for text classification

This subsection presents the theory behind the different models that will be used in the practical part of this project, which includes Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and the Bidirectional Encoder Representations from Transformers (BERT).

2.3.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that was first introduced in 1992 [7] and is still widely used today. It is mostly used for classification tasks, although it can also be applied to regression problems. In general, it offers higher accuracy and robustness than other classifiers like Naïve Bayes, logistic regression or decision trees.

The main idea behind SVM is to segregate a given dataset by finding a hyperplane that separates all classes in the best possible way. To achieve that, the decision boundary must be as far as possible from any point in the labelled training set. Then, new data points are attributed to one class or another depending on which side of the hyperplane they fall. This is illustrated in Figure 2.1, where the hyperplane is actually a line since the feature space represented in the image has only two dimensions.



Figure 2.1: Example of decision line in a two-dimensional space. Source: [4]

In order to find the optimal hyperplane, its perpendicular distance to the support vectors (those data points of each class that are closest to the frontier) must be maximized. This distance is known as the margin, and it is maximized by minimizing a hinge loss function in an iterative manner. For this, only the support vectors are taken into consideration, meaning that the removal of other data points does not affect the outcome of the algorithm.

The maximum-margin hyperplane separates classes in a high-dimensional space with as many dimensions as input features. In the case of nonlinear input spaces, where classes cannot be separated by a linear decision boundary, SVM uses the so-called kernel trick to convert the input space to a higher-dimensional space where it is possible to accurately segregate the points with a simple linear boundary. Different types of kernels can be used to transform the feature space. For textual data linear kernels are supposed to work best, but polynomial kernels and radial basis function (RBF) kernels are also available.

Some of the main advantages of SVM classifiers are their effectiveness in high-dimensional spaces and their good performance when there is a clear margin of separation between classes. Furthermore, they are memory-efficient as only a subset of the training points (the support vectors) is used in the decision phase. However, they require a long training time for large datasets and perform poorly when the target classes overlap.

A popular paper published in 1998 by T. Joachims studies the application of SVM to text categorization [27]. The theoretical and empirical evidence provided by the author shows that SVMs are a robust method for learning text classifiers from labelled examples.
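
A minimal sketch of such a text classifier, combining the TF-IDF representation from Section 2.2.2 with a linear SVM in scikit-learn, is shown below; the tweets, labels and hyperparameters are purely illustrative, not those used in the experiments.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy data: 1 = offensive (OFF), 0 = not offensive (NOT).
    tweets = ["you are an idiot", "have a nice day", "what an idiot", "good morning"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(tweets, labels)

    print(clf.predict(["such an idiot"]))           # expected: [1]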

2.3.2 Convolutional Neural Network

A Convolutional Neural Network (CNN or ConvNet) is a type of deep neural network that was originally built for image analysis but was later found to be quite effective for NLP problems as well. As its name suggests, the main difference with regular multilayer perceptron networks (MLP) is that some of the CNN hidden layers (known as convolutional layers) perform convolution operations instead of matrix multiplication. This is precisely what makes them so good at pattern recognition in images, since the convolutional filters convolve across the input pixels and are able to detect edges, shapes, textures and even specific objects in deeper layers of the network. In the case of text, the network deals with word embeddings instead of pixel matrices, which increases the size of the feature space from three channels (in the case of RGB images) to as many as the length of the word embeddings.


Figure 2.2 illustrates a particular example of how a CNN processes textual data. It takes as input a matrix of embeddings and performs element-wise products with the elements of different filters that slide over the sentence matrix. The filters act as n-gram feature extractors for embeddings, covering 2-grams, 3-grams and 4-grams in the toy example below. Then the feature maps are pooled (i.e. 1-max pooling) and concatenated to form a feature vector that is fed to a fully connected dense layer that performs the classification. Typically sigmoid is used for binary classification and softmax for multi-class classification. It is also common to apply non-linear activation functions like ReLU or tanh before the pooling step.

Figure 2.2: Toy example with a 7-token sentence and 5-dimensional embeddings. Source: [91]
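
A minimal Keras sketch of this kind of architecture (parallel convolutions with different filter widths, 1-max pooling, concatenation and a sigmoid output) is given below; the vocabulary size, sequence length and number of filters are illustrative placeholders, not the values used in the experiments.

    from tensorflow.keras import layers, Model

    vocab_size, embed_dim, max_len = 20000, 300, 50     # illustrative hyperparameters

    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)

    # One convolutional branch per filter width, each followed by 1-max pooling.
    branches = []
    for width in (2, 3, 4):
        conv = layers.Conv1D(filters=100, kernel_size=width, activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(conv))

    merged = layers.Concatenate()(branches)
    outputs = layers.Dense(1, activation="sigmoid")(merged)   # binary: OFF vs NOT

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()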

One of the reasons why CNNs perform well at text classification is that their convolutional and pooling layers allow them to detect salient features regardless of their position in the input text [24]. Moreover, the non-linearity of the network and its ability to model local ordering is supposed to lead to superior results. In the context of hate speech, Gambäck and Sikdar trained four CNN models to classify tweets as sexist, racist, both or neither [22]. Their best model (78.3% F-score) employed CNNs with Word2Vec embeddings, but they also experimented with models trained on character 4-grams, randomly generated word vectors and a combination of word vectors and character n-grams. Other CNN-related works in the field include [5] and [50].

2.3.3 BERT

The Bidirectional Encoder Representations from Transformers (BERT) is a deep pre-trained language model that was open-sourced by Google in late 2018 [15]. The bidirectional model soon became very popular due to its state-of-the-art results in several NLP downstream tasks, such as text classification, language inference, entity recognition, paraphrase detection, semantic similarity and question answering.


These outstanding results were possible because, unlike prior models, BERT takes full advantage of the bidirectional information of text sequences. This is illustrated by Figure 2.3, which shows how GPT connections only go from left to right while ELMo generates features by concatenating left-to-right and right-to-left LSTMs. In addition, word segmentation is performed using the WordPiece algorithm [65], which initialises a vocabulary with single characters and iteratively adds the most frequent combinations of symbols.

Figure 2.3: BERT, GPT and ELMo pre-training model architectures. Source: [15]

The other reason for BERT's success has to do with its novel training method. BERT was pre-trained on a vast amount of text data (Wikipedia and the BookCorpus dataset [93]) using two language-based tasks:

• Masked Language Modelling: Fifteen percent of the words in the training corpora are randomly masked and the network is trained to predict them.

• Next Sentence Prediction: The network is trained to predict if two given sentences are coherent together. Consecutive sentences from the training corpora are used as positive examples and randomly selected sentences as negative examples.

After pre-training, BERT can either be used as a high-quality feature extractor, keeping the learned weights fixed, or alternatively be fine-tuned with a relatively small amount of task-specific data. This is possible because the pre-trained weights already contain a lot of information, which greatly reduces the amount of data and training time required to obtain good results.
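
The sketch below shows what this looks like with the Hugging Face transformers library used in the method chapter, loading Google's multilingual checkpoint with a two-label classification head. It is a minimal illustration (a single toy batch, no training loop), not the exact fine-tuning setup of this thesis.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    # Multilingual BERT checkpoint released by Google, with a 2-label head (OFF vs NOT).
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    batch = tokenizer(["you are an idiot"], padding=True, truncation=True,
                      return_tensors="pt")
    labels = torch.tensor([1])                       # toy gold label

    outputs = model(**batch, labels=labels)          # recent transformers versions
    print(outputs.loss, outputs.logits)              # fine-tuning would backpropagate this loss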

Depending on the training corpora, there are two different models that are worth mentioning:

• English BERT: Pre-trained exclusively on monolingual English data, giving rise to an English-derived vocabulary. Google released another language-specific model for Chinese, and similar models in other languages have been produced by third parties.

• Multilingual BERT: Pre-trained on monolingual corpora in 104 languages, giving rise to a single WordPiece vocabulary that allows the model to share embeddings across languages.

In terms of architecture, BERT is composed of a stack of transformer blocks that act as encoders. It is based on the transformer network [75], where the so-called transformers use attention mechanisms to assign different weights to parts of the input based on their significance. Attention makes the network focus on specific data points, which is especially useful for learning about the context of words. This context is encoded in the WordPiece embeddings, which are passed from one layer to the next, capturing meaning at each stage. The number of self-attention layers is 12 for the Base model and 24 for the Large model, and the length of the embeddings is 768 and 1024 for the Base and Large versions, respectively.


2.4 Evaluation Measures

The aim of this section is to introduce some of the metrics that can be used to evaluate the effectiveness of supervised classifiers. In general, all these metrics provide a performance score that simplifies model comparison and selection during training. Basically, the model that obtains a higher score on previously unseen data is considered to be the best at generalizing and should therefore be selected as the most reliable.

Since each metric evaluates different characteristics of the classifier, it is important to select an appropriate one for a proper comparison of machine learning algorithms. Otherwise a suboptimal solution might be selected. In general, the values used to compute these measures are obtained from the confusion matrix (see Table 2.1 below), which displays the correctness of the model in a very intuitive way.

                            Actual Positive Class    Actual Negative Class
Predicted Positive Class    True Positive (TP)       False Positive (FP)
Predicted Negative Class    False Negative (FN)      True Negative (TN)

Table 2.1: Confusion Matrix

2.4.1 Accuracy, precision and recall

A very commonly used measure is accuracy, which provides the ratio of correct predictions over the total number of examined cases:

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}

However, its bias towards the majority class makes it only suitable for classification problems where the target classes are roughly balanced. This is not the case for offensive language detection: despite the prevalence of offensive comments on social media, they are far from being the majority.

Another option is to use precision, which measures the fraction of predicted positive samples that are actually positive. It is a good choice for problems where it is important to keep the number of false positives down (e.g. spam filtering), and is computed as follows:

Precision = \frac{TP}{TP + FP}

On the other hand, recall measures the fraction of positive samples that are correctly classified. This makes it appropriate for problems where it is important to capture as many positives as possible (e.g. cancer detection). The formula for recall is displayed below:

Recall = \frac{TP}{TP + FN}

The problem is that in many applications, such as the one under study, the model is expected to be both precise (high precision) and robust (high recall). In these cases it is necessary to take both measures into account, or a combination of them for a more intuitive interpretation.

11

2.4. Evaluation Measures

2.4.2 F1 score

The F1 score is a value between 0 and 1 computed as the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}

Nonetheless, in some situations it might be interesting to assign different weights to precision and recall based on domain knowledge. The reason is that in many problems precision and recall are not equally important, and thus they should be weighted differently. The implications of each prediction error must be reflected in the cost of false negatives and false positives.

It is also important to notice that the formulas presented above are only applicable to binary classification problems. When there are more than two categories, it is possible to define per-class values for precision, recall and F1 score. Then, they can be combined in different ways to obtain the overall precision, recall and F1 scores, as illustrated after the list below:

• Macro-averaged score: The arithmetic mean of the per-class scores, giving equal weight to each class. In other words, it does not take class imbalance into account.

• Weighted-average score: The weighted average of the per-class scores, giving more importance to the over-represented classes. In this case the class imbalance is taken into account.

• Micro-averaged score: For precision and recall, it is computed using the regular formulas but considering the total counts of true positives, false positives and false negatives. The micro-averaged F1 score is simply the harmonic mean of the micro-averaged precision and the micro-averaged recall. The resulting values are biased by class frequency.
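
The sketch below computes the three averaging schemes with scikit-learn on an invented, imbalanced three-class example.

    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 0, 1, 1, 2]     # toy gold labels, class 0 over-represented
    y_pred = [0, 0, 1, 0, 1, 2, 2]

    print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
    print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class frequency
    print(f1_score(y_true, y_pred, average="micro"))     # from global TP/FP/FN counts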

2.4.3 Area Under the Curve

In the literature there is also an assortment of publications that evaluate their systems with the area under the receiver operating characteristic (ROC) curve, better known as AUROC or, in the general case, AUC. The ROC curve displays the performance of a model at every classification threshold. This is achieved by plotting sensitivity (the true positive rate) against 1 - specificity (the false positive rate) at each threshold, where:

Sensitivity = \frac{TP}{TP + FN}

Specificity = \frac{TN}{TN + FP}

The AUC measures the area beneath the ROC curve, providing a score between 0 and 1 that represents the probability of the model ranking a randomly selected positive example higher than a randomly selected negative example. This means that the higher the better, similarly to the other measures described in this section. The AUC is an effective measure for binary classification problems where both classes are equally important. However, for problems with high class imbalance it is better to compute the area under the precision-recall curve (PRC) to give special focus to the minority class. Both measures (ROC-AUC and PRC-AUC) are not as well suited for comparing classifiers as they are for evaluating a single classifier.
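
Both summary scores can be obtained directly from predicted probabilities, as in the short scikit-learn sketch below; the labels and scores are invented, and average_precision_score is used here as the usual summary of the precision-recall curve.

    from sklearn.metrics import roc_auc_score, average_precision_score

    y_true   = [0, 0, 0, 1, 1]               # toy binary labels
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.7]    # predicted probability of the positive class

    print(roc_auc_score(y_true, y_scores))             # area under the ROC curve
    print(average_precision_score(y_true, y_scores))   # summary of the precision-recall curve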


3 Related work

One of the first works related to the subject under study was published in 2009, proposing a supervised classification technique for harassment detection in chat rooms and discussion forums [85]. The early experiments showed that the performance of a harassment classifier can be improved by adding sentiment and contextual features to the model, something that was utilised in some of the works that followed.

A few years later, Warner and Hirschberg published a paper where they provided their own definition of hate speech, with a special focus on anti-Semitism [77]. They realized that, depending on the target group, hate speech is characterized by different high-frequency stereotypical words that can be used either in a positive or negative sense. This makes the hate speech problem very similar to that of Word Sense Disambiguation (WSD) [84], which is why the authors used WSD techniques to generate features such as the polarity of words. After collecting and labelling their own corpus, they trained a Support Vector Machine (SVM) classifier that achieved an accuracy of 94% and an F1 score of 63.75%. The most indicative features of their model were single words.

The aforementioned work motivated another paper in 2013, which highlighted the importance of context for identifying anti-black racism [34]. According to the publication, 86% of racist tweets were categorized as such simply because they contained some kind of offensive words. This is why Kwok and Wang decided to use a Bag-of-words (BoW) model, which proved to be insufficient because unigram features alone are not able to capture the relationships between words. This leads to a high number of misclassified tweets that contain terms that are likely to appear in racist posts, even if these words are not racist at all in many contexts (e.g. the words 'black' or 'white'). Their Naïve Bayes classifier achieved an accuracy of 76% on a balanced dataset, a percentage that the authors believe could be improved by adding bi-grams, WSD and sentiment analysis to the algorithm.

At this early stage of the offensive language detection field, the use of predefined blacklists was a common approach [23]. The main problem of this old-fashioned approach is that tweets with offensive words are easily misclassified as hateful, which leads to a high false positive rate caused by the prevalence of curse words on social media [76]. Moreover, list-based methods struggle to detect offensive posts that contain no blacklisted terms.


This shows that the classification task requires some deeper understanding, because even blacklisted terms might not be offensive in the right context, as already noted by Warner and Hirschberg [77]. For instance, the word 'nigga' can be used in a friendly way among African Americans and at the same time be considered offensive in other situations. Furthermore, lexicon-based methods struggle to accurately detect obfuscated profanity (e.g. 'a$$hole'), as it is infeasible to add all possible variants of slurs to the dictionary, which is why it is a good idea to incorporate edit distance metrics into the system [68].

In order to overcome the limitations of list-based methods, Chen et al. [10] used character n-grams in combination with other lexical and dependency parse features, along with automatically derived blacklists. Their main contribution was the so-called Lexical Syntactic Feature-based (LSF) language model, implemented as a client-side application that filters out inappropriate material to protect adolescent social media users. The proposed method, trained on YouTube comments from over 2M users, achieved a 96.25% F1 score in sentence-level offense detection, outperforming all the n-gram models that were used as baselines. Unlike most contemporary works, their tool takes into account the author of the content (i.e. writing style, posting patterns...) to identify not only offensive content but also potentially offensive users. In the task of user-level detection, the LSF framework achieved a 77.85% F1 score.

Other researchers also realized the limitations of BoW-based representations, such as the fact that they do not take syntax and semantics into account, or the high dimensionality and large sparsity problems caused by obfuscations. In order to address these issues, Djuric et al. proposed a paragraph2vec [37] approach that learns distributed low-dimensional representations of user comments using neural language models [6]. Feeding these representations to a logistic regression classifier and evaluating it on the largest hate speech dataset available at the time, the results showed that the proposed method was not only more efficient but also better than BoW models in terms of AUC scores [16].

One year later, researchers from Yahoo Labs implemented a supervised classification method that obtained better results when evaluated on the exact same dataset [46]. Nobata et al. claim to have used a more sophisticated technique to learn the low-dimensional representation of comments, as well as some additional features. They experimented with a wide range of NLP features (n-grams, linguistic features, syntactic features and distributional semantics features) and, after evaluating the impact of each individual feature, obtained promising results by combining all of them. However, token and character n-grams alone produced similar results, even outperforming those of Djuric et al. [16]. Apart from that, the authors made available a dataset of thousands of comments from Yahoo! users that were labelled as hate speech, derogatory language, profanity or none of them.

Other researchers also made the effort of labelling entire datasets and made them public afterwards, so that the community could have shared data on which to objectively compare results. A good example is the work of Waseem and Hovy [80], who annotated an unbalanced corpus of over 16,000 tweets as either racist, sexist or normal. In addition, they provide a dictionary with the most indicative words of the dataset and a bullet list for hate speech identification that can be used by others to gather more data. They also studied the impact of combining character n-grams with extra-linguistic features and found that gender is the only demographic information that significantly improves performance. Other researchers also took advantage of gender information to improve classification [13], while being aware that this type of user-related information is often unreliable or even unavailable on social media. Most of the features used throughout the years for offensive language detection are covered in the survey carried out by Schmidt and Wiegand in 2017, where several state-of-the-art models were analysed with special focus on feature extraction [64].


More modern approaches were introduced in 2017, when Badjatiya et al. became the first to use deep neural network architectures for hate speech detection [5]. Their proposed solution outperformed existing methods by 18 F1 points when evaluated on the dataset provided by [80]. The authors trained several classifiers using task-specific embeddings learned with CNNs, LSTMs and fastText [28], and obtained the best results when combining these embeddings with Gradient Boosted Decision Trees. Interestingly, their best system randomly initialized the embeddings instead of using pre-trained GloVe word embeddings [53].

Another line of research put special effort into differentiating between hate speech and other instances of offensive language, which are very often conflated in the literature. It is important to discern them since the former is considered a much more serious infraction that can even have legal implications, and thus should not be confused with ordinary offensive posts. In 2017, Davidson et al. retrieved 24,802 tweets containing words compiled by Hatebase.org (https://hatebase.org/) and labelled them as hate speech, offensive language or neither [14].

They soon realized that the Hatebase lexicon is not accurate enough, since only 5% of the tweets were labelled as hate speech by their annotators, which is why the authors provide a reduced version that is supposed to have higher precision. Their multi-class classifier obtained an F1 score of 90%, suggesting that fine-grained labels are better for hate speech detection. However, the confusion matrix showed that nearly 40% of hate speech was misclassified and that the model was biased towards the 'neither' class. Their conclusions go further by stating that racist and homophobic tweets are more likely to be correctly classified as hate speech, while sexist tweets are often classified as offensive.

With regard to classifiers, some of the algorithms that can be found in the literature are Random Forest [8], Logistic Regression [14] and Support Vector Machine [40], as well as deep learning approaches like Convolutional Neural Networks [22] or Convolutional-GRU [92]. However, in 2018 the introduction of deep pre-trained language models like ELMo [54], ULMFiT [25], Open-GPT [59] and BERT [15] triggered a shift in the approaches taken in the field. These novel models obtained state-of-the-art results in several NLP downstream tasks, text classification being one of them. In particular, BERT [15] stood above the rest for being deeply bidirectional and using the novel self-attention layers from the transformer model [75], which allow it to better interpret a word's context. Moreover, it uses WordPiece embeddings [65] instead of the common character- or word-based approaches, and it is trained with a self-supervised objective. The bidirectional model can be conveniently fine-tuned with a small amount of task-specific data and offer excellent performance. The results published in the shared task OffensEval 2019 [88] showed that BERT is well suited for the offensive language detection task, since it yielded successful results for the teams that used it. In fact, six of the top-10 ranked teams in the offensive language identification task used Google's model for their submissions.

Apart from OffensEval, which has already been held twice, other shared tasks like GermEval [81] and TRAC-1 [32] are worth mentioning. Also, workshops dealing with offensive language, such as TRAC [33], TA-COS [38] or ALW1 [79], have become more prevalent in recent years.

As for languages, as is usually the case, most of the work that can be found in the literature focuses on English. However, some researchers have also investigated less popular languages such as Greek [51], Arabic [44], Slovene [21] and Chinese [72]. One of the reasons for this shortage of non-English work might be that most of the publicly available datasets are currently in English [80] [46] [14], but this might change in the coming years thanks to the emergence of multilingual shared tasks like OffensEval 2020 [89].



4 Method

4.1 Hardware and software

All the implementations were performed on a 64-bit Windows 10 machine with 16GB of RAM and 2 CPU cores. However, the experiments were run either on GPUs from Google Colaboratory1 or on Puhti2, a supercomputer from the Finnish IT Center for Science (CSC). The use of this high-performance hardware greatly reduced the training time.

As for the software, all code was written in Python 3.6. This programming language offers a wide range of open-source libraries for scientific computing (e.g. numpy), data manipulation (e.g. pandas), data visualization (e.g. matplotlib, seaborn) and natural language processing (e.g. nltk, gensim), among others.

Apart from the aforementioned libraries, the following were used in different situations:

• The implementations of Naïve Bayes, Support Vector Machine and Random Forest were done with the popular ML library scikit-learn [52].

• The Convolutional Neural Network was implemented in Keras, a high-level library that runs on top of TensorFlow [1].

• To fine-tune the different BERT models, the transformers package from Hugging Face (PyTorch) was used [82].

• BERT models further pre-trained on monolingual data in Arabic3, Danish4, Greek5 and Turkish6 were obtained from publicly available repositories.

1 https://colab.research.google.com
2 https://docs.csc.fi/computing/system/
3 https://github.com/alisafaya/Arabic-BERT
4 https://github.com/botxo/nordic_bert
5 https://github.com/nlpaueb/greek-bert
6 https://github.com/stefan-it/turkish-bert


4.2 Data

4.2.1 OLID

The main data used for this project is the so-called OLID dataset, which stands for Offensive Language Identification Dataset [87]. It was originally provided by the organizers of the OffensEval shared task [88] [89], which consists of the following sub-tasks:

A) Offensive Language Identification: whether a tweet is offensive or not.

B) Categorization of Offense Types: whether an offensive tweet is targeted or untargeted.

C) Offense Target Identification: whether a targeted offensive tweet is directed towards an individual, a group or some other kind of entity.

The different sub-tasks all share the same dataset, which was annotated according to a three-level hierarchical model, so that each sub-task could use a subset of the previous sub-task's data. First, all tweets were labelled as either offensive (OFF) or not offensive (NOT). Then, for sub-task B, all the offensive tweets were labelled as targeted (TIN) or untargeted insults (UNT). Finally, for the last sub-task, the third level of the hierarchy labelled targeted insults based on the recipient of the offense: an individual (IND), a group (GRP) or a different kind of entity (OTH). To illustrate this, Table 4.1 shows the label distribution of the English dataset from the 2019 edition.

A    B    C      Training   Test    Total
OFF  TIN  IND       2,407    100    2,507
OFF  TIN  GRP       1,074     78    1,152
OFF  TIN  OTH         395     35      430
OFF  UNT  -           524     27      551
NOT  -    -         8,840    620    9,460
All                13,240    860   14,100

Table 4.1: Distribution of label combinations in OLID.

All tweets were retrieved with the Twitter Search API and labelled through a crowdsourcing campaign that followed the steps described in the dataset description paper [87]. As explained in that paper, each tweet was manually labelled by at least two human annotators, with a third added in those cases where a majority vote was needed to resolve a disagreement. The criterion used by the annotators was to label tweets according to the following definition of the OFF label:

posts containing any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This includes insults, threats, and posts containing profane language or swear words.

The corpus from OffensEval 2019 contains 14,100 English tweets (32.90% belonging to the OFF class), of which 13,240 originally belonged to the training set and the remaining 860 to the test set. In 2020, unlike in the previous year, the labels of the English tweets were generated by unsupervised learning methods instead of human annotators [61]. Thanks to this it was possible to collect over 9 million tweets, each of them associated with two values: the confidence that a specific instance belongs to a specific class and its standard deviation. However, this dataset will not be used for the main experiments, since the unsupervised learning models used to generate it relied on tweets from the OLID dataset as part of their training data, which means that using both sets may lead to overfitted results.

In addition to the English tweets, OffensEval 2020 provided a multilingual dataset for sub-task A composed of Arabic [45], Danish [67], Greek [56] and Turkish [11] tweets. Table 4.2 shows the sizes and offensiveness percentages of the different datasets. Since the number of instances varies considerably (e.g. the Danish dataset is roughly ten times smaller than the Turkish dataset), some experiments only use subsets of certain datasets so that results can be compared objectively. Otherwise the outcome may be conditioned by the amount of training data, making comparisons difficult when drawing conclusions.

Language    NOT     OFF     Total    %OFF
Danish     2,864     425     3,289   12.92%
Arabic     8,009   1,991    10,000   19.91%
Greek      7,559   2,728    10,287   26.52%
Turkish   28,437   6,847    35,284   19.41%

Table 4.2: Multilingual datasets from OffensEval 2020.

Even though the annotation guidelines from the task organizers were clear, it is assumed that there might be some noise in the gold standard. Table 4.3 shows a few examples extracted from the OLID dataset whose labels could be questioned. Still, no corrections have been performed on the annotations, as this is out of the scope of this project and such a task would be highly subjective.

Tweet                                                             A    B     C
@USER Great news! Old moonbeam Just went into a coma!             NOT  NULL  NULL
@USER Yep Antifa are literally Hitler.                            NOT  NULL  NULL
@USER Ouch!                                                       OFF  UNT   NULL
@USER She is drinking so much koolaid she's bloated.              OFF  TIN   IND
@USER @USER gun control! That is all these kids are asking for!   OFF  TIN   OTH

Table 4.3: Controversial examples from the OLID dataset.

The literature overview from Chapter 2 showed that there are other publicly available datasets. However, being all in the same language (English), they do not allow performance comparisons for multilingual models, which is one of the interests of this work. It would be interesting to enrich the training data by combining several of those datasets, but one limitation might be their differing classification criteria. For instance, some studies focus on abusive language [16] instead of offensive language, others include specific types of hate speech like racism or sexism [80], and others differentiate between hate speech, derogatory language and profanity [46].

4.2.2 Class imbalance

The datasets used for offensive language detection tasks often suffer from class imbalance, meaning that their classes are not equally represented. This imbalance is intended to realistically reflect the content available on social networks, but at the same time it adds a level of difficulty to the classification task, because machine learning algorithms are much more likely to assign new observations to the majority class.

As an example, the OLID dataset is slightly imbalanced at the first level, more imbalanced at the second and highly imbalanced at the third. The reason is that most offenses are targeted and, when targeted, they are almost always directed at a group or individual. This means that classifiers will be reluctant to assign new examples to the OTH class, which severely hurts the final F1 score.

For a classifier to perform well, it is necessary to address the class imbalance problem; otherwise fewer instances from the minority class will be correctly classified. There are several techniques that can be used:

• Obtain more instances of the poorly represented classes. This is not straightforward, since the reason why a class is poorly represented is precisely that it is not so common in the real world.

• Delete instances from the majority class (under-sampling). This approach only makes sense if a large dataset is available; otherwise the remaining data may not be enough for proper training. On the other hand, if there is too much data this approach can solve memory problems and reduce the total runtime. However, there is always the risk that useful information is discarded (e.g. for rule-based classifiers).

• Add copies of instances from the minority class (over-sampling). Unlike under-sampling, this is usually done when there is a shortage of data and losing training instances is not affordable. The drawback in this case is that the algorithm is then more likely to overfit.

• Generate synthetic samples using systematic algorithms like the Synthetic Minority Over-sampling Technique (SMOTE) [9]. As its name indicates, SMOTE is an over-sampling technique that generates new synthetic data for the minority class. This is done by considering the k nearest neighbours of the minority instances and constructing feature-space vectors between them. The main advantage with respect to over-sampling by repetition is that SMOTE is less prone to overfitting. On the other hand, it is not practical for high-dimensional data and can introduce some noise caused by the overlapping of classes when performing kNN.

• Resample with ratios other than 1:1, since a certain degree of imbalance is acceptable.

• Combine several unbalanced datasets to obtain a balanced one.

• Assign higher weights to samples from under-represented classes and lower weights to the rest.

• Modify the classification thresholds. This was used by the winners of sub-task C in the 2019 edition of OffensEval, as described in their system description paper [74].

• Use penalized models such as penalized-SVM or penalized-LDA, which penalize classification mistakes on the minority class more heavily during training.

To keep things simple, this work only uses the over-sampling technique, as there is not enough data to perform under-sampling. However, it is not applied by default but only in those cases where it is seen to bring some benefit. The participation in this year's OffensEval sub-task C, where second position was achieved by over-sampling the dataset, confirms that this is a reliable technique for the problem at hand [49]. Other approaches, like resampling at different ratios or modifying the classification thresholds, were tested as well, leading to similar results.

It is important to note that over-sampling must be applied after the data has been split for cross-validation (i.e. only to the training folds); applying it before would produce overly optimistic results that do not generalize to new data.
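
As an illustration, the sketch below applies random over-sampling to the training split only, assuming the tweets live in a pandas DataFrame with hypothetical 'text' and 'label' columns; the file path is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical DataFrame with 'text' and 'label' ('OFF'/'NOT') columns.
df = pd.read_csv("olid_train.tsv", sep="\t")  # placeholder path

# Split first, so the test set keeps its natural class distribution.
train, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=0)

majority = train[train["label"] == "NOT"]
minority = train[train["label"] == "OFF"]

# Randomly duplicate minority-class tweets until both classes are the same size.
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)
```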


4.2.3 Pre-processing steps

In any machine learning project, the first step after collecting the dataset is to explore it in order to detect what needs to be corrected (e.g. inconsistencies, missing information, out-of-range values, . . . ). Then, a series of transformations is applied to the data to produce a more reliable training set. This initial step is crucial, since bad quality data will surely lead to bad quality results, and it is especially important when dealing with microblog content, which tends to be unstructured and noisy. This project experimented with the following noise removal and normalization steps:

Desensitization
In the original datasets, every web address is replaced by the general token 'URL'. Also, all user mentions, which always start with the at symbol (@) on Twitter, were replaced by the general token '@USER' to respect users' anonymity. However, in some cases this had to be modified to simply 'user' so that it could be recognized as a single token. For instance, the tokenizer of the BERT-Base Multilingual Cased model (which does not lowercase the input text) would split the word '@USER' into three separate tokens: '@', 'US' and '##ER'.

Hashtag segmentation
Hashtags are known to be an integral part of Twitter. They are used as labels to group tweets about a same topic, so that users can easily find content about a subject of interest. The problem from an NLP point of view is that they usually consist of a sentence with no spaces between the words, which makes them hard to tokenize correctly. This is why, as an initial pre-processing step, hashtags are divided into recognizable words.

The fact that they always start with the hash symbol (#) makes hashtags easy to detect withregular expressions. Then, in order to know where to split, we assume that every new wordin a hashtag starts with a capital letter since it is how they are commonly used (e.g. #thi-sIsAnExample). Regular expressions are used again to insert a blank space every time that anon-capital letter is followed by a capital letter. By doing this and removing the hash symbol,most hashtags are correctly converted to actual sentences. However, since there is no stan-dardized way of spelling hashtags, it is conceivable that some of them might be wrongly splitby this method (i.e. #nocapitallettersatall or #ONLYCAPITALS). There are open-sourced seg-mentation modules7 that would fix this problem but they are not available in all languageson which this work focuses.
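
A minimal sketch of this camel-case splitting heuristic is shown below; the function name is not from the thesis code.

```python
import re

def split_hashtag(token: str) -> str:
    """Turn a camelCase hashtag into space-separated words."""
    if not token.startswith("#"):
        return token
    body = token[1:]
    # Insert a space wherever a lowercase letter is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", body)

print(split_hashtag("#thisIsAnExample"))        # this Is An Example
print(split_hashtag("#nocapitallettersatall"))  # left unsplit, as noted above
```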

Tokenization
Tokenization is the task of splitting a sequence of words into individual units, or splitting tweets into words in this particular case. These units are referred to as 'tokens', and are defined by StanfordNLP8 as instances of a sequence of characters that are grouped together as a useful semantic unit for processing. Since the dataset is made up exclusively of tweets, it seemed appropriate to use the TweetTokenizer module from the NLTK library9. Unlike regular tokenizers, this particular one is able to detect special tokens such as ':)' or '->', as well as separating consecutive emojis into separate tokens.
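
A quick illustration of the tokenizer's behaviour on a made-up tweet (the expected output is shown as a comment):

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
print(tokenizer.tokenize("@USER that was sooo good :) #notBad"))
# ['@USER', 'that', 'was', 'sooo', 'good', ':)', '#notBad']
```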

In the case of BERT, tokenization was done with its own built-in WordPiece tokenizer, which splits words into smaller sub-word units. This is why, when using BERT, it is preferable to skip the previous pre-processing step (hashtag segmentation), as BERT's tokenizer breaks out-of-vocabulary words into the largest possible sub-words contained in its vocabulary. In the worst-case scenario the word is split into individual characters, which is nevertheless quite unlikely considering the size of BERT's training corpus.

7 https://github.com/grantjenks/python-wordsegment
8 https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
9 https://www.nltk.org/api/nltk.tokenize.html

Lowercasing
All characters are lowercased to ensure that variations of the same word are not associated with different embeddings. It is also necessary for methods that require feature extraction, so that tokens can match blacklisted terms regardless of how they have been written.

Reducing lengthened words
Repetition of characters is a common way for social media users to express emotions such as excitement. In order to reduce the number of out-of-vocabulary words it is important to correct these intentional spelling mistakes, which can again be done with regular expressions. The approach is to find all groups of three or more repeated characters (e.g. '!!!!!!!') and reduce their length to just two characters. The outcome may still not be grammatically correct (e.g. 'amaaaaaazing' would be reduced to 'amaazing'), but this could easily be fixed by a spelling correction algorithm. The problem is that such a tool is easy to find for English but not for all the languages of interest.
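
A regular expression implementing this reduction might look as follows (a sketch, not the exact code used in the thesis):

```python
import re

def reduce_lengthening(text: str) -> str:
    """Collapse runs of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(reduce_lengthening("soooo gooooood!!!!!!!"))  # 'soo good!!'
```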

Stopword removal
Stopwords are the most commonly used words in a language (e.g. 'the' or 'is' in English). It is common practice to filter them out, since the text's semantics should remain intact after removing such words. It is important to note that in tasks such as offense target identification it might not be a good idea to remove stopwords, since they can carry valuable information. This project used the stopword lists from the NLTK library, which are available for several languages.

Stemming or Lemmatization
These are both ways of identifying a canonical representative for a set of related word forms, so either can be used for that purpose. Stemming consists of stripping a word of its prefixes and suffixes, aiming to reduce inflectional forms of a word to a common base form, so that related words are mapped to the same stem. Lemmatization does the same but through a morphological analysis of words, which makes it more accurate but slower to run. There is no stemming or lemmatization tool that supports all languages, so a different one had to be found for each language. For example, for Danish, stemming was done with the SnowballStemmer [58] from NLTK (which also supports English, among others) and lemmatization with the Lemmy lemmatizer for spaCy.
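
For Danish, for instance, the NLTK stemmer can be used as follows (the example words are my own):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("danish")
# Maps inflected forms (e.g. definite and plural noun forms) to a common stem.
print([stemmer.stem(w) for w in ["hunden", "hundene", "hundens"]])
```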

Emoji removal
This is not mandatory but makes sense in some cases. For example, when using BERT there is no point in feeding it emojis, since they are considered out-of-vocabulary words and are therefore mapped to the 'UNK' token. In any case, as emojis obviously play an important role in a tweet's semantics, their hidden emotion should be captured in some way, ideally as a meaningful feature.

It is important to note that the order in which these steps are applied matters. For example, if lowercasing is done before hashtag segmentation, the latter will have no effect.

It is believed that spelling correction algorithms would be useful in this phase, but they are not so easy to find in languages other than English. There are even resources for contraction expansion (e.g. 'don't' → 'do not') or irregular word correction (e.g. 'bro' → 'brother'). All these tools can help normalize noisy tweets, although a certain amount of noise might be helpful for abuse detection if captured in the feature extraction phase, which is explained in the following section.


4.3 Traditional machine learning approach

Despite not being the most popular approach nowadays, several machine learning algorithms have been used for similar tasks in the literature. This part of the thesis focuses on the Support Vector Machine (SVM), but Naïve Bayes and Random Forest are also tested as baselines.

For the multinomial Naïve Bayes, an additive smoothing parameter of 0.01 is used and the class prior probabilities are learned from the data. The SVM classifier has a linear kernel, a regularization parameter of 2.25 and a squared l2 penalty. Word unigrams, bigrams and trigrams were used as features. In the case of Random Forest, only unigrams are considered, as this achieved slightly better cross-validated results during grid search. The number of trees in the forest is set to 200 and no maximum depth is specified. The Gini index is used to measure the quality of each split.
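
As an illustration, the sketch below builds the SVM variant as a scikit-learn pipeline. The variable names `texts` and `labels` are hypothetical placeholders for the pre-processed tweets and their OFF/NOT labels, and reading the 'squared l2 penalty' as LinearSVC's default settings is an assumption.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# TF-IDF over word uni-, bi- and trigrams feeding a linear SVM with C = 2.25,
# mirroring the configuration described above (pre-processing assumed already done).
svm_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("svm", LinearSVC(C=2.25)),  # penalty='l2', loss='squared_hinge' by default
])

# texts: list of pre-processed tweets; labels: list of 'OFF'/'NOT' strings (placeholders).
scores = cross_val_score(svm_clf, texts, labels, cv=10, scoring="f1_macro")
print("Mean cross-validated F1:", scores.mean())
```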

Different combinations of pre-processing steps and vectorization were tested. In the end, tweets are vectorized with TF-IDF and all pre-processing steps explained in Section 4.2.3 are applied to the data:

• User mentions and web addresses are replaced by general tokens.

• Hashtags are converted to sentences.

• All characters are lowercased.

• Text is tokenized with NLTK’s TweetTokenizer.

• Runs of repeated characters are reduced.

• NLTK stopword lists are used to remove the most common words.

• Different stemmers were used for each language, as stemming produced very similarresults to lemmatization.

The main limitation of these methods is that they inevitably require some domain expertise to be applied in the so-called feature engineering process. Feature engineering implies using domain knowledge to extract those attributes from the data that can help a machine learning algorithm make better predictions. The feature selection and extraction steps come right after cleaning and pre-processing the data, and before feeding it to the model.

Surface-level features like character or token n-grams are highly predictive on their own, but tend to be used in combination with other features for improved performance. In general, it can be said that bigrams and trigrams are better than unigrams because they take into account the context of nearby words. All three (1-grams, 2-grams and 3-grams) are used when converting tweets into a numerical format with the TF-IDF model.

Then, in order to answer research question 1, sentiment analysis information should be embedded in another kind of feature. For instance, a sentiment lexicon could be used to count the number of positive, negative and neutral words in a post and use this information as features. It can also help to numerically represent the sentiment hidden in emojis and use that knowledge as a feature, since emojis are capable of completely changing the essence of a tweet. This is attempted using the so-called Emoji Sentiment Ranking [47], a language-independent lexicon that provides sentiment scores for the most popular emojis on the internet. A few examples can be seen in Figure 4.1.


Figure 4.1: First rows of the Emoji Sentiment Ranking.

The aforementioned emoji lexicon is used to provide some extra information to the system (in the form of numeric features) for those tweets that contain at least one emoji. This work also experiments with the VADER sentiment lexicon [26] and the DeepMoji model [20], in order to quantify how useful sentiment can be as a feature for the detection of offensive tweets.

VADER, which stands for Valence Aware Dictionary for sEntiment Reasoning, is a rule-based model for sentiment analysis that performs particularly well in the microblogging domain. According to the official paper, the model's performance was comparable to that of individual human raters at matching ground truth, even outperforming them in terms of classification accuracy. It was chosen for being more sensitive than other highly regarded sentiment lexicons in social media contexts.

DeepMoji is a deep learning model developed at MIT to learn richer representations of emotional content in text than typical binary sentiment analysis (either positive or negative). It was trained on over a billion tweets to predict their emojis, and as a result the model achieved state-of-the-art results in emotion, sentiment and sarcasm detection. Notice in Figure 4.2 how it is able to capture the positiveness of the last sentence (the slang 'This is the shit.') and how well it distinguishes the different usages of the word 'love'. The project is open-sourced10 and has an online demo on its website11, where anyone is welcome to contribute by teaching the AI about emotions.

Given an input sentence (i.e. a tweet), the model returns a list of the top five most likely emojis with their respective probability estimates. To compute a final score that can be used as a feature, each of the five emojis is converted to a float that quantifies its emotional content. This is done using the aforementioned Emoji Sentiment Ranking. Then, the weighted average of each emoji's sentiment gives the final score for the tweet.
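
The sketch below illustrates how such scores could be computed; the emoji values shown are only illustrative stand-ins for the Emoji Sentiment Ranking entries, and the function names are not from the thesis code.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

# Illustrative stand-in for the Emoji Sentiment Ranking: emoji -> overall score in [-1, 1].
EMOJI_SENTIMENT = {"😂": 0.22, "❤": 0.75, "😭": -0.09}

def deepmoji_feature(predictions, default=0.0):
    """Weighted-average sentiment of the emojis predicted by DeepMoji.

    `predictions` is a list of (emoji, probability) pairs; emojis missing from
    the lexicon fall back to a neutral default score.
    """
    total = sum(p for _, p in predictions)
    if total == 0:
        return default
    return sum(EMOJI_SENTIMENT.get(e, default) * p for e, p in predictions) / total

def vader_feature(tweet):
    """Compound polarity score of the tweet according to the VADER lexicon."""
    return SentimentIntensityAnalyzer().polarity_scores(tweet)["compound"]

# Illustrative DeepMoji top-3 output for some tweet:
print(deepmoji_feature([("😂", 0.41), ("😭", 0.22), ("❤", 0.10)]))
```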

As a final step, a collection of lexical features is considered as well. Even though lexical features are known not to perform well on their own, they are commonly used in addition to another feature set. Simple features that have been widely used in the literature include counts of capital letters, non-alphanumeric characters, user mentions, punctuation marks, average word length and the total length of the tweet. Another feature that could boost the classification performance is the presence of domain-specific terms, which in this particular case would be profanity words. A blacklist composed of 2,644 English words is used for this purpose12.

10 https://github.com/bfelbo/DeepMoji
11 https://deepmoji.mit.edu/

Figure 4.2: DeepMoji examples. Source: [20]

4.4 Deep learning approach

Most of the deep learning approaches present in the literature use either Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) or Long Short-Term Memory networks (LSTM), quite often in their bidirectional form (BiLSTM). All of them have shown promising results, without any of them clearly standing out above the rest. For this work, a CNN was implemented, aiming to improve the results obtained by the traditional machine learning approach. It also allows a comparison of different types of embeddings, which might provide some insight in relation to the second research question of the project.

The system described in this section is a multi-branch 1D-CNN classifier with pre-trained embeddings. The proposed architecture is similar to the one presented by Y. Kim for the task of sentence-level classification [29], with slight variations. Although other works use deep networks made up of many convolutional layers [90] [12], that paper proves that a simple model with just one convolutional layer can achieve state-of-the-art results in tasks such as question or sentiment classification after little hyperparameter tuning.

The CNN models from [29] are built on top of Word2Vec. However, this project also experiments with fastText embeddings, aiming to get some insight into which type of low-dimensional representation is better for the offensive language detection task: word-based or sub-word level. In neither case was further task-specific tuning of the embeddings done, as Kim mentions in his paper that it provides little performance improvement [29].

Figure 4.3 shows the architecture of the convolutional network in an intuitive way. First of all, the input layer provides the pre-processed tweets, which have been padded so that all inputs have the same length. The maximum sequence length was set to 200. Longer tweets are simply truncated, but these are not many considering that the maximum length of a tweet is 280 characters.

Secondly, for each tweet that is fed to the system, the embedding layer generates a matrix with the word vectors as rows. This means that the output matrix has 200 rows (the padded sequence length) and as many columns as necessary to fit the embedding vectors (i.e. 300 for Word2Vec and 100 for fastText).

12 http://metadataconsulting.blogspot.com/2018/09/Google-Facebook-Office-365-Dark-Souls-Bad-Offensive-Profanity-key-word-List-2648-words.html


Figure 4.3: Toy example extracted from [29]

Then, three sets of filters are applied in a 1D convolution step. All filters must have the exact same width as the embedding matrix (100 or 300), and different heights in order to capture information from a variety of n-grams. Tri-gram, four-gram and five-gram filters were used, since this combination obtained better experimental results than other sets of sizes (e.g. [2,3,4] or [2,3,4,5,6]). For each filter size, 200 filters were used.

Next, the generated feature maps are forwarded to a 1-max-pooling layer (generally better than other types of pooling for document classification [91]), which performs dimensionality reduction by extracting the largest value among the features obtained from each filter. The results from all three parallel branches are then concatenated and flattened to produce a single 1-dimensional vector that represents the input tweet, whose length is equal to the number of filters.

Finally, after the convolution layers, there is a fully-connected hidden layer with a dropout rate of 0.5, preceding another fully-connected layer (the output layer) that is responsible for the predictions. The final layer, which has a single output node, uses a softmax activation function.
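
A minimal Keras sketch of this multi-branch architecture follows. The hidden-layer size and the single sigmoid output unit are assumptions here (standing in for the binary decision layer), and `embedding_matrix` is a hypothetical pre-built matrix of Word2Vec or fastText vectors.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM = 200, 50000, 300  # vocabulary size is illustrative

def build_cnn(embedding_matrix=None):
    """Multi-branch 1D-CNN in the spirit of the architecture described above."""
    inputs = layers.Input(shape=(MAX_LEN,))
    # Pre-trained Word2Vec/fastText vectors are kept frozen, mirroring the no-tuning choice above.
    embedding = layers.Embedding(
        VOCAB_SIZE, EMB_DIM,
        weights=[embedding_matrix] if embedding_matrix is not None else None,
        trainable=False)(inputs)
    branches = []
    for kernel_size in (3, 4, 5):                           # tri-, four- and five-gram filters
        conv = layers.Conv1D(200, kernel_size, activation="relu")(embedding)
        branches.append(layers.GlobalMaxPooling1D()(conv))  # 1-max pooling per branch
    merged = layers.Concatenate()(branches)
    hidden = layers.Dropout(0.5)(layers.Dense(128, activation="relu")(merged))
    output = layers.Dense(1, activation="sigmoid")(hidden)  # binary OFF/NOT decision
    model = models.Model(inputs, output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn(np.random.rand(VOCAB_SIZE, EMB_DIM))  # random matrix just for illustration
model.summary()
```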

4.5 Transfer learning approach

In the last two years, deep contextualized pre-trained language models have revolutionized the field, achieving state-of-the-art results in several NLP tasks. Sequence classification being one of those tasks, it was a must to experiment with them in this project. The transformers module from the Hugging Face PyTorch library was used to fine-tune different pre-trained BERT models for the classification task. Even though the library offers a wide range of novel transformer models, the main interest here was BERT.

The first step right after loading and pre-processing the data is to convert it into the format that BERT expects as input. First of all, the special tokens CLS and SEP have to be respectively prepended and appended to each tweet. The CLS token is very important, as it carries the sentence embedding generated by each transformer layer, and the classifier on top of the stack is exclusively fed with the information contained in the CLS token from the last transformer layer (see Figure 4.4). The SEP token, on the other hand, simply indicates the end of an input sentence; it was designed for two-sentence tasks but is still required in other tasks like sentence classification.


Figure 4.4: BERT for single-sentence classification. Source: [15]

Next, all tweets are padded (with the special PAD token) to a maximum sequence length that is defined after analysing the length distribution of the dataset. Longer tweets must be truncated. As a limitation, the maximum sequence length that BERT can handle is 512 tokens. However, it is worth using a lower value if possible in order to significantly reduce training time. An attention mask must be generated as well, so that the model can differentiate between the tokens that were originally in the sentence and the ones that were added for padding. The mask is basically an array (later converted to a tensor) that contains zeros in the PAD tokens' positions and ones elsewhere.

Then the built-in tokenizer is used to split each sentence into recognizable tokens that are later mapped to their corresponding indices in BERT's vocabulary. It is important not to use third-party tokenizers, because they would generate out-of-vocabulary tokens, while BERT's tokenizer splits all unrecognised words into sub-word units that are present in its vocabulary. Finally, all the encoded sentences are stored in a list that is converted to a PyTorch tensor so that operations on it can run on Graphics Processing Units (GPU). The dimensions of such tensors are always the same: the batch size determines the width and the maximum sequence length determines the height.
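
The sketch below shows how this input preparation can be done with the Hugging Face tokenizer, assuming a reasonably recent transformers release (older versions expose the same functionality through encode_plus); the example tweets are placeholders.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

tweets = ["@USER this is an example tweet", "URL another one"]  # placeholder data
# Adds [CLS]/[SEP], pads to max_length with [PAD], truncates longer tweets and
# returns the matching attention mask (1 for real tokens, 0 for padding).
encoded = tokenizer(tweets, padding="max_length", truncation=True,
                    max_length=128, return_tensors="pt")

input_ids = encoded["input_ids"]            # shape: (batch_size, 128)
attention_mask = encoded["attention_mask"]  # same shape as input_ids
```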

At this point the data is ready for fine-tuning. The sentence classification task requires a BERT model with a classifier on top, which is why Hugging Face's BertForSequenceClassification is a good choice13. All results presented in Chapter 5 are obtained using the smaller version, BERT-Base, which in contrast to BERT-Large has twelve layers instead of twenty-four. Although the results presented in the official paper [15] were obtained with the Large model, it was discarded here for requiring more computational power and longer runtimes.

13 https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

Before training, the loss function needs to be defined (i.e. binary cross-entropy loss) and an optimization algorithm has to be chosen (i.e. Adam [30]). In addition, a few hyperparameters must be set:

• Number of epochs: An epoch is completed when the entire dataset has been passed forward and backward through the neural network. Since the learning process is iterative, the dataset has to be passed more than once to properly update the network's weights. However, if the number of epochs is too high the model may become overfitted. The authors of [15] recommend a value between 2 and 4.

• Batch size: The total number of training examples present in a single batch. For example, if a dataset has 200 samples and the batch size is 50, it takes four iterations (200 divided by 50) to complete one epoch. It is necessary to split the dataset into smaller batches because a full epoch is usually too big to be fed to the machine at once. This value will be higher or lower depending on the maximum sequence length parameter, because if both are too high they may cause out-of-memory errors. As an example, the maximum batch size on a single Titan X GPU (12 GB RAM) with TensorFlow 1.11.0 is 32 for input sequences of length 128, but it can be increased to 64 if the maximum sequence length is reduced to 6414.

• Learning rate: Determines the step size that the Adam optimizer takes at each iteration. The model was trained with a learning rate of 1e-4 with linear decay. According to [15], recommended values for fine-tuning are 2e-5, 3e-5 or 5e-5.

A recent publication that investigates fine-tuning methods for BERT [73] was used to gain some insight into the best settings for the text classification task.

Once the training process begins, the model is fed with the tensors containing the sentence embeddings, which are used by the final fully-connected layer to generate the logits for the positive class. These logits are compared to the true labels in order to calculate the loss value and update the weights accordingly in the optimization step. After each epoch the validation set is used to measure the performance on unseen data. It is good practice to keep track of the evolution of both training and validation loss to detect whether the model is being overfitted by excessive training. Finally, the performance of the fine-tuned model is evaluated on the test data.
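
A condensed fine-tuning loop along these lines is sketched below. It reuses the `input_ids` and `attention_mask` tensors from the previous sketch and assumes a hypothetical `labels` tensor of 0/1 classes; the learning rate follows the value used in the experiments of Section 5.3, and accessing the loss through an output object assumes a recent transformers release.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2).to(device)

# labels: hypothetical tensor of 0/1 classes aligned with input_ids/attention_mask above.
loader = DataLoader(TensorDataset(input_ids, attention_mask, labels),
                    batch_size=32, shuffle=True)

epochs = 4
optimizer = AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=epochs * len(loader))

model.train()
for epoch in range(epochs):
    for batch_ids, batch_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch_ids.to(device),
                        attention_mask=batch_mask.to(device),
                        labels=batch_labels.to(device))
        outputs.loss.backward()   # classification loss computed internally by the model
        optimizer.step()
        scheduler.step()          # linear learning-rate decay
```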

4.6 Evaluation

Among the evaluation measures presented in Section 2.4, the metric chosen for this work is the F1 score. The main reason is that, as seen in the dataset analysis of Section 4.2, only around 20% of the tweets in the binary classification task are offensive. This means that precision and recall alone are not a reliable option. The AUC measure was also discarded for being too optimistic in highly imbalanced problems. Another compelling reason to use F1 is that it is the most widely used measure in related work, and comparison is always easier when the same measure is used. Apart from the F1 value, accuracy will also be reported to better understand the results.
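
For reference, both reported measures can be computed with scikit-learn as shown below; `y_true` and `y_pred` are hypothetical label arrays from any of the classifiers above, and macro averaging is an assumption made here.

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true, y_pred: gold and predicted OFF/NOT labels (placeholders).
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Accuracy:", accuracy_score(y_true, y_pred))
```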

14 https://github.com/google-research/bert


5 Results

5.1 Traditional machine learning approach

The aim of this initial experiment is to answer research question 1, about whether using sentiment as a feature can help with the problem of offensive language detection.

As a starting point, three different classifiers were used as baselines: Multinomial Naïve Bayes, Support Vector Machine (SVM) and Random Forest. In all cases, the models were tuned using a cross-validated grid search over a list of parameters. A wide range of hyperparameters was tested in order to find the optimal configurations. Then, the best hyperparameters were used to train and test a final classifier using 10-fold cross-validation. A ratio of 10:1 was used to split the data into training and test sets, and 10% of the training data was used for validation during the learning process.

Language   Model          F1      Accuracy
Arabic     Naïve Bayes    0.8065  0.8904
Arabic     Linear SVM     0.8508  0.9097
Arabic     Random Forest  0.8188  0.8964
Danish     Naïve Bayes    0.6149  0.8692
Danish     Linear SVM     0.7461  0.9123
Danish     Random Forest  0.7772  0.9182
English    Naïve Bayes    0.6378  0.7062
English    Linear SVM     0.7169  0.7588
English    Random Forest  0.7183  0.7739
Greek      Naïve Bayes    0.6819  0.7820
Greek      Linear SVM     0.7706  0.8350
Greek      Random Forest  0.7769  0.8446
Turkish    Naïve Bayes    0.6476  0.8265
Turkish    Linear SVM     0.7437  0.8417
Turkish    Random Forest  0.6777  0.8465

Table 5.1: Naïve Bayes, SVM and Random Forest results for each dataset.


Table 5.1 shows the results of the traditional machine learning models on the different datasets. As can be seen, F1 scores vary from 61.49% (Naïve Bayes on Danish) to 85.08% (SVM on Arabic). Looking at the differences between models, it is clear that SVM and Random Forest perform much better than Naïve Bayes overall. In terms of languages, the highest scores are obtained for Arabic in all cases. On the other hand, the lowest F1 score with Random Forest was 67.77% for the Turkish dataset, and 71.69% for SVM on the English set. Naïve Bayes obtained results between 60% and 70%, except for Arabic, which exceeded 80%.

Since the full datasets were used in the previous experiment, the evaluated models were trained on more or less data depending on the language. For a more objective comparison, all datasets were reduced to the size of the Danish dataset, preserving the exact same proportion: 2,864 NOT labels and 425 OFF labels. The results obtained in this case are presented in Table 5.2. It can be seen that there was a significant drop in all categories due to the shortage of data and the increased class imbalance. This time the values range from 53.14% (Naïve Bayes on English) to 78.55% (Linear SVM on Arabic). Looking at the languages, Arabic again obtained the highest results, although Random Forest was slightly better on Danish on this occasion. On the other hand, the lowest scores were registered for English.

Language   Model          F1      Accuracy
Arabic     Naïve Bayes    0.7238  0.9048
Arabic     Linear SVM     0.7855  0.9179
Arabic     Random Forest  0.7474  0.9124
Danish     Naïve Bayes    0.6149  0.8692
Danish     Linear SVM     0.7461  0.9123
Danish     Random Forest  0.7772  0.9182
English    Naïve Bayes    0.5314  0.8568
English    Linear SVM     0.5773  0.8738
English    Random Forest  0.6284  0.8708
Greek      Naïve Bayes    0.5857  0.8699
Greek      Linear SVM     0.6663  0.8927
Greek      Random Forest  0.7198  0.9009
Turkish    Naïve Bayes    0.5547  0.8693
Turkish    Linear SVM     0.6043  0.8763
Turkish    Random Forest  0.6251  0.8860

Table 5.2: Same experiment as Table 5.1 with each dataset reduced to 3,289 samples.

In the remaining experiments of this section, the previous SVM classifier with TF-IDF alone is compared to the same classifier with additional features. The configuration is the same as that used for the previous experiments, as are the pre-processing steps.

Table 5.3 displays the results for each combination of features. The Emojis_Neg, Emojis_Neu, Emojis_Pos and Emojis_Score features were computed using the Emoji Sentiment Ranking, and correspond to the negative, neutral, positive and overall scores given by that lexicon, respectively. For every tweet containing one or more emojis, the average sentiment score was calculated by translating each emoji to its numeric value from the ranking. These values were also used to convert the emojis generated by the DeepMoji model to numbers. The weighted averages of the top one, three and five emojis were used as features, providing similar results (F1 scores of 0.7194, 0.7222 and 0.7196, respectively). The VADER sentiment lexicon was also used to assign a sentiment score to each tweet, obtaining a similar result (0.7193). Finally, different combinations of these features were tested to see how they complement each other. The best combination of features produced an F1 score of 73.03%.


Model                                                 F1      Accuracy
SVM + TFIDF                                           0.7169  0.7588
SVM + TFIDF + DeepMoji_TOP1                           0.7194  0.7469
SVM + TFIDF + DeepMoji_TOP3                           0.7222  0.7585
SVM + TFIDF + DeepMoji_TOP5                           0.7196  0.7473
SVM + TFIDF + VADER                                   0.7193  0.7469
SVM + TFIDF + VADER + DeepMoji                        0.7245  0.7612
SVM + TFIDF + avg(VADER, DeepMoji)                    0.7249  0.7618
SVM + TFIDF + Emojis_Neg + Emojis_Pos                 0.4052  0.5822
SVM + TFIDF + Emojis_Neg + Emojis_Neu + Emojis_Pos    0.5943  0.6285
SVM + TFIDF + Emojis_Score                            0.7184  0.7703
SVM + TFIDF + Emojis_Score + avg(VADER, DeepMoji)     0.7303  0.7737
SVM + TFIDF + Sentiment + Lexical                     0.7267  0.7216

Table 5.3: Results for SVM with different sets of emotion features using the English dataset.

As VADER is an English lexicon and DeepMoji is trained exclusively on English data, these initial experiments were performed using only the English dataset. All the tools used to gather the results presented above were described in Section 4.3.

As a final experiment regarding features, a collection of lexical features was added to the set:

• Number of user mentions.

• Number of URLs.

• Percentage of capital letters.

• Number of blacklisted words.

• Number of punctuation marks.

• Number of characters.

• Number of words.

• Average word length.

As can be seen in the very last row of Table 5.3, combining these features with all the sentiment features (emojis, VADER and DeepMoji) resulted in a 72.67% F1 score and 72.16% accuracy.

5.2 Deep learning approach

The following experimental setup aims to answer research question 2 by feeding a Convolutional Neural Network with different types of embeddings. The ambition is to see whether there is a significant difference between using word-level and subword-level pre-trained embeddings.

The CNN architecture described in Section 4.4 was used with two different embeddings: Word2Vec1 and fastText2. The model architecture used for Word2Vec is continuous skip-gram, with a vector size of 100 and no lemmatization performed. On the other hand, the fastText embeddings have dimension 300. The embeddings of out-of-vocabulary words were randomly initialized.

1 http://vectors.nlpl.eu/repository/
2 https://fasttext.cc/docs/en/pretrained-vectors.html

In this case, emojis were not removed from the original tweets, in the hope that their unusual character combinations might be naturally learnt by the network when working at character level [90]. The rest of the pre-processing steps were applied as usual. In contrast to the previous experiment, there is no need to perform feature extraction in this case, as it is done automatically by the convolutional layers. This is why the CNN is considered a feature-extracting architecture.

After experimenting with different numbers of epochs and batch sizes, it was decided to train the system for 10 epochs using a batch size of 32. Early stopping and model checkpoint callbacks were used to prevent overfitting the training set in case the number of epochs was too high. One tenth of the data was used for testing, and the training set was further split so that 10% of it could be used for validation. In the cases where balancing techniques were used, they were only applied to the training set. Results can be found in Table 5.4.

Language   Embeddings  Balancing  F1     Accuracy
Arabic     Word2Vec    True       0.840  0.908
Arabic     Word2Vec    False      0.800  0.883
Arabic     fastText    True       0.814  0.874
Arabic     fastText    False      0.761  0.850
Danish     Word2Vec    True       0.753  0.899
Danish     Word2Vec    False      0.735  0.894
Danish     fastText    True       0.770  0.915
Danish     fastText    False      0.710  0.884
English    Word2Vec    True       0.744  0.723
English    Word2Vec    False      0.690  0.746
English    fastText    True       0.703  0.736
English    fastText    False      0.700  0.765
Greek      Word2Vec    True       0.669  0.781
Greek      Word2Vec    False      0.653  0.795
Greek      fastText    True       0.729  0.810
Greek      fastText    False      0.690  0.794
Turkish    Word2Vec    True       0.713  0.842
Turkish    Word2Vec    False      0.683  0.849
Turkish    fastText    True       0.704  0.823
Turkish    fastText    False      0.681  0.844

Table 5.4: Comparison of CNN results for different languages and embeddings.

It can be deduced from the results above that balancing the datasets does help, since the F1 scores were higher every time the over-sampling technique was applied, regardless of the language or the type of embedding. Word2Vec provided better results for Arabic (84.4%), English (74.4%) and Turkish (71.3%). Conversely, Danish (77.0%) and Greek (72.9%) obtained better results when fastText was used.

Once again, aiming for a more objective comparison, the larger datasets were downsampled to the size of the smallest one (Danish) to get an idea of the actual impact of the dataset size. On this occasion balancing was not applied, so all the subsets had 12% offensive tweets, just like the original Danish set (2,864 non-offensive tweets and 425 offensive ones). The collected results are displayed in Table 5.5, showing a significant performance decrease in terms of F1 score.


Language   Embeddings  F1     Accuracy
Arabic     Word2Vec    0.738  0.903
Arabic     fastText    0.636  0.863
Danish     Word2Vec    0.735  0.894
Danish     fastText    0.710  0.884
English    Word2Vec    0.662  0.839
English    fastText    0.559  0.869
Greek      Word2Vec    0.585  0.821
Greek      fastText    0.575  0.799
Turkish    Word2Vec    0.484  0.796
Turkish    fastText    0.490  0.809

Table 5.5: Same experiment as Table 5.4 with each dataset reduced to 3,289 samples.

5.3 Transfer learning approach

Finally, the cross-lingual capabilities of multilingual models were studied to answer research question 3. As the datasets are in five different languages, BERT's multilingual model (henceforth M-BERT) was mainly used. Nonetheless, the English model (EN-BERT) was also tested, with a necessary translation step prior to fine-tuning.

The models were fine-tuned for 4 epochs using a batch size of 32. The sequence length was set to 128, as the vast majority of tokenized tweets were short enough to fit in a vector of that length. For optimization, Adam was used with a learning rate of 3e-5 and no weight decay. In both cases the Base version was used, which is composed of a stack of 12 transformer encoders, in contrast to the 24 encoders of the more computationally expensive Large model.

Language   F1     Accuracy
Arabic     0.875  0.920
Danish     0.774  0.895
English    0.744  0.773
Greek      0.806  0.848
Turkish    0.784  0.880

Table 5.6: M-BERT results.

Language           F1     Accuracy
Arabic->English    0.873  0.919
Danish->English    0.799  0.915
English            0.791  0.807
Greek->English     0.809  0.859
Turkish->English   0.738  0.849

Table 5.7: EN-BERT results.

Tables 5.6 and 5.7 show the results for each dataset using the multilingual and the monolingual models, respectively. The results obtained with M-BERT range from 74.4% (English) to 87.5% (Arabic). Translating the different datasets to English and feeding them to EN-BERT provided similar results, especially for Arabic and Greek, which practically did not vary. English and Danish experienced an increase of around 3% in terms of F1 score, while Turkish was the only dataset that was significantly penalised, with a decrease of 4.6% (from 78.4% to 73.8%). This might imply that the quality of the translation from Turkish to English was not as good as for the other languages.

Next, M-BERT was used again to examine its cross-lingual model transfer ability. Since most of the related work focuses on what models trained in a specific language (usually English) capture about that same language, it seemed interesting to experiment with M-BERT's ability to generalize across languages. Similarly to what was done by Pires et al. [55] for the tasks of Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, the multilingual model was evaluated in a language other than the one used in the fine-tuning phase. The results for all possible combinations of training-test languages are summarized in Table 5.8.

Since two datasets were involved in each run (one for training and one for testing), it was possible to use more data than on previous occasions. In order to take advantage of that, the Danish set was temporarily left aside due to its small size. In all cases the same amount of data was used for training and evaluation:

• Training set of 8,000 samples: 6,400 NOT and 1,600 OFF (20% imbalance).

• Test set of 4,000 samples: 3,200 NOT and 800 OFF (20% imbalance).

Train \ Test    English         Greek           Turkish         Arabic
                F1      ACC     F1      ACC     F1      ACC     F1      ACC
English          -       -      0.583   0.745   0.550   0.755   0.496   0.783
Greek           0.470   0.777    -       -      0.532   0.739   0.508   0.793
Turkish         0.555   0.782   0.560   0.761    -       -      0.548   0.815
Arabic          0.486   0.766   0.483   0.775   0.487   0.789    -       -

Table 5.8: F1 and accuracy results for the zero-shot learning experiment with M-BERT (rows: fine-tuning language, columns: evaluation language).

Table 5.8 shows that the results dropped significantly overall. In comparison to the results obtained when training and testing were done in the same language, where F1 scores were between 74% and 87%, the values are now around 50%. Interestingly enough, the highest F1 score was obtained when training M-BERT on English and evaluating on Greek (58.3%), while the lowest (47.0%) was the exact opposite situation: training on Greek and testing on English.

Similarly to what was done before, the same experiment was repeated with the reduced datasets, this time including Danish in the comparison. The sizes of the datasets are summarized below:

• Training set of 3,289 samples: 2,864 NOT and 425 OFF (12.9% imbalance).

• Test set of 1,645 samples: 1,432 NOT and 213 OFF (12.9% imbalance).

Train \ Test    Danish          English         Greek           Turkish         Arabic
                F1      ACC     F1      ACC     F1      ACC     F1      ACC     F1      ACC
Danish           -       -      0.610   0.803   0.494   0.867   0.499   0.861   0.479   0.871
English         0.618   0.856    -       -      0.489   0.791   0.479   0.790   0.448   0.800
Greek           0.493   0.810   0.457   0.789    -       -      0.538   0.722   0.541   0.738
Turkish         0.483   0.869   0.447   0.800   0.449   0.800    -       -      0.452   0.801
Arabic          0.463   0.863   0.459   0.794   0.484   0.792   0.514   0.781    -       -

Table 5.9: F1 and accuracy results for zero-shot transfer with M-BERT using the reduced datasets.

Table 5.9 displays the results obtained when tuning M-BERT with 3,289 tweets in one language and evaluating it on 1,645 previously unseen tweets in a different language. The highest F1 score is obtained when training in Danish and testing in English (61.8%), followed very closely by the opposite case: training in English and testing in Danish (61.0%). The rest of the language pairs obtain scores between 44.7% (Turkish-English) and 54.1% (Greek-Arabic).


Next, even though EN-BERT was expected to perform poorly in zero-shot learning, the same experiment as before was carried out using EN-BERT, hoping to draw some conclusions from it. The results are summarized in Table 5.10, where it can be seen that, in most cases, only the baseline scores were obtained because the classifier simply assigned every new instance to the majority class. When one of the languages involved was Danish, the baseline obtained by classifying all samples into the majority class (NOT) is 46.5% F1 and 87.1% accuracy. In all other cases, since larger and less imbalanced datasets were used, assigning all predictions to the NOT class gives an F1 score of 44.4% and an accuracy of 80%.

Train \ Test    Danish          English         Greek           Turkish         Arabic
                F1      ACC     F1      ACC     F1      ACC     F1      ACC     F1      ACC
Danish           -       -      0.626   0.860   0.465   0.871   0.470   0.870   0.465   0.871
English         0.579   0.875    -       -      0.444   0.780   0.455   0.801   0.446   0.800
Greek           0.465   0.871   0.444   0.800    -       -      0.444   0.800   0.505   0.593
Turkish         0.465   0.871   0.444   0.800   0.444   0.800    -       -      0.447   0.799
Arabic          0.465   0.871   0.444   0.800   0.554   0.755   0.448   0.792    -       -

Table 5.10: F1 and accuracy results for zero-shot transfer with EN-BERT using the reduced datasets.

5.4 Additional experiments

Having already experimented with word- and subword-level embeddings as input to a CNN, it remained to test the same network with contextualized embeddings. To do so, the dynamic embeddings generated by M-BERT were fed to the CNN. Table 5.11 shows the resulting F1 scores, which range from 75.3% (Danish) to 85.1% (Arabic). The lowest accuracy is 79.0% for English, while the highest is 90.9% for Arabic.

Language   F1     Accuracy
Arabic     0.851  0.909
Danish     0.753  0.903
English    0.765  0.790
Greek      0.798  0.846
Turkish    0.756  0.866

Table 5.11: Results obtained by feeding BERT embeddings to a CNN.

As a final experiment, BERT models further pre-trained on large general-domain monolingual corpora in the target languages were used to see the actual benefits of further pre-training BERT. Table 5.12 shows the F1 and accuracy values obtained by fine-tuning Arabic-BERT (88.8% F1 score), Nordic BERT (81.4%), GreekBERT (83.8%) and BERTurk (81.9%) on the corresponding datasets. These are the best results achieved in the project, outperforming M-BERT and EN-BERT as well as the previous approaches.

Language   Model        F1     Accuracy
Arabic     Arabic-BERT  0.888  0.932
Danish     Nordic BERT  0.814  0.930
Greek      GreekBERT    0.838  0.876
Turkish    BERTurk      0.819  0.892

Table 5.12: Results obtained with further pre-trained BERT models.


6 Analysis and discussion

6.1 Results

The experimentation with some of the most prominent methods in the literature allowed the exploration of a variety of possibilities for the task of offensive language detection. The overall results were predictable to some extent, as the literature overview had already shown what does and does not work in the field. For instance, it was to be expected that deep contextualized language models would obtain better results than traditional machine learning algorithms. However, there are always some particularities worth mentioning. In this section, the results presented in Chapter 5 are discussed and reflected upon.

6.1.1 Traditional machine learning approach

The first experiments with traditional machine learning algorithms showed that there is no clear winner between Support Vector Machine and Random Forest. In all cases they outperformed the Naïve Bayes baseline by around 10% in terms of F1 score, but compared to each other their results were very similar. For example, SVM did better on the Arabic and Turkish datasets, while Random Forest was the best for Danish.

When reducing the dataset sizes to the exact same dimensions as Danish, the smallest one, Table 6.1 shows that SVM is penalized more harshly than Random Forest overall. Only for Arabic did SVM and Random Forest experience a similar loss, and Arabic was also the least affected by the sample reduction when using SVM. The decrease in F1 score for SVM in the other three languages (English, Greek and Turkish) was almost twice as severe as the one suffered by Random Forest. Regarding Naïve Bayes, in all cases the loss was around 10%.

As can be seen in the same table, the accuracy values went slightly up when the datasets were reduced. This means that the models are more biased towards the majority class, because otherwise a positive increment would also be seen in the F1 score. Arabic was clearly the least penalized language, with almost no change in terms of accuracy. Turkish, despite originally being a dataset three times larger than the Greek or English ones, had a similar loss of F1 score in comparison to the other two languages. This suggests that the performance improvement stops growing with the number of samples after a certain threshold is passed.


Language   Model          ΔF1        ΔAccuracy
Arabic     Naïve Bayes    −8.27%     +1.44%
Arabic     Linear SVM     −6.53%     +0.82%
Arabic     Random Forest  −7.14%     +1.60%
English    Naïve Bayes    −10.64%    +15.06%
English    Linear SVM     −14.16%    +11.50%
English    Random Forest  −8.99%     +9.69%
Greek      Naïve Bayes    −9.62%     +8.79%
Greek      Linear SVM     −10.43%    +5.77%
Greek      Random Forest  −5.71%     +5.63%
Turkish    Naïve Bayes    −9.29%     +4.28%
Turkish    Linear SVM     −13.94%    +3.46%
Turkish    Random Forest  −5.26%     +3.95%

Table 6.1: Change in F1 and accuracy when the dataset size is reduced to 3,289 samples.

Regarding the feature engineering experiment, the results show that the sentiment of a tweet does help to obtain better predictions. Only when the three polarity scores (negative, neutral and positive) for emojis were used as features was there a notable decrease in both F1 and accuracy. This might imply that they are uninformative features that confuse the system, whereas using the overall score as a single feature had a positive outcome. DeepMoji and VADER alone also provided a slight improvement when added to the base feature set composed of the TF-IDF matrix alone. The best result was obtained by combining the average of these two features with the emojis' sentiment score.

The addition of lexical features to the model did not produce a significant performance boost. It might be that the chosen features are not informative enough, even though they have been used in previous works [46].

6.1.2 Deep learning approach

In this case, the importance of the dataset size was even more noticeable. Comparing Table 5.4 and Table 5.5, it can be observed that the decrease in F1 score was between 15% and 20% when using fastText. In the case of Word2Vec the decrease differed considerably depending on the language: 22.4% for Turkish, 10.2% for Arabic, 8.4% for Greek and 8.2% for English. The performance decline may be more severe for the Turkish set because it was originally the largest: 35,284 samples, while the other three datasets held around 10,000 each. This suggests that the results are somehow proportional to the amount of data used for training, which reinforces the idea that the performance of deep learning models increases with the amount of data.

This experiment also served to show the importance of class imbalance. In all cases the over-sampling technique contributed positively to the results. With Word2Vec, all languages experienced an increase of between 1.6% (in the case of Greek) and 5.4% (in the case of English). Similarly, with fastText all the numbers improved after over-sampling: 6.0% for Danish, 5.3% for Arabic, 3.9% for Greek, 2.3% for Turkish and 0.3% for English.

However, the main objective of this experiment was to test whether the subword-level information carried by fastText embeddings would allow a better classification of user-generated content than Word2Vec embeddings. Although [42] found that character-based approaches are superior to token-based ones, this was not clearly reflected in the obtained results. Another work found that character-based CNNs perform better for large datasets [90], but their findings could not be replicated in this experiment. One possible cause for this discrepancy is that, even when all the available samples are used, the datasets are not large enough to reach this kind of conclusion.

The ambition when using a CNN was that the non-linearity of the network would lead to superior prediction performance [24]. However, the CNN results were not always superior to those obtained with SVM or Random Forest. One reason might be that the available datasets only contain a few thousand samples, as it was seen that the CNN is penalized harder by the shortage of data than traditional machine learning approaches. It remains to be seen what results a CNN could achieve on a large corpus. Another possibility is that the application of pre-processing steps to the data was counter-productive in the case of the CNN, since the network could have learned to abstract salient details from the unconventional writing style of tweets.

Despite not being directly related to the second research question, the use of the contextualized embeddings provided by BERT's multilingual model (Table 5.11) obtained better results than both Word2Vec and fastText. The only exception was the Danish set, for which fastText obtained slightly better results. The reason for this general improvement might have to do with the fact that BERT's dynamic representations are informed by the surrounding words.

6.1.3 Transfer learning approach

The first results obtained using BERT models (Tables 5.6 and 5.7) explain the success of transfer learning approaches in recent years. Both M-BERT and EN-BERT achieved the best results so far in this work, outperforming the previously tested Naïve Bayes, Random Forest, Support Vector Machine and Convolutional Neural Network.

Once again, the numbers reveal that the task is not equally challenging for all languages, since the collected values differ considerably between them. As usual, Arabic receives the highest score by a safe distance. Danish also has a rather high score considering that it disposes of considerably less data than the other datasets. Having seen the importance of the dataset size in previous experiments, the fact that Danish performs well despite this disadvantage might imply that it would do better than most languages under similar conditions. In the case of the English dataset, the fact that EN-BERT obtained worse results with the original English data than with the translated versions of the Arabic, Danish and Greek sets is very revealing. Even though the translation step should introduce some noise into the process, detection seems to be more reliable for tweets that were originally written in another language.

With regard to the zero-shot cross-lingual transfer experiment, the comparison between M-BERT and EN-BERT makes it clear that the former does learn deep multilingual representations rather than simply memorizing its large vocabulary, as was claimed by a recent study [55]. According to their results on the POS tagging task, M-BERT is able to do zero-shot cross-lingual model transfer even between languages with no lexical overlap at all, although the more overlap the better, because word pieces seen during training will then appear again at test time. The reason is that the embedding of a subword unit that appears in different languages might accommodate information from all the languages where it has been seen. This is reflected in our results obtained with EN-BERT, where the model did not learn anything at all during training for all language pairs other than English-Danish. Only when trained on English and evaluated on Danish, and vice versa, can some of the word pieces learned during training be used to make predictions. In all other cases EN-BERT was simply assigning all tweets, or almost all of them, to the majority class. On the other hand, M-BERT obtained better results for all language pairs, although these were still far from the state-of-the-art. This is possible thanks to the shared multilingual vocabulary of WordPiece tokens, which benefits zero-shot cross-lingual transfer through the overlap of subword units between languages [83].
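
The zero-shot protocol itself can be summarised by the following sketch: fine-tune M-BERT on data from one language and evaluate it on another without any further training. The tiny in-line examples, the hyperparameters and the output directory are placeholders rather than the thesis configuration:

    from sklearn.metrics import f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    train_texts, train_labels = ["you are an idiot", "have a nice day"], [1, 0]   # placeholder English data
    test_texts, test_labels = ["du er en idiot", "hav en god dag"], [1, 0]        # placeholder Danish data

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    def encode(texts, labels):
        enc = tok(texts, truncation=True, padding=True)
        return [{"input_ids": i, "attention_mask": m, "labels": l}
                for i, m, l in zip(enc["input_ids"], enc["attention_mask"], labels)]

    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="mbert-en", num_train_epochs=3),
                      train_dataset=encode(train_texts, train_labels))
    trainer.train()                                      # fine-tuned on the source language only

    predictions = trainer.predict(encode(test_texts, test_labels)).predictions.argmax(axis=-1)
    print(f1_score(test_labels, predictions, average="macro"))   # zero-shot score on the target language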


Hypothesising about why M-BERT does not generalize equally well for all language pairs, the authors of [55] think that the reason has to do with their similarity in terms of syntactic typology: the model deals better with languages that have an analogous ordering of words. Looking at the languages under study (Arabic, Danish, English, Greek and Turkish), the closest should be English and Danish, as they are both Germanic languages. Greek has a different genus, meaning that it is not phylogenetically related to Danish or English, but it also belongs to the Indo-European language family (see Table 6.2). An attempt was made to better interpret the results considering the syntactic characteristics of the five languages used, but no clear conclusions were reached. The addition of languages with different properties to the study could help to find some pattern and reach significant conclusions. In any case, even though it is clear that M-BERT performs better between closely related languages, it is hard to tell which factors influence cross-lingual transfer, since the interaction between subword units in the deep network is not visible to the practitioner.

Language   Family          Genus
Arabic     Afro-Asiatic    Semitic
Danish     Indo-European   Germanic
English    Indo-European   Germanic
Greek      Indo-European   Greek
Turkish    Altaic          Turkic

Table 6.2: Families and Genus of the languages under study. Source: WALS [17].

The last results reported in the previous chapter (Table 5.12) made it clear that BERT models further pre-trained on the target language can easily achieve state-of-the-art results. Even with no pre-processing steps or balancing techniques applied to the data, all the F1 scores obtained by these language-specific models were superior to the rest of the results from this project. This includes the results obtained with M-BERT and EN-BERT, highlighting the importance of training BERT models on the target language. The F1 score increased by 1.3 points on Arabic data (from 87.5% to 88.8%), 4.0 points on Danish (from 77.4% to 81.4%), 3.2 points on Greek (from 80.6% to 83.8%) and 3.5 points on Turkish (from 78.4% to 81.9%). In the following section, the thesis results are compared to the state-of-the-art to put them into context.
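
One common way to obtain such a language-specific model is to continue the masked-language-model pre-training on unlabelled in-language text before fine-tuning. The sketch below outlines this idea with HuggingFace Transformers; the corpus file name and hyperparameters are placeholders, and the models referred to above may have been obtained differently:

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, LineByLineTextDataset,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

    # "danish_tweets.txt" is a placeholder file with one unlabelled in-language text per line.
    dataset = LineByLineTextDataset(tokenizer=tok, file_path="danish_tweets.txt", block_size=128)
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

    trainer = Trainer(model=mlm,
                      args=TrainingArguments(output_dir="mbert-da-mlm", num_train_epochs=1),
                      data_collator=collator,
                      train_dataset=dataset)
    trainer.train()
    trainer.save_model("mbert-da-mlm")   # later loaded as the starting point for classification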

6.1.4 Comparison to state-of-the-art

A final interpretation of the results is done by comparing them to the current state-of-the-art. As OffensEval 2020 is the most recent shared task to date, the results from its top-performing teams (which were published during the development of this project) will be used as a reference. Table 6.3 contains the ten best F1 scores obtained for each dataset of OffensEval's sub-task A. This can be considered an objective and reliable comparison, as all teams had access to the same datasets that were used in this thesis, even though some of them might have augmented them with their own data.

The numerical values displayed in Table 6.3 show that the results presented in this report are particularly good. Except for English, the best results obtained for the other languages are equal to or even above the average of the 10 highest ranked teams. These results were obtained with the further pre-trained BERT models, but even the prediction scores obtained with M-BERT and EN-BERT (Tables 5.6 and 5.7) would earn a place among the top ten ranked submissions. Interestingly enough, the Random Forest implementation would also be among the top ranked teams for the Danish dataset with a 77.7% score. This is quite impressive considering that it is a much simpler model than the ensembles, deep learning and transfer learning techniques that were presumably chosen by most of the teams.


Rank          English   Arabic    Greek     Turkish   Danish
1             0.9223    0.9017    0.8520    0.8258    0.8120
2             0.9204    0.9016    0.8510    0.8167    0.8020
3             0.9198    0.8989    0.8480    0.8141    0.7920
4             0.9187    0.8972    0.8430    0.8101    0.7770
5             0.9166    0.8958    0.8330    0.7967    0.7750
6             0.9162    0.8902    0.8320    0.7933    0.7740
7             0.9151    0.8778    0.8260    0.7859    0.7720
8             0.9146    0.8744    0.8230    0.7815    0.7690
9             0.9139    0.8714    0.8220    0.7790    0.7690
10            0.9136    0.8691    0.8200    0.7789    0.7670

Avg. TOP-10   0.9171    0.8881    0.8350    0.7982    0.7809

Best thesis   0.7915    0.8884    0.8387    0.8193    0.8145

Table 6.3: Official results from the TOP-10 teams in OffensEval 2020 sub-task A.

Also on Danish data, the CNN implementation with fastText embeddings would be among the top 10 participants with its 77.0% score.

The reason why it was not possible to match the best results for the English dataset might be that OffensEval's participants used an assortment of English resources that were not considered for this work. As one of the goals was to compare the models' performance in different languages, no additional pre-processing steps were applied to the English data, to be fair to low-resource languages that do not have as many available tools. Another possibility is that participants used the new English dataset (released in OffensEval 2020) composed of over 9 million tweets, which was not considered for this work. Using such a massive amount of data allows training better models, which might explain the notable improvement in overall results with respect to the previous edition of OffensEval. In fact, according to the results from the 2019 edition (where only the OLID dataset used in this project was provided), the top 10 teams obtained F1 scores around 80%. Even the best result from that year, 82.9%, would be far from being ranked among the top ten in the latest edition. This reinforces the hypothesis that the size of the training set might have something to do with it, especially considering that BERT and similar models were already available when OffensEval 2019 took place.

In any case, at the time of writing, it is only possible to speculate about the approaches taken by other teams, because the system description papers have not been published yet. It will surely be enlightening to read the papers once they are published, as they may reveal the best way to detect offensive language at this time. Comparing the work of all participating teams would have contributed valuable conclusions to this project, but unfortunately this will have to be left as future work.

6.2 Method

In this section, the method described in Chapter 4 is discussed and criticized to highlight the potential consequences for the results. The use and selection of sources is also discussed from a critical point of view.

6.2.1 Self-critical stance

As with most research studies, the design of this work was subject to a few limitations that are worth mentioning. One of the main limitations of the method has to do with the lack of expertise of the author. Considering that some of the techniques and tools used throughout the project were encountered for the first time, it is understandable that one cannot become an expert in every possible aspect. An effort has been made to understand the idea behind each algorithm and the role of each setting, so that all choices are well motivated. However, in some cases the less relevant hyperparameters have been left at their default values due to a lack of knowledge about their actual effect, especially in those models with many free parameters. This makes it hard to claim that the optimal model has been found in every experiment, as there can always exist an alternative configuration that leads to better results. However, the closeness to state-of-the-art results makes this a minor concern.

Another common concern about the method in this type of project is related to the data. The sample size, the data collection process and the quality of the final dataset can all have consequences for the final results. In the case of the OLID dataset, tweets were retrieved from the Twitter Search API using specific keywords that are often present in offensive messages. The fact that most of them are political in nature (e.g. liberals, MAGA, antifa) might make it harder to generalize the results to other domains. Another data limitation that should be borne in mind is that the datasets used had different sizes and percentages of offensiveness, which can affect the results in different ways. For this reason both factors were always considered when analysing the experimental results.

With regard to the evaluation measure, the only concern is that the F1 score used gives the exact same weight to precision and recall. However, in a real-life application, the cost of a false positive should be different from the cost of a false negative, and this should be reflected in the weights used.
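
One way to encode such asymmetric costs, should that ever be needed, is the F-beta score, which weights recall beta times as much as precision. A minimal sketch with scikit-learn follows; the labels are toy values and beta = 2 is chosen only for illustration:

    from sklearn.metrics import f1_score, fbeta_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 1]   # misses two offensive tweets, one false alarm

    print(f1_score(y_true, y_pred))              # precision and recall weighted equally
    print(fbeta_score(y_true, y_pred, beta=2))   # missed offensive tweets penalized harder

Choosing beta, and hence the relative cost of the two error types, would ultimately be a decision for the platform deploying the classifier.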

6.2.2 Replicability, reliability and validity

In order to achieve replicability, an attempt was made to describe the methodology in a high level of detail in Chapter 4. With the provided information, the reader should be able to obtain results similar to those of the experiments performed. As none of the resources used for this project is proprietary, all tests can easily be replicated with the open-source tools presented in Section 4.1. The only caveat is that, over the years, newer versions of the software used may include changes that affect the replicability of this work, but this is of course out of our control.

In terms of reliability, any result obtained from a machine learning algorithm has a certain level of randomness, in the sense that the exact outcome might vary from one run to the next. This is accentuated in the case of small datasets, where the way samples are split into train and test sets can strongly condition the evaluation results. For instance, it was seen that the F1 score in experiments involving the Danish dataset (which contains very few offensive tweets) could vary by up to 5% in consecutive runs, without changing any parameter other than the random seed. To preserve validity in these cases, all results presented in previous chapters are the average of five or ten runs, depending on the total runtime of the code.
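
The averaging procedure can be sketched as follows; the toy data and the Random Forest classifier merely stand in for the actual datasets and models of the thesis:

    import statistics
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Imbalanced toy data, roughly mimicking a dataset with few positive samples.
    X, y = make_classification(n_samples=300, weights=[0.85], random_state=0)

    def run(seed):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        return f1_score(y_te, clf.predict(X_te), average="macro")

    scores = [run(seed) for seed in range(5)]
    print(f"macro F1: {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")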

Regarding the validity of the datasets, one concern could be the gold standards, as it was seen in Section 4.2 that some tweets could have been labelled differently. Despite detecting some of these cases, it was decided not to remove them from the dataset, as doing so would negatively affect the reproducibility of the study. By not modifying the publicly available datasets, it should be possible for a reader to replicate the method on the exact same data and consequently obtain the same results. Also, the fact that the labelling criteria of the datasets have been published by the authors means that it would even be possible to label a new dataset following the same guidelines and still obtain results very similar to the ones presented in this thesis, even if the tweets themselves are not exactly the same.


A final validity concern might be raised by the oversampling method that was occasionally used to balance datasets. Adding copies of samples to the training data is not ideal, as they do not add any new information to the model and make the algorithm prone to overfitting. More complex data augmentation techniques could be used to generate new samples that are not exactly equal to the original ones (e.g. paraphrases or synthetic copies [9]), but this will have to be left for future work. At least, care was taken to always apply oversampling after splitting the data, and only to the training set, to ensure that all samples used for evaluation had never been seen before by the system.
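
A minimal sketch of this precaution, with illustrative toy data rather than the real tweets, is shown below: the split happens first, and only the training portion is oversampled.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    X = np.arange(100).reshape(-1, 1)        # toy features
    y = np.array([1] * 15 + [0] * 85)        # 15% "offensive"

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

    # Duplicate minority-class samples (with replacement) only in the training split.
    minority = X_tr[y_tr == 1]
    extra = resample(minority, n_samples=len(X_tr[y_tr == 0]) - len(minority), random_state=0)
    X_bal = np.vstack([X_tr, extra])
    y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])
    # Train on (X_bal, y_bal); evaluate on the untouched (X_te, y_te).

Techniques such as SMOTE [9] or paraphrasing could replace the simple duplication in the resample step without changing the split-first structure.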

6.2.3 Source criticism

The literature consulted for this project consists mostly of peer-reviewed articles that have been published in reputable journals and conferences. Some sources are preprint versions which therefore have not been peer-reviewed yet, but they are still reliable and high-quality sources (e.g. BERT's original paper). There is also an assortment of publications from well-known workshops and shared tasks, most of them closely related to the topic under study. Some textbooks can also be found in the Bibliography section, although only to reference specific pages. There is no specific book that has been strictly followed during the development of the thesis.

It goes without saying that priority has been given to those papers that have been cited most often. The publication year has been taken into account as well. Because Natural Language Processing is an ever-changing field that evolves at a very high pace, it was important to use the most recent publications whenever possible. Some of the sources cited in the Related Work chapter (Chapter 3) were published a decade ago, but they serve to show how the field has evolved over time and can help the reader to better understand how the current scenario has been reached. Throughout the entire project, an effort was made to identify and reference the original author of any idea or concept.

In general, the cited papers were collected from scientific databases such as ACM¹, SpringerLink² or IEEE Xplore³, among others. The search engine used was Google Scholar⁴ in all cases, as it ranks the results in a way that makes it very easy to identify the most reliable sources. Furthermore, it provides a straightforward way of obtaining BibTeX citations.

For the implementation part, a wide variety of coding websites and online repositories have been consulted repeatedly. However, many of these sources do not appear in the final report, since they are not closely related to the topic and their presence might affect the validity of the study. Only those websites that contain open-source code that has actually been used are mentioned, in the form of footnotes.

Note that in the upcoming section (Section 6.3), where ethical and societal aspects are discussed, some of the sources are not as reliable as the rest, since they are journalistic rather than scientific. However, they are still important to analyse the impact of the work in a wider context.

¹ https://www.acm.org/
² https://link.springer.com/
³ https://ieeexplore.ieee.org/Xplore/home.jsp
⁴ https://scholar.google.com/


6.3 The work in a wider context

Like any scientific study, this thesis must take into consideration the ethical and societal aspects related to the work [62] [86]. News that has come to light in recent times shows that social media can have a considerable impact on our highly connected society [66] [3]. We have seen everything ranging from the use of big data to manipulate elections to the always controversial debate on the right to privacy.

In the case of online offensive language, it is clear that it can emotionally affect the individuals or groups of people to whom it is addressed. Even though the internet is supposed to be a place where everyone is free to openly express their opinion, the 2015 Harassment Survey of the Wikimedia Foundation showed that over half of the victims of online harassment decrease their participation after being attacked⁵. Another study found that 67% of Finnish teenagers had been exposed to hate material and 21% of them acknowledged having been its target [48]. This study from 2014 goes one step further, stating that victims are more likely to be unhappy. Moreover, an offensive post can be the trigger for an online discussion that can easily escalate to the offline world. As a case in point, early identification of hateful users could prevent catastrophic events such as the Pittsburgh synagogue mass shooting of 2018, carried out by an anti-Semitic social media user with a long history of extremist posts⁶.

Social media's potential to generate harm cannot be ignored, especially considering that nowadays social media sites are the only source of information for many people. This means that the misuse of online platforms can very easily lead to the spread of misinformation and hate speech against minority groups. A good example of this was made public in 2018, when Facebook acknowledged that their platform had been used to incite violence in Myanmar [71]. There, the military started an online campaign to incite the murder, rape and forced migration of the Rohingya, a Muslim minority that has been brutally persecuted by Buddhist ultranationalists in recent years. Another example of how 'dangerous speech'⁷ (a particular form of offensive language that promotes violence against a specific group) can have alarming consequences took place in Sri Lanka in 2019. In this case, a single post from a Muslim user (Don't laugh more, 1 day u will cry) motivated attacks by Christians on mosques and Muslim shops that left three people dead. The solution taken by the Sri Lankan government was to temporarily block all social networks until the situation was under control, but this is certainly not enough to prevent similar incidents from happening in the future.

Studies of posts' diffusion dynamics have shown that content generated by hateful users spreads much faster and further than non-hate speech [41]. This is why it is of key importance to detect such behaviour at an early stage, before it is too late. More resources should be invested to prevent this kind of situation, as human rights groups have already demanded. This work is a modest contribution to combating inappropriate behaviour in social media (e.g. racism or xenophobia), to protect not only the most vulnerable members of our society but also regular users who might feel displeased by certain posts, always keeping in mind the fundamental right to free speech.

⁵ https://upload.wikimedia.org/wikipedia/commons/5/52/Harassment_Survey_2015_-_Results_Report.pdf

⁶ www.nytimes.com/2018/10/27/us/robert-bowers-pittsburgh-synagogue-shooter.html
⁷ dangerousspeech.org/


7 Conclusion

This final chapter reveals to what extent the aim has been achieved by explicitly answering the research questions and mentioning some of the project contributions. Finally, a discussion of possible future work shows what could be done to benefit the field in different ways.

7.1 Summary and critical reflection

This study investigated different approaches to automatically detect offensive content on the micro-blogging platform Twitter. The aim was to build various supervised learning classifiers that would help to answer the four research questions (see Section 1.3) that guided the work described in this thesis.

In order to answer the first research question, an attempt was made to extract emotions from tweets in different ways and use them as features to see their impact on the performance results. It was seen that sentiment analysis can be used for more accurate predictions, since the results were slightly better when such features were incorporated into the system.

The second research question aimed to gain some insight into the impact of using embeddings at the subword level instead of the word level. For that, the same Convolutional Neural Network was fed with Word2Vec and fastText embeddings using all the available datasets. However, no clear conclusions could be drawn in this case, since neither of them clearly prevailed over the other, obtaining similar results in both cases. What did produce better results was the use of contextualized embeddings generated by BERT's multilingual model.

For the third research question, the multilingual version of BERT was compared to its monolingual counterpart to analyse the cross-lingual capabilities of the former. The results of the zero-shot learning experiments proved that multilingual models are able to generalize across languages, even though it is not an effective methodology, as reflected by results that were far from the state-of-the-art.

The last research question was about comparing the experimental results obtained for each dataset to see if there were significant differences between languages, and it was seen that the language of the post does matter. Among the analysed languages, predictions were always more accurate for Arabic, and less reliable for other languages like Turkish. An effort was made to compare languages under equal conditions whenever possible.

Furthermore, despite not being explicitly present in any of the research questions, the effect of the dataset size and the amount of class imbalance was taken into account when performing the experiments and analysing the results. It was seen that both characteristics of the dataset have a significant impact on the final performance of the model, and that they can be largely overcome with data enrichment and oversampling techniques, respectively.

In summary, it was seen that the performance of the model depends on a large number of factors, including the language, dataset size, type of embedding and class imbalance. Satisfactory results were obtained, reaching the state-of-the-art in many cases, and it was possible to answer all the research questions. Thus, it can be said that the aim of the thesis was fulfilled.

7.2 Future work

Even though this project achieved satisfactory results, there is always room for improvement. At this point, thanks to the insight gained during the development of the work, it is possible to name a list of ideas that could not be explored, mainly due to time constraints.

The intent of this thesis was never to test all the available methods that are supposed to perform well for the offensive tweet detection task. But, as future work, it might be interesting to add other popular models like BiLSTMs or RNNs to the results comparison. Also, combining several models into an ensemble with plurality voting or a gradient boosting scheme can be assumed to provide promising results as well.

One possible way of improving the results would be to combine the OffensEval datasets using machine translation. It might also be interesting to explore the possibility of adding other publicly available datasets, even if the gold standard labels are not exactly the same. This might be especially beneficial for deep learning approaches, as they tend to perform better with more data. Also, for the class imbalance problem, back-and-forth translation could be used to enrich the datasets with paraphrased versions of the original tweets instead of simple repetition, as sketched below.
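
One possible realization of this idea, sketched here under the assumption that the OPUS-MT models on the HuggingFace Hub (Helsinki-NLP/opus-mt-*) are used, translates a tweet into a pivot language and back; the example sentence is invented and English-Danish is only one of several possible language pairs:

    from transformers import MarianMTModel, MarianTokenizer

    def translate(texts, model_name):
        tok = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        return tok.batch_decode(generated, skip_special_tokens=True)

    originals = ["This is a perfectly harmless example tweet."]
    pivot = translate(originals, "Helsinki-NLP/opus-mt-en-da")        # English -> Danish
    paraphrases = translate(pivot, "Helsinki-NLP/opus-mt-da-en")      # Danish -> English
    print(paraphrases)   # keeps the original label, but with slightly different wording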

Regarding sentence embedding features, it is also an option to further explore ways to combine them, such as averaging or concatenation. Moreover, instead of loading pre-trained word vectors, training our own on domain-specific data might reduce the number of out-of-vocabulary words and enhance the overall results.

In another line of research, there is a need for further investigation into languages other than English. As usually happens, most of the success in the field is focused on a few popular languages. The scarcity of manually crafted linguistic resources for low-resource languages obstructs the pre-processing phase, which is an important part of the process even when using multilingual language models. Things as basic as a spelling correction algorithm or a stopword list are not so easy to find for certain languages, and they would certainly be helpful for this kind of task, which deals with user-generated content.

Also in the future, a more complex approach could extract meaningful features from a tweet's context. This would include information about the author (e.g. user history and demographics), related tweets (e.g. preceding comments and replies), non-textual data (e.g. images and videos), and so on. It is reasonable to think that all this additional information would help to further improve the system's predictions.


Bibliography

[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. "Tensorflow: A system for large-scale machine learning". In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.

[2] Charu C Aggarwal and ChengXiang Zhai. "A survey of text classification algorithms". In: Mining text data. Springer, 2012, pp. 163–222.

[3] Jacob Amedie. "The impact of social media on society". In: (2015).

[4] Mariette Awad and Rahul Khanna. "Support vector machines for classification". In: Efficient Learning Machines. Springer, 2015, pp. 39–66.

[5] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. "Deep learning for hate speech detection in tweets". In: Proceedings of the 26th International Conference on World Wide Web Companion. 2017, pp. 759–760.

[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model". In: Journal of machine learning research 3.Feb (2003), pp. 1137–1155.

[7] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. "A training algorithm for optimal margin classifiers". In: Proceedings of the fifth annual workshop on Computational learning theory. 1992, pp. 144–152.

[8] Pete Burnap and Matthew L Williams. "Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making". In: Policy & Internet 7.2 (2015), pp. 223–242.

[9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. "SMOTE: synthetic minority over-sampling technique". In: Journal of artificial intelligence research 16 (2002), pp. 321–357.

[10] Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. "Detecting offensive language in social media to protect adolescent online safety". In: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing. IEEE. 2012, pp. 71–80.


[11] Çağrı Çöltekin. "A Corpus of Turkish Offensive Language on Social Media". In: Proceedings of the 12th International Conference on Language Resources and Evaluation. ELRA. 2020.

[12] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. "Very deep convolutional networks for text classification". In: arXiv preprint arXiv:1606.01781 (2016).

[13] Maral Dadvar, FMG de Jong, Roeland Ordelman, and Dolf Trieschnigg. "Improved cyberbullying detection using gender information". In: Proceedings of the Twelfth Dutch-Belgian Information Retrieval Workshop (DIR 2012). University of Ghent. 2012.

[14] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. "Automated hate speech detection and the problem of offensive language". In: Eleventh international AAAI conference on web and social media. 2017.

[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).

[16] Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. "Hate speech detection with comment embeddings". In: Proceedings of the 24th international conference on world wide web. 2015, pp. 29–30.

[17] Matthew S Dryer and Martin Haspelmath. "The world atlas of language structures online". In: (2013).

[18] Kevin Durkin and Jocelyn Manning. "Polysemy and the subjective lexicon: Semantic relatedness and the salience of intraword senses". In: Journal of Psycholinguistic Research 18.6 (1989), pp. 577–612.

[19] Paul Ekman. "Are there basic emotions?" In: (1992).

[20] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm". In: Conference on Empirical Methods in Natural Language Processing (EMNLP). 2017.

[21] Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. "Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene". In: Proceedings of the first workshop on abusive language online. 2017, pp. 46–51.

[22] Björn Gambäck and Utpal Kumar Sikdar. "Using convolutional neural networks to classify hate-speech". In: Proceedings of the first workshop on abusive language online. 2017, pp. 85–90.

[23] Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. "A lexicon-based approach for hate speech detection". In: International Journal of Multimedia and Ubiquitous Engineering 10.4 (2015), pp. 215–230.

[24] Yoav Goldberg. "A primer on neural network models for natural language processing". In: Journal of Artificial Intelligence Research 57 (2016), pp. 145–152.

[25] Jeremy Howard and Sebastian Ruder. "Universal language model fine-tuning for text classification". In: arXiv preprint arXiv:1801.06146 (2018).

[26] Clayton J Hutto and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text". In: Eighth international AAAI conference on weblogs and social media. 2014.

[27] Thorsten Joachims. "Text categorization with support vector machines: Learning with many relevant features". In: European conference on machine learning. Springer. 1998, pp. 137–142.

[28] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. "Bag of tricks for efficient text classification". In: arXiv preprint arXiv:1607.01759 (2016).


[29] Yoon Kim. "Convolutional neural networks for sentence classification". In: arXiv preprint arXiv:1408.5882 (2014).

[30] Diederik P Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).

[31] Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. "Twitter sentiment analysis: The good the bad and the omg!" In: Fifth International AAAI conference on weblogs and social media. 2011.

[32] Ritesh Kumar, Atul Kr Ojha, Shervin Malmasi, and Marcos Zampieri. "Benchmarking aggression identification in social media". In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). 2018, pp. 1–11.

[33] Ritesh Kumar, Atul Kr Ojha, Marcos Zampieri, and Shervin Malmasi. "Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)". In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). 2018.

[34] Irene Kwok and Yuzhou Wang. "Locate the hate: Detecting tweets against blacks". In: Twenty-seventh AAAI conference on artificial intelligence. 2013.

[35] Guillaume Lample and Alexis Conneau. "Cross-lingual language model pretraining". In: arXiv preprint arXiv:1901.07291 (2019).

[36] Batia Laufer. "What's in a word that makes it hard or easy: some intralexical factors that affect the learning of words". In: Vocabulary: Description, acquisition and pedagogy. Cambridge university press, 1997, pp. 140–155.

[37] Quoc Le and Tomas Mikolov. "Distributed representations of sentences and documents". In: International conference on machine learning. 2014, pp. 1188–1196.

[38] Els Lefever, Bart Desmet, and Guy De Pauw. "TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety: Proceedings". In: TA-COS 2018 – 2nd Workshop on Text Analytics for Cybersecurity and Online Safety, collocated with LREC 2018, 11th edition of the Language Resources and Evaluation Conference. European Language Resources Association (ELRA). 2018.

[39] Bing Liu. Sentiment analysis and opinion mining. Morgan & Claypool Publishers, 2012.

[40] Shervin Malmasi and Marcos Zampieri. "Challenges in discriminating profanity from hate speech". In: Journal of Experimental & Theoretical Artificial Intelligence 30.2 (2018), pp. 187–202.

[41] Binny Mathew, Ritam Dutt, Pawan Goyal, and Animesh Mukherjee. "Spread of Hate Speech in Online Social Media". In: Proceedings of the 10th ACM Conference on Web Science. WebSci '19. Boston, Massachusetts, USA: Association for Computing Machinery, 2019, pp. 173–182. ISBN: 9781450362023.

[42] Yashar Mehdad and Joel Tetreault. "Do characters abuse more than words?" In: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2016, pp. 299–303.

[43] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space". In: arXiv preprint arXiv:1301.3781 (2013).

[44] Hamdy Mubarak, Kareem Darwish, and Walid Magdy. "Abusive language detection on Arabic social media". In: Proceedings of the First Workshop on Abusive Language Online. 2017, pp. 52–56.

[45] Hamdy Mubarak, Ammar Rashed, Kareem Darwish, Younes Samih, and Ahmed Abdelali. "Arabic Offensive Language on Twitter: Analysis and Experiments". In: arXiv preprint arXiv:2004.02192 (2020).


[46] Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. "Abusive language detection in online user content". In: Proceedings of the 25th international conference on world wide web. 2016, pp. 145–153.

[47] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. "Sentiment of emojis". In: PloS one 10.12 (2015), e0144296.

[48] Atte Oksanen, James Hawdon, Emma Holkeri, Matti Näsi, and Pekka Räsänen. "Exposure to online hate among young social media users". In: Sociological studies of children & youth 18.1 (2014), pp. 253–273.

[49] Marc Pàmies, Emily Öhman, Kaisla Kajava, and Jörg Tiedemann. "LT@Helsinki at SemEval-2020 Task 12: Multilingual or language-specific BERT?" In: Proceedings of the 14th International Workshop on Semantic Evaluation. Unpublished. 2020.

[50] Ji Ho Park and Pascale Fung. "One-step and two-step classification for abusive language detection on twitter". In: arXiv preprint arXiv:1706.01206 (2017).

[51] John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. "Deep Learning for User Comment Moderation". In: ACL 2017 (2017), p. 25.

[52] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. "Scikit-learn: Machine learning in Python". In: Journal of machine learning research 12.Oct (2011), pp. 2825–2830.

[53] Jeffrey Pennington, Richard Socher, and Christopher D Manning. "Glove: Global vectors for word representation". In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543.

[54] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. "Deep contextualized word representations". In: arXiv preprint arXiv:1802.05365 (2018).

[55] Telmo Pires, Eva Schlinger, and Dan Garrette. "How multilingual is Multilingual BERT?" In: arXiv preprint arXiv:1906.01502 (2019).

[56] Zeses Pitenis, Marcos Zampieri, and Tharindu Ranasinghe. "Offensive Language Identification in Greek". In: Proceedings of the 12th Language Resources and Evaluation Conference. ELRA. 2020.

[57] Robert Plutchik. "A general psychoevolutionary theory of emotion". In: Theories of emotion. Elsevier, 1980, pp. 3–33.

[58] Martin F Porter et al. "An algorithm for suffix stripping." In: Program 14.3 (1980), pp. 130–137.

[59] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training". In: URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

[60] Guy Rosen. Community Standards Enforcement Report, November 2019 Edition. Facebook, Inc., 2019.

[61] Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, and Preslav Nakov. "A Large-Scale Semi-Supervised Dataset for Offensive Language Identification". In: arXiv. 2020.

[62] Per Runeson and Martin Höst. "Guidelines for conducting and reporting case study research in software engineering". In: Empirical software engineering 14.2 (2009), p. 131.

[63] Gerard Salton and Christopher Buckley. "Term-weighting approaches in automatic text retrieval". In: Information processing & management 24.5 (1988), pp. 513–523.


[64] Anna Schmidt and Michael Wiegand. "A survey on hate speech detection using natural language processing". In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. 2017, pp. 1–10.

[65] Mike Schuster and Kaisuke Nakajima. "Japanese and korean voice search". In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2012, pp. 5149–5152.

[66] Shabnoor Siddiqui, Tajinder Singh, et al. "Social media its impact with positive and negative aspects". In: International Journal of Computer Applications Technology and Research 5.2 (2016), pp. 71–75.

[67] Gudbjartur Ingi Sigurbergsson and Leon Derczynski. "Offensive Language and Hate Speech Detection for Danish". In: Proceedings of the 12th Language Resources and Evaluation Conference. ELRA. 2020.

[68] Sara Owsley Sood, Elizabeth F Churchill, and Judd Antin. "Automatic identification of personal insults on social news sites". In: Journal of the American Society for Information Science and Technology 63.2 (2012), pp. 270–285.

[69] Wiktor Soral, Michał Bilewicz, and Mikołaj Winiewski. "Exposure to hate speech increases prejudice through desensitization". In: Aggressive behavior 44.2 (2018), pp. 136–146.

[70] K Starmer. "Guidelines on prosecuting cases involving communications sent via social media". In: London, GB: Crown Prosecution Service 25 (2013), p. 39.

[71] Alexandra Stevenson. "Facebook admits it was used to incite violence in Myanmar". In: The New York Times 6 (2018).

[72] Hui-Po Su, Zhen-Jie Huang, Hao-Tsung Chang, and Chuan-Jie Lin. "Rephrasing profanity in chinese text". In: Proceedings of the First Workshop on Abusive Language Online. 2017, pp. 18–24.

[73] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. "How to fine-tune BERT for text classification?" In: China National Conference on Chinese Computational Linguistics. Springer. 2019, pp. 194–206.

[74] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. "Classification of imbalanced data: A review". In: International journal of pattern recognition and artificial intelligence 23.04 (2009), pp. 687–719.

[75] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need". In: Advances in neural information processing systems. 2017, pp. 5998–6008.

[76] Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P Sheth. "Cursing in english on twitter". In: Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 2014, pp. 415–425.

[77] William Warner and Julia Hirschberg. "Detecting hate speech on the world wide web". In: Proceedings of the second workshop on language in social media. Association for Computational Linguistics. 2012, pp. 19–26.

[78] Zeerak Waseem. "Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter". In: Proceedings of the first workshop on NLP and computational social science. 2016, pp. 138–142.

[79] Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault. "Proceedings of the First Workshop on Abusive Language Online". In: Proceedings of the First Workshop on Abusive Language Online. 2017.

[80] Zeerak Waseem and Dirk Hovy. "Hateful symbols or hateful people? predictive features for hate speech detection on twitter". In: Proceedings of the NAACL student research workshop. 2016, pp. 88–93.


[81] Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. "Overview of the GermEval 2018 shared task on the identification of offensive language". In: (2018).

[82] Thomas Wolf, L Debut, V Sanh, J Chaumond, C Delangue, A Moi, P Cistac, T Rault, R Louf, M Funtowicz, et al. "Huggingface's transformers: State-of-the-art natural language processing". In: ArXiv, abs/1910.03771 (2019).

[83] Shijie Wu and Mark Dredze. "Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT". In: arXiv preprint arXiv:1904.09077 (2019).

[84] David Yarowsky. "Unsupervised word sense disambiguation rivaling supervised methods". In: 33rd annual meeting of the association for computational linguistics. 1995, pp. 189–196.

[85] Dawei Yin, Zhenzhen Xue, Liangjie Hong, Brian D Davison, April Kontostathis, and Lynne Edwards. "Detection of harassment on web 2.0". In: Proceedings of the Content Analysis in the WEB 2 (2009), pp. 1–7.

[86] Robert K Yin. Case study research and applications: Design and methods. Sage Publications, 2017.

[87] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. "Predicting the Type and Target of Offensive Posts in Social Media". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2019, pp. 1415–1420.

[88] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. "SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)". In: Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval). 2019.

[89] Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. "SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)". In: Proceedings of SemEval. 2020.

[90] Xiang Zhang, Junbo Zhao, and Yann LeCun. "Character-level convolutional networks for text classification". In: Advances in neural information processing systems. 2015, pp. 649–657.

[91] Ye Zhang and Byron Wallace. "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification". In: arXiv preprint arXiv:1510.03820 (2015).

[92] Ziqi Zhang, David Robinson, and Jonathan Tepper. "Detecting hate speech on twitter using a convolution-gru based deep neural network". In: European semantic web conference. Springer. 2018, pp. 745–760.

[93] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books". In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 19–27.
