a new approach for group spam detection in social media...

4
AbstractSocial media is the new upcoming logical marketing field. Nowadays, social media has become an appreciated data source. Data collected from Twitter, for instance, can be used to conclude people’s opinions for marketing or social studies. However, group spammers that aim to promote or damage certain product for the sake of making unjustified profit is a real challenge. In this study, we proposed a new approach to detect group spammers in social media from Arabic text. The proposed approach “AGSD” uses Apache Lucene with Nutch and Solr open sources to collect Arabic tweets. Then, the suspected group spammers can be detected by applying both a group and individual spam detection techniques. KeywordsFaked Review, Fake Reviewer Group, Group Opinion Spamming, Opinion Spamming I. INTRODUCTION OCIAL media is the new upcoming logical marketing field. Nowadays, social media has become an appreciated data source. Data collected from Twitter, for instance, can be used to conclude people’s opinions for marketing or social studies [18]. In real world scenario, service providers and customers are always aware of other people’s opinions about the quality of products. The huge textual data from social media can be analyzed and used to support decision making. The extracted opinions from social media may seriously affect the sales of products or services if the reliability of the opinions is not considered and can be misused by any organizations [14]. For example, in 2009, a manager at Belkin paid for writing positive reviews on a poor rating router [7]. The main advantage of social media is that it permits anyone to express his/her opinions without the need for disclosing his/her true identity. This anonymity encourages people with bad intentions to post misleading faked opinions about target organizations, products, or services. Those people are called opinion spammers and their actions are called opinion spamming [13]. To get reliable results of opinion mining in social media, these faked opinions must be detected and removed. However, with the increasing usage of social media, opinion spamming will become more and more extensive, Nujud Aloshban 1 is with the Imam Muhammad ibn Saud University, Riyadh, Saudi Arabia. Hmood Aldossari 2 , Chairman of Information Systems Department in King Saud University, Riyadh, Saudi Arabia. sophisticated, and harder to detect. Consequently, several detection algorithms are proposed [11]. The main purpose of opinion spam detection is to detect spamming faked reviews, fake reviewer, and fake reviewer group. Detecting one type of spam can help detect other spams. However, each type has its own specific features and different methods of detection. .Some people are enticed by others to write faked reviews. For example, businesses give discounts in exchange of writing faked positive reviews [11]. Opinion or review spams are the reviews that are irrelevant to the reviewed product, and the purpose of these reviews is to fake the product’s quality to mislead the customers [6]. The motivation of this spam is make more profit by writing untrustworthy reviews and deceitful ratings [19]. Group spamming is very harmful because a number of members in a group can control the sentiment on an entity, thus they can totally mislead the social media. Moreover, group spammers work in conspire together to promote or to damage the reputation of a target entity. They may be individuals who know each other or a single who is registered with multiple user IDs and is called as Sockpuppeting. Group spamming has some particular features that can be detected and analyzed. However, identifying group spammers is a real challenge and hard to detect. [3], [4]. The aim of this study is to find out the group of spammers in social media from Arabic text. This is not because of its unique nature that makes it different from Latin languages but also due to widespread of usage on the social media. It is highlighted that using Arabic in the social media is three folds that is given in English [15]. The rest of the paper is organized as follow. Section II reviews the literature of spam detection. Section III presents the proposed framework. Section IV discussion. Finally Section V concludes the paper. II. RELATED WORK Spam detection can be classified into three groups: review spam, review spammers, and group of spammer’s detection [12]. While many researches have been conducted to detect the first two groups, the third group has a little attention and needs more exploration. Several researches in the literature have been proposed for review spam detection. Lin et al., [12] analyzed the features of faked reviews. They proposed six-time-sensitive characteristics to distinguish the faked reviews based on A New Approach for Group Spam Detection in Social Media for Arabic Language (AGSD) Nujud Aloshban 1 , and Hmood Aldossari 2 S 8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE) http://dx.doi.org/10.15242/IIE.E0516011 20

Upload: others

Post on 13-Apr-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Abstract— Social media is the new upcoming logical marketing

field. Nowadays, social media has become an appreciated data

source. Data collected from Twitter, for instance, can be used to

conclude people’s opinions for marketing or social studies. However,

group spammers that aim to promote or damage certain product for

the sake of making unjustified profit is a real challenge. In this study,

we proposed a new approach to detect group spammers in social

media from Arabic text. The proposed approach “AGSD” uses

Apache Lucene with Nutch and Solr open sources to collect Arabic

tweets. Then, the suspected group spammers can be detected by

applying both a group and individual spam detection techniques.

Keywords— Faked Review, Fake Reviewer Group, Group

Opinion Spamming, Opinion Spamming

I. INTRODUCTION

OCIAL media is the new upcoming logical marketing

field. Nowadays, social media has become an appreciated

data source. Data collected from Twitter, for instance, can

be used to conclude people’s opinions for marketing or social

studies [18].

In real world scenario, service providers and customers are

always aware of other people’s opinions about the quality of

products. The huge textual data from social media can be

analyzed and used to support decision making. The extracted

opinions from social media may seriously affect the sales of

products or services if the reliability of the opinions is not

considered and can be misused by any organizations [14]. For

example, in 2009, a manager at Belkin paid for writing

positive reviews on a poor rating router [7].

The main advantage of social media is that it permits anyone

to express his/her opinions without the need for disclosing

his/her true identity. This anonymity encourages people with

bad intentions to post misleading faked opinions about target

organizations, products, or services. Those people are called

opinion spammers and their actions are called opinion

spamming [13]. To get reliable results of opinion mining in

social media, these faked opinions must be detected and

removed. However, with the increasing usage of social media,

opinion spamming will become more and more extensive,

Nujud Aloshban1 is with the Imam Muhammad ibn Saud University,

Riyadh, Saudi Arabia.

Hmood Aldossari2, Chairman of Information Systems Department in King

Saud University, Riyadh, Saudi Arabia.

sophisticated, and harder to detect. Consequently, several

detection algorithms are proposed [11].

The main purpose of opinion spam detection is to detect

spamming faked reviews, fake reviewer, and fake reviewer

group. Detecting one type of spam can help detect other

spams. However, each type has its own specific features and

different methods of detection. .Some people are enticed by

others to write faked reviews. For example, businesses give

discounts in exchange of writing faked positive reviews [11].

Opinion or review spams are the reviews that are irrelevant

to the reviewed product, and the purpose of these reviews is to

fake the product’s quality to mislead the customers [6]. The

motivation of this spam is make more profit by writing

untrustworthy reviews and deceitful ratings [19].

Group spamming is very harmful because a number of

members in a group can control the sentiment on an entity,

thus they can totally mislead the social media. Moreover,

group spammers work in conspire together to promote or to

damage the reputation of a target entity. They may be

individuals who know each other or a single who is registered

with multiple user IDs and is called as Sockpuppeting. Group

spamming has some particular features that can be detected

and analyzed. However, identifying group spammers is a real

challenge and hard to detect. [3], [4].

The aim of this study is to find out the group of spammers in

social media from Arabic text. This is not because of its

unique nature that makes it different from Latin languages but

also due to widespread of usage on the social media. It is

highlighted that using Arabic in the social media is three folds

that is given in English [15].

The rest of the paper is organized as follow. Section II

reviews the literature of spam detection. Section III presents

the proposed framework. Section IV discussion. Finally

Section V concludes the paper.

II. RELATED WORK

Spam detection can be classified into three groups: review

spam, review spammers, and group of spammer’s detection

[12]. While many researches have been conducted to detect the

first two groups, the third group has a little attention and needs

more exploration.

Several researches in the literature have been proposed for

review spam detection. Lin et al., [12] analyzed the features of

faked reviews. They proposed six-time-sensitive

characteristics to distinguish the faked reviews based on

A New Approach for Group Spam Detection in

Social Media for Arabic Language (AGSD)

Nujud Aloshban1, and Hmood Aldossari

2

S

8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)

http://dx.doi.org/10.15242/IIE.E0516011 20

reviewer behaviors and review contents. They suggested a

supervised solution and a threshold-based solution to early

identify the faked reviews. Xie et al., [16] observed that

spammer’s attacks usually burst and had either positive or

negative association to the rating. They claimed that to detect

spams oddly correlated temporal patterns were to be detected.

The authors constructed multidimensional time series on

aggregate statistic to detect these patterns and developed a

hierarchical algorithm to detect the time windows robustly.

Their experimental results have showed that the proposed

method was effective in reviewing spams detection.

Many researches were proposed to detect spammers. Li et

al. [8] proposed two semi-supervised methods that can be

applied on a large amount of unlabeled data based on machine

learning algorithms to identify the spammers. Their results

proved that the proposed method was effective in detecting

spammers.

To the best of our knowledge, only one study was conducted

by Mukherjee et al. [4] to detect group of spammers. This

study proposed a new algorithm that based on unsupervised

machine learning algorithms. The algorithm consists of three

steps: frequent pattern mining, identification of a set of feature

indicators and ranking. Frequent pattern mining detects

repeated patterns to find the group of spammers who targeted

the same set of products and services. Classification uses a set

of criteria including group review content similarity, group

rating deviation, writing reviews together in a short time

window, writing reviews right after the product launch, and

many others to classify the detected groups as group

spammers. Ranking is performed procedure using relational

model steps [3], [4]. The proposed algorithm was conducted

on Latin language as well as implemented as test bed using

WEKA to calculate the results without complete

implementation.

Wahsheh et al. [9] proposed Arabic opinions spam detection

system (SPAR) based on reviews that are posted through

Maktoob-Yahoo social network. SPAR is a spam detection

system that can read opinions in Arabic language and

categorize them as: spam or non-spam. Using special metrics,

SPAR can detect the level of spam for each spam opinion.

Additionally, SPAR classifies each non-spam opinion into

three classes; positive, negative, and neutral using support

vector machine classifier. Their experiments have proved that

SPAR was effective to detect Arabic opinions spam. Hammad

[1] applied Support Vector Machine (SVM), Naive Bayes

(NB) and K-Nearest Neighbor (KNN) were used to classify

the Arabic opinion reviews into spam or non-spam classes.

The study divided spams into: reviews about brands only, non-

reviews, irrelevant reviews, and general reviews. Moreover,

the inconsistency between review comments and the review

rate is examined.

In this study, we extend Mukherjee’s algorithm [4] to detect

group of spammers in social media from Arabic text.

III. FRAMEWORK

In this paper, we propose AGSD (Arabic Group Spammers

Detection) to identify and detect group spammers in social

media from Arabic text. It is based on several open sources to

deal with the Arabic texts written in the social media. AGSD

has four main phases: crawling, preprocessing, spamming

activities detection, and Individual member behavior scanning.

A. Crawling and Indexing

The proposed AGSD is based on Apache Lucene, Nutch

and Solr open sources to build a complete search engine.

Lucene is not a complete search engine but an indexing library

that is used to generate inverted index. To index web

documents, Lucene needs a crawler to collect web documents

such as Apache Nutch [2] .Nutch works under the Hadoop

framework and it supports cluster capabilities, distributed

computation, using MapReduce and a distributed file system

which makes it scalable and cost effective [5]. To improve the

system implementation, Apache Solr are used to give indexing

and searching abilities based on Lucene library [2].The

advantage of Nutch and Solr is that it can handle big data

effectively. Finally, PHP web search page created in order to

make the decision maker or any users able to search for any

target product (see Fig. 1).

Web crawlers, are systems that collect a corpus of web

pages for web indexing and querying. Web indexing, is index

processing of already downloaded pages by crawler so that

users allow to make queries.

In AGSD, the Apache Nutch web crawler is used and it is

integrated with Apache Solr. Apache Nutch is configured to

crawl the twitter website only. The AGSD crawls all the tweets

in twitter website, stores and indexes the details of some web

pages. Solr stores the important information related to each

URL such as title, URL, content of the page, anchors to other

pages, metatag description and metatag keywords. After

storing and preprocessing, the data is ready to be searched

using Solr .The next subsection will describe the preprocessing

phase.

Fig. 1 Search Page

8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)

http://dx.doi.org/10.15242/IIE.E0516011 21

Fig. 2 Solr Libraries Based on Lucene Libraries [10]

B. Preprocessing

Solr works on documents, where each document consists of

searchable fields such as id, title, content, anchor and URL.

The rules for searching each field are shaped using field type

definitions such as string, long, URL, and Arabic text. A field

type definition describes the analyzers, tokenizes and filters

which control searching behavior [10]. We have two

analyzers, one is for indexing time and the other is for

querying time. When a document is added or updated, its

fields will be analyzed and tokenized, and those tokens are

stored in the Solr’s index. When a query is issued, the query

text is again analyzed, tokenized and then matched against

tokens in the Solr’s index. This critical function of

tokenization is handled by Tokenizer components. Also, there

are TokenFilter components which modify the token stream

[10].

Solr used the following libraries that are based on Lucene

libraries as seen in Fig. 2 .The libraries that used in this study

are:

WhitespaceTokenizerFactory: it determines which tokens of

characters are separated by splitting on whitespace [17].

WordDelimiterFilterFactory: It splits words into sub-words

and performs optional transformations on sub-word groups.

One use for WordDelimiterFilter is to help match words

with different delimiters [17].

SynonymFilterFactory: It matches strings of tokens and

replaces them with other strings of tokens using an external

file to define the synonyms. For this reason, an Arabic

synonyms file is manually made by writing synonyms from

an Arabic dictionary (around 200 Arabic synonyms) as

shown in Fig. 3. When expand property is true, a synonym

will be expanded to all equivalent synonyms during query

time. E.g. = طباخطاهي [17].

StopFilterFactory: Discards common words by using

external stop-words file, which is made specifically for

Arabic language. E.g.: "حتى" and "[17] "كان.

ArabicNormalizationFilterFactory. It makes normalization

for efficiency operating on an Arabic term buffer. For

example: normalization أ to [17] ا.

ArabicAnalyzer. It implements light-stemming as specified

by ArabicStemFilter .For instance, removal of an attached

definite article, conjunction, and prepositions. In addition

to stemming of common suffixes [17].

RemoveDuplicatesTokenFilterFactory. It considers any

tokens in the same logical position in the token stream as

previous tokens and remove them. [17].

Fig. 3 Arabic Synonyms

C. Spamming Activities Detection

The spamming activities that may indicate a spam which

identified by Mukherjee et al. (2012) are:

1. Group Time Window (GTW)

This rule detects group members who post tweet on a

particular entity during a short time interval, which is

considered a spamming activity. If there are tweets on the

searched entity that posted in the same day more than π tweets

(π is a constant value), these tweets of that day are suspected

tweets. Consequently, their writers are suspected spammers of

this entity.

2. Group Content Similarity (GCS)

The similarity of the tweet’s content is to be checked against

all other tweets on the searched entity. If the similarity is more

than β (β is a constant value that represents similarity score),

then these tweets are suspected to be spam and their writers are

suspected to be spammers.

3. Group Early Time Frame (GETF)

This indicator detects if a group members are the first

people to post tweet on entity. The exact date of launching

each entity is not available. Consequently, the first review of

this entity is considered its launching date. If the tweet date is

smaller than α + λ (α is the earliest tweet date, and λ is a

constant value that represents the period of the early tweets),

then these tweets are suspected to be spam and their writers are

suspected to be spammers

D. Individual Members' Behaviors Scanning

The individual members' behavior is monitored to extract

the suspected spammers [4]. The tweets' writers of the

previous stage (in section C), are the only ones to be checked

again in this stage. The revealed writers of this stage (stage D)

are considered the suspected group of spammers of a definite

identity.

8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)

http://dx.doi.org/10.15242/IIE.E0516011 22

1. Individual Content Similarity (ICS)

Individual spammers may review a product many times

posting duplicate or near duplicate reviews to increase the

product popularity. ICS models the activity of a particular

reviewer to examine all his/her tweets on a product to find out

if there are duplicated or near duplicated tweets on this

particular entity.

2. Individual Early Time Frame (IETF)

Like GETF, this checks if a reviewer is one of the reviewers

who made the tweet during the early time of lunching a

product.

3. Individual Member Coupling in a Group (IMC)

IMC detects if members worked with other members of the

group in posting faked reviews at the same date about the same

product. This is considered tightly coupled members which

indicate group spamming.

IV. DISCUSSION

This research is still at an early stage and many issues need

to be considered. For example, how to evaluate the proposed

approach is a challenge and needs more explanation. In our

context, a keyword about a given product can be entered to

retrieve relevant tweets about the target product. For example,

when searching for شركة االتصاالت, the proposed system obtains

all the tweets relevant to the subject. The potential group of

spammers who tweets on this entity is identified by applying

group and individual indicators as explained in the previous

sections.

To evaluate the proposed model, we intend to write some

fake and genuine tweets on a specific entity taking into

account several sets of unusual behaviors like those mentioned

in sections C and D. Then, the written tweets will be injected

into Solr. Consequentially, all the tweets including the fakes

we deliberately injected will be retrieved during the search for

the target entity. Proposed AGSD system will identify and

detect the group of spammers that wrote faked reviews to

proofs its feasibility .The accuracy of our proposed can be

determined by calculating recall and precision of our proposed

method.

V. CONCLUSION

Nowadays, individuals and organizations surf social media

to make purchasing decision. Because of this, spammers may

add faked reviews to promote and/or slander target products

[8]. For this reason, it is a must to detect opinion spamming to

ensure that social media remains a trusted reliable source of

public opinions. [11].This study is aim of identifying and

discovering the group spammers who post fake and unreliable

tweets.

This is a search engine for group spammers' detection

system which can search for Arabic product tweets and

identify group of spammers. It is based on group and

individual spam behavior indicators.

REFERENCES

[1] A. Abu Hammad, "An Approach for Detecting Spam in Arabic Opinion

Reviews," January 2013.

[2] A. Atreya, S. Chaudhari1, P. Bhattacharyya1 and G. Ramakrishnan,

"Building Multilingual Search Index using open," in Proceedings of the

3rd Workshop on South and Southeast Asian Natural Language

Processing (SANLP), pages 201–210,COLING 2012, Mumbai,

December 2012.

[3] A. Mukherjee, . B. Liu, J. Wang, . N. Glance and N. Jindal, "Detecting

Group Review Spam," in 20th international conference companion on

World wide web ,2011.

[4] A. Mukherjee, B. Liu and N. Glance, "Spotting Fake Reviewer Groups

in Consumer Reviews," in In Proceedings of the 21st international

conference on World Wide Web,2012.

http://dx.doi.org/10.1145/2187836.2187863

[5] A. Ricardo and C. Serrão, "Comparison of existing open-source tools

for Web crawling and indexing of free music,"

TELECOMMUNICATIONS, vol. 18, no. 1, JANUARY 2013.

[6] C. G. Harris, "Detecting Deceptive Opinion Spam Using Human

Computation," in AAAI Technical Report WS-12-08, 2012.

[7] D. Meyer, "Fake reviews prompt Belkin apology," [Online]. Available:

http://news.cnet.com/8301-1001_3-10145399-92.html.

[8] F. Li, M. Huang, Y. Yang and X. Zhu, "Learning to identify review

spam," in In IJCAI Proceedings-International Joint Conference on

Artificial Intelligence, 2011.

[9] H. A. Wahsheh, M. N. Al-Kabi and I. M. Alsmadi, "SPAR: A system to

detect spam in Arabic opinions," in In Applied Electrical Engineering

and Computing Technologies (AEECT), 2013.

http://dx.doi.org/10.1109/aeect.2013.6716442

[10] K. Shiraly, "Solr text field types, analyzers, tokenizers & filters

explained," 2010. [Online]. Available:

http://www.pathbreak.com/blog/solr-text-field-types-analyzers-

tokenizers-filters-explained.

[11] L. Bing , Sentiment Analysis and Opinion Mining, Synthesis Lectures

on Human Language Technologie, April 22, 2012, pp. 1-167.

[12] L. Yuming , . Z. Tao, H. WuI, . Z. Jingwei, W. Xiaoling and A. Zhou,

"Towards online anti-opinion spam: Spotting fake reviews from the

review sequence," in 2014 IEEE/ACM International Conference on

Advances in Social Networks Analysis and Mining (ASONAM 2014),

2014.

[13] N. Jindal and B. Liu, "Opinion Spam and Analysis," in In Proceedings

of the 2008 International Conference on Web Search and Data Mining

(pp. 219-230).

http://dx.doi.org/10.1145/1341531.1341560

[14] R. Nithish, M. Navaneeth Kishen, S. Sabarish, A. M. Abirami and A.

Askarunisa, "An Ontology based Sentiment Analysis for mobile

products using tweets," in In Advanced Computing (ICoAC), 2013 Fifth

International Conference , 2013

http://dx.doi.org/10.1109/icoac.2013.6921974.

[15] R. T. Khasawneh, H. A. Wahsheh, M. N. AI-Kabi and I. M. Aismadi,

"Sentiment analysis of arabic social media content: a comparative

study," in The 8th International Conference for Internet Technology

and Secured Transactions (ICITST-2013), 2013.

[16] S. Xie, G. Wang, S. Lin and P. Yu, "Review Spam Detection via

Temporal Pattern Discovery," in 18th ACM SIGKDD international

conference on Knowledge discovery and data mining,2012.

http://dx.doi.org/10.1145/2339530.2339662

[17] SolrWiki, "Analyzers, Tokenizers, and Token Filters," 2015. [Online].

Available:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.Rem

oveDuplicatesTokenFilterFactory.

[18] X. Zhou, X. Tao, J. Yong and Z. Yang, "Sentiment analysis on tweets

for social event," in Proceedings of the 2013 IEEE 17th International

Conference on Computer Supported Cooperative Work in Design,

2013.

http://dx.doi.org/10.1109/cscwd.2013.6581022

[19] Y. Ma and F. Li, "Detecting review spam: Challenges and

opportunities," in In Collaborative Computing: Networking,

Applications and Worksharing, 2012 8th International Conference,

2012.

http://dx.doi.org/10.4108/icst.collaboratecom.2012.250640

8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)

http://dx.doi.org/10.15242/IIE.E0516011 23