a new approach for group spam detection in social media...
TRANSCRIPT
Abstract— Social media is the new upcoming logical marketing
field. Nowadays, social media has become an appreciated data
source. Data collected from Twitter, for instance, can be used to
conclude people’s opinions for marketing or social studies. However,
group spammers that aim to promote or damage certain product for
the sake of making unjustified profit is a real challenge. In this study,
we proposed a new approach to detect group spammers in social
media from Arabic text. The proposed approach “AGSD” uses
Apache Lucene with Nutch and Solr open sources to collect Arabic
tweets. Then, the suspected group spammers can be detected by
applying both a group and individual spam detection techniques.
Keywords— Faked Review, Fake Reviewer Group, Group
Opinion Spamming, Opinion Spamming
I. INTRODUCTION
OCIAL media is the new upcoming logical marketing
field. Nowadays, social media has become an appreciated
data source. Data collected from Twitter, for instance, can
be used to conclude people’s opinions for marketing or social
studies [18].
In real world scenario, service providers and customers are
always aware of other people’s opinions about the quality of
products. The huge textual data from social media can be
analyzed and used to support decision making. The extracted
opinions from social media may seriously affect the sales of
products or services if the reliability of the opinions is not
considered and can be misused by any organizations [14]. For
example, in 2009, a manager at Belkin paid for writing
positive reviews on a poor rating router [7].
The main advantage of social media is that it permits anyone
to express his/her opinions without the need for disclosing
his/her true identity. This anonymity encourages people with
bad intentions to post misleading faked opinions about target
organizations, products, or services. Those people are called
opinion spammers and their actions are called opinion
spamming [13]. To get reliable results of opinion mining in
social media, these faked opinions must be detected and
removed. However, with the increasing usage of social media,
opinion spamming will become more and more extensive,
Nujud Aloshban1 is with the Imam Muhammad ibn Saud University,
Riyadh, Saudi Arabia.
Hmood Aldossari2, Chairman of Information Systems Department in King
Saud University, Riyadh, Saudi Arabia.
sophisticated, and harder to detect. Consequently, several
detection algorithms are proposed [11].
The main purpose of opinion spam detection is to detect
spamming faked reviews, fake reviewer, and fake reviewer
group. Detecting one type of spam can help detect other
spams. However, each type has its own specific features and
different methods of detection. .Some people are enticed by
others to write faked reviews. For example, businesses give
discounts in exchange of writing faked positive reviews [11].
Opinion or review spams are the reviews that are irrelevant
to the reviewed product, and the purpose of these reviews is to
fake the product’s quality to mislead the customers [6]. The
motivation of this spam is make more profit by writing
untrustworthy reviews and deceitful ratings [19].
Group spamming is very harmful because a number of
members in a group can control the sentiment on an entity,
thus they can totally mislead the social media. Moreover,
group spammers work in conspire together to promote or to
damage the reputation of a target entity. They may be
individuals who know each other or a single who is registered
with multiple user IDs and is called as Sockpuppeting. Group
spamming has some particular features that can be detected
and analyzed. However, identifying group spammers is a real
challenge and hard to detect. [3], [4].
The aim of this study is to find out the group of spammers in
social media from Arabic text. This is not because of its
unique nature that makes it different from Latin languages but
also due to widespread of usage on the social media. It is
highlighted that using Arabic in the social media is three folds
that is given in English [15].
The rest of the paper is organized as follow. Section II
reviews the literature of spam detection. Section III presents
the proposed framework. Section IV discussion. Finally
Section V concludes the paper.
II. RELATED WORK
Spam detection can be classified into three groups: review
spam, review spammers, and group of spammer’s detection
[12]. While many researches have been conducted to detect the
first two groups, the third group has a little attention and needs
more exploration.
Several researches in the literature have been proposed for
review spam detection. Lin et al., [12] analyzed the features of
faked reviews. They proposed six-time-sensitive
characteristics to distinguish the faked reviews based on
A New Approach for Group Spam Detection in
Social Media for Arabic Language (AGSD)
Nujud Aloshban1, and Hmood Aldossari
2
S
8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)
http://dx.doi.org/10.15242/IIE.E0516011 20
reviewer behaviors and review contents. They suggested a
supervised solution and a threshold-based solution to early
identify the faked reviews. Xie et al., [16] observed that
spammer’s attacks usually burst and had either positive or
negative association to the rating. They claimed that to detect
spams oddly correlated temporal patterns were to be detected.
The authors constructed multidimensional time series on
aggregate statistic to detect these patterns and developed a
hierarchical algorithm to detect the time windows robustly.
Their experimental results have showed that the proposed
method was effective in reviewing spams detection.
Many researches were proposed to detect spammers. Li et
al. [8] proposed two semi-supervised methods that can be
applied on a large amount of unlabeled data based on machine
learning algorithms to identify the spammers. Their results
proved that the proposed method was effective in detecting
spammers.
To the best of our knowledge, only one study was conducted
by Mukherjee et al. [4] to detect group of spammers. This
study proposed a new algorithm that based on unsupervised
machine learning algorithms. The algorithm consists of three
steps: frequent pattern mining, identification of a set of feature
indicators and ranking. Frequent pattern mining detects
repeated patterns to find the group of spammers who targeted
the same set of products and services. Classification uses a set
of criteria including group review content similarity, group
rating deviation, writing reviews together in a short time
window, writing reviews right after the product launch, and
many others to classify the detected groups as group
spammers. Ranking is performed procedure using relational
model steps [3], [4]. The proposed algorithm was conducted
on Latin language as well as implemented as test bed using
WEKA to calculate the results without complete
implementation.
Wahsheh et al. [9] proposed Arabic opinions spam detection
system (SPAR) based on reviews that are posted through
Maktoob-Yahoo social network. SPAR is a spam detection
system that can read opinions in Arabic language and
categorize them as: spam or non-spam. Using special metrics,
SPAR can detect the level of spam for each spam opinion.
Additionally, SPAR classifies each non-spam opinion into
three classes; positive, negative, and neutral using support
vector machine classifier. Their experiments have proved that
SPAR was effective to detect Arabic opinions spam. Hammad
[1] applied Support Vector Machine (SVM), Naive Bayes
(NB) and K-Nearest Neighbor (KNN) were used to classify
the Arabic opinion reviews into spam or non-spam classes.
The study divided spams into: reviews about brands only, non-
reviews, irrelevant reviews, and general reviews. Moreover,
the inconsistency between review comments and the review
rate is examined.
In this study, we extend Mukherjee’s algorithm [4] to detect
group of spammers in social media from Arabic text.
III. FRAMEWORK
In this paper, we propose AGSD (Arabic Group Spammers
Detection) to identify and detect group spammers in social
media from Arabic text. It is based on several open sources to
deal with the Arabic texts written in the social media. AGSD
has four main phases: crawling, preprocessing, spamming
activities detection, and Individual member behavior scanning.
A. Crawling and Indexing
The proposed AGSD is based on Apache Lucene, Nutch
and Solr open sources to build a complete search engine.
Lucene is not a complete search engine but an indexing library
that is used to generate inverted index. To index web
documents, Lucene needs a crawler to collect web documents
such as Apache Nutch [2] .Nutch works under the Hadoop
framework and it supports cluster capabilities, distributed
computation, using MapReduce and a distributed file system
which makes it scalable and cost effective [5]. To improve the
system implementation, Apache Solr are used to give indexing
and searching abilities based on Lucene library [2].The
advantage of Nutch and Solr is that it can handle big data
effectively. Finally, PHP web search page created in order to
make the decision maker or any users able to search for any
target product (see Fig. 1).
Web crawlers, are systems that collect a corpus of web
pages for web indexing and querying. Web indexing, is index
processing of already downloaded pages by crawler so that
users allow to make queries.
In AGSD, the Apache Nutch web crawler is used and it is
integrated with Apache Solr. Apache Nutch is configured to
crawl the twitter website only. The AGSD crawls all the tweets
in twitter website, stores and indexes the details of some web
pages. Solr stores the important information related to each
URL such as title, URL, content of the page, anchors to other
pages, metatag description and metatag keywords. After
storing and preprocessing, the data is ready to be searched
using Solr .The next subsection will describe the preprocessing
phase.
Fig. 1 Search Page
8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)
http://dx.doi.org/10.15242/IIE.E0516011 21
Fig. 2 Solr Libraries Based on Lucene Libraries [10]
B. Preprocessing
Solr works on documents, where each document consists of
searchable fields such as id, title, content, anchor and URL.
The rules for searching each field are shaped using field type
definitions such as string, long, URL, and Arabic text. A field
type definition describes the analyzers, tokenizes and filters
which control searching behavior [10]. We have two
analyzers, one is for indexing time and the other is for
querying time. When a document is added or updated, its
fields will be analyzed and tokenized, and those tokens are
stored in the Solr’s index. When a query is issued, the query
text is again analyzed, tokenized and then matched against
tokens in the Solr’s index. This critical function of
tokenization is handled by Tokenizer components. Also, there
are TokenFilter components which modify the token stream
[10].
Solr used the following libraries that are based on Lucene
libraries as seen in Fig. 2 .The libraries that used in this study
are:
WhitespaceTokenizerFactory: it determines which tokens of
characters are separated by splitting on whitespace [17].
WordDelimiterFilterFactory: It splits words into sub-words
and performs optional transformations on sub-word groups.
One use for WordDelimiterFilter is to help match words
with different delimiters [17].
SynonymFilterFactory: It matches strings of tokens and
replaces them with other strings of tokens using an external
file to define the synonyms. For this reason, an Arabic
synonyms file is manually made by writing synonyms from
an Arabic dictionary (around 200 Arabic synonyms) as
shown in Fig. 3. When expand property is true, a synonym
will be expanded to all equivalent synonyms during query
time. E.g. = طباخطاهي [17].
StopFilterFactory: Discards common words by using
external stop-words file, which is made specifically for
Arabic language. E.g.: "حتى" and "[17] "كان.
ArabicNormalizationFilterFactory. It makes normalization
for efficiency operating on an Arabic term buffer. For
example: normalization أ to [17] ا.
ArabicAnalyzer. It implements light-stemming as specified
by ArabicStemFilter .For instance, removal of an attached
definite article, conjunction, and prepositions. In addition
to stemming of common suffixes [17].
RemoveDuplicatesTokenFilterFactory. It considers any
tokens in the same logical position in the token stream as
previous tokens and remove them. [17].
Fig. 3 Arabic Synonyms
C. Spamming Activities Detection
The spamming activities that may indicate a spam which
identified by Mukherjee et al. (2012) are:
1. Group Time Window (GTW)
This rule detects group members who post tweet on a
particular entity during a short time interval, which is
considered a spamming activity. If there are tweets on the
searched entity that posted in the same day more than π tweets
(π is a constant value), these tweets of that day are suspected
tweets. Consequently, their writers are suspected spammers of
this entity.
2. Group Content Similarity (GCS)
The similarity of the tweet’s content is to be checked against
all other tweets on the searched entity. If the similarity is more
than β (β is a constant value that represents similarity score),
then these tweets are suspected to be spam and their writers are
suspected to be spammers.
3. Group Early Time Frame (GETF)
This indicator detects if a group members are the first
people to post tweet on entity. The exact date of launching
each entity is not available. Consequently, the first review of
this entity is considered its launching date. If the tweet date is
smaller than α + λ (α is the earliest tweet date, and λ is a
constant value that represents the period of the early tweets),
then these tweets are suspected to be spam and their writers are
suspected to be spammers
D. Individual Members' Behaviors Scanning
The individual members' behavior is monitored to extract
the suspected spammers [4]. The tweets' writers of the
previous stage (in section C), are the only ones to be checked
again in this stage. The revealed writers of this stage (stage D)
are considered the suspected group of spammers of a definite
identity.
8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)
http://dx.doi.org/10.15242/IIE.E0516011 22
1. Individual Content Similarity (ICS)
Individual spammers may review a product many times
posting duplicate or near duplicate reviews to increase the
product popularity. ICS models the activity of a particular
reviewer to examine all his/her tweets on a product to find out
if there are duplicated or near duplicated tweets on this
particular entity.
2. Individual Early Time Frame (IETF)
Like GETF, this checks if a reviewer is one of the reviewers
who made the tweet during the early time of lunching a
product.
3. Individual Member Coupling in a Group (IMC)
IMC detects if members worked with other members of the
group in posting faked reviews at the same date about the same
product. This is considered tightly coupled members which
indicate group spamming.
IV. DISCUSSION
This research is still at an early stage and many issues need
to be considered. For example, how to evaluate the proposed
approach is a challenge and needs more explanation. In our
context, a keyword about a given product can be entered to
retrieve relevant tweets about the target product. For example,
when searching for شركة االتصاالت, the proposed system obtains
all the tweets relevant to the subject. The potential group of
spammers who tweets on this entity is identified by applying
group and individual indicators as explained in the previous
sections.
To evaluate the proposed model, we intend to write some
fake and genuine tweets on a specific entity taking into
account several sets of unusual behaviors like those mentioned
in sections C and D. Then, the written tweets will be injected
into Solr. Consequentially, all the tweets including the fakes
we deliberately injected will be retrieved during the search for
the target entity. Proposed AGSD system will identify and
detect the group of spammers that wrote faked reviews to
proofs its feasibility .The accuracy of our proposed can be
determined by calculating recall and precision of our proposed
method.
V. CONCLUSION
Nowadays, individuals and organizations surf social media
to make purchasing decision. Because of this, spammers may
add faked reviews to promote and/or slander target products
[8]. For this reason, it is a must to detect opinion spamming to
ensure that social media remains a trusted reliable source of
public opinions. [11].This study is aim of identifying and
discovering the group spammers who post fake and unreliable
tweets.
This is a search engine for group spammers' detection
system which can search for Arabic product tweets and
identify group of spammers. It is based on group and
individual spam behavior indicators.
REFERENCES
[1] A. Abu Hammad, "An Approach for Detecting Spam in Arabic Opinion
Reviews," January 2013.
[2] A. Atreya, S. Chaudhari1, P. Bhattacharyya1 and G. Ramakrishnan,
"Building Multilingual Search Index using open," in Proceedings of the
3rd Workshop on South and Southeast Asian Natural Language
Processing (SANLP), pages 201–210,COLING 2012, Mumbai,
December 2012.
[3] A. Mukherjee, . B. Liu, J. Wang, . N. Glance and N. Jindal, "Detecting
Group Review Spam," in 20th international conference companion on
World wide web ,2011.
[4] A. Mukherjee, B. Liu and N. Glance, "Spotting Fake Reviewer Groups
in Consumer Reviews," in In Proceedings of the 21st international
conference on World Wide Web,2012.
http://dx.doi.org/10.1145/2187836.2187863
[5] A. Ricardo and C. Serrão, "Comparison of existing open-source tools
for Web crawling and indexing of free music,"
TELECOMMUNICATIONS, vol. 18, no. 1, JANUARY 2013.
[6] C. G. Harris, "Detecting Deceptive Opinion Spam Using Human
Computation," in AAAI Technical Report WS-12-08, 2012.
[7] D. Meyer, "Fake reviews prompt Belkin apology," [Online]. Available:
http://news.cnet.com/8301-1001_3-10145399-92.html.
[8] F. Li, M. Huang, Y. Yang and X. Zhu, "Learning to identify review
spam," in In IJCAI Proceedings-International Joint Conference on
Artificial Intelligence, 2011.
[9] H. A. Wahsheh, M. N. Al-Kabi and I. M. Alsmadi, "SPAR: A system to
detect spam in Arabic opinions," in In Applied Electrical Engineering
and Computing Technologies (AEECT), 2013.
http://dx.doi.org/10.1109/aeect.2013.6716442
[10] K. Shiraly, "Solr text field types, analyzers, tokenizers & filters
explained," 2010. [Online]. Available:
http://www.pathbreak.com/blog/solr-text-field-types-analyzers-
tokenizers-filters-explained.
[11] L. Bing , Sentiment Analysis and Opinion Mining, Synthesis Lectures
on Human Language Technologie, April 22, 2012, pp. 1-167.
[12] L. Yuming , . Z. Tao, H. WuI, . Z. Jingwei, W. Xiaoling and A. Zhou,
"Towards online anti-opinion spam: Spotting fake reviews from the
review sequence," in 2014 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM 2014),
2014.
[13] N. Jindal and B. Liu, "Opinion Spam and Analysis," in In Proceedings
of the 2008 International Conference on Web Search and Data Mining
(pp. 219-230).
http://dx.doi.org/10.1145/1341531.1341560
[14] R. Nithish, M. Navaneeth Kishen, S. Sabarish, A. M. Abirami and A.
Askarunisa, "An Ontology based Sentiment Analysis for mobile
products using tweets," in In Advanced Computing (ICoAC), 2013 Fifth
International Conference , 2013
http://dx.doi.org/10.1109/icoac.2013.6921974.
[15] R. T. Khasawneh, H. A. Wahsheh, M. N. AI-Kabi and I. M. Aismadi,
"Sentiment analysis of arabic social media content: a comparative
study," in The 8th International Conference for Internet Technology
and Secured Transactions (ICITST-2013), 2013.
[16] S. Xie, G. Wang, S. Lin and P. Yu, "Review Spam Detection via
Temporal Pattern Discovery," in 18th ACM SIGKDD international
conference on Knowledge discovery and data mining,2012.
http://dx.doi.org/10.1145/2339530.2339662
[17] SolrWiki, "Analyzers, Tokenizers, and Token Filters," 2015. [Online].
Available:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.Rem
oveDuplicatesTokenFilterFactory.
[18] X. Zhou, X. Tao, J. Yong and Z. Yang, "Sentiment analysis on tweets
for social event," in Proceedings of the 2013 IEEE 17th International
Conference on Computer Supported Cooperative Work in Design,
2013.
http://dx.doi.org/10.1109/cscwd.2013.6581022
[19] Y. Ma and F. Li, "Detecting review spam: Challenges and
opportunities," in In Collaborative Computing: Networking,
Applications and Worksharing, 2012 8th International Conference,
2012.
http://dx.doi.org/10.4108/icst.collaboratecom.2012.250640
8th International Conference on Latest Trends in Engineering and Technology (ICLTET'2016) May 5-6 2016 Dubai (UAE)
http://dx.doi.org/10.15242/IIE.E0516011 23