


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS

SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

Derek Davis, Gerardo Figueroa, and Yi-Shin Chen

Abstract—Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and only capture the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data—hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework—SociRank—which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

Index Terms—Information filtering, social computing, social network analysis, topic identification, topic ranking.

I. INTRODUCTION

THE mining of valuable information from online sources has become a prominent research area in information technology in recent years. Historically, knowledge that apprises the general public of daily events has been provided by mass media sources, specifically the news media. Many of these news media sources have either abandoned their hard-copy publications and moved to the World Wide Web, or now produce both hard-copy and Internet versions simultaneously. These news media sources are considered reliable

Manuscript received July 21, 2015; revised September 14, 2015; accepted October 17, 2015. This work was supported by the Ministry of Science and Technology, China, under Grant MOST104-2221-E-007-136 and Grant MOST103-2221-E-007-092. This paper was recommended by Associate Editor F. Wang.

D. Davis and G. Figueroa are with the Institute of Information Systems and Applications, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]; [email protected]).

Y.-S. Chen is with the Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMC.2016.2523932

because they are published by professional journalists, who are held accountable for their content. On the other hand, the Internet, being a free and open forum for information exchange, has recently seen a fascinating phenomenon known as social media. In social media, regular, nonjournalist users are able to publish unverified content and express their interest in certain events.

Microblogs have become one of the most popular social media outlets. One microblogging service in particular, Twitter, is used by millions of people around the world, providing enormous amounts of user-generated data. One may assume that this source potentially contains information with equal or greater value than the news media, but one must also assume that because of the unverified nature of the source, much of this content is useless. For social media data to be of any use for topic identification, we must find a way to filter out uninformative content and capture only information which, based on its content similarity to the news media, may be considered useful or valuable.

The news media presents professionally verified occurrences or events, while social media presents the interests of the audience in these areas, and may thus provide insight into their popularity. Social media services like Twitter can also provide additional or supporting information to a particular news media topic. In summary, truly valuable information may be thought of as the area in which these two media sources topically intersect. Unfortunately, even after the removal of unimportant content, there is still information overload in the remaining news-related data, which must be prioritized for consumption.

To assist in the prioritization of news information, news must be ranked in order of estimated importance. The temporal prevalence of a particular topic in the news media indicates that it is widely covered by news media sources, making it an important factor when estimating topical relevance. This factor may be referred to as the MF of the topic. The temporal prevalence of the topic in social media, specifically in Twitter, indicates that users are interested in the topic and can provide a basis for the estimation of its popularity. This factor is regarded as the UA of the topic. Likewise, the number of users discussing a topic and the interaction between them also gives insight into topical importance, referred to as the UI. By combining these three factors, we gain insight into topical importance and are then able to rank the news topics accordingly.

Consolidated, filtered, and ranked news topics from both professional news providers and individuals have

2168-2216 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


several benefits. The most evident use is the potential to improve the quality and coverage of news recommender systems or Web feeds by adding user popularity feedback. Additionally, news topics that perhaps were not perceived as popular by the mass media could be uncovered from social media and given more coverage and priority. For instance, a particular story that has been discontinued by news providers could be given resurgence and continued if it is still a popular topic among social networks. This information, in turn, can be filtered to discover how particular topics are discussed in different geographic locations, which serves as feedback for businesses and governments.

A straightforward approach for identifying topics from different social and news media sources is the application of topic modeling. Many methods have been proposed in this area, such as latent Dirichlet allocation (LDA) [1] and probabilistic latent semantic analysis (PLSA) [2], [3]. Topic modeling is, in essence, the discovery of “topics” in text corpora by clustering together frequently co-occurring words. This approach, however, misses the temporal component of prevalent topic detection; that is, it does not take into account how topics change with time. Furthermore, topic modeling and other topic detection techniques do not rank topics according to their popularity by taking into account their prevalence in both news media and social media.

We propose an unsupervised system—SociRank—which effectively identifies news topics that are prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Even though this paper focuses on news topics, it can be easily adapted to a wide variety of fields, from science and technology to culture and sports. To the best of our knowledge, no other work attempts to use either the social media interests of users or their social relationships to aid in the ranking of topics. Moreover, SociRank is an empirical framework that comprises and integrates several techniques, such as keyword extraction, similarity measures, graph clustering, and social network analysis. The effectiveness of our system is validated by extensive controlled and uncontrolled experiments.

To achieve its goal, SociRank uses keywords from news media sources (for a specified period of time) to identify the overlap with social media from that same period. We then build a graph whose nodes represent these keywords and whose edges depict their co-occurrences in social media. The graph is then clustered to clearly identify distinct topics. After obtaining well-separated topic clusters (TCs), the factors that signify their importance are calculated: MF, UA, and UI. Finally, the topics are ranked by an overall measure that combines these three factors.

The remainder of this paper is organized as follows. Section II reviews previous work on topic identification and other research areas that are implemented in our method, including keyword extraction, co-occurrence similarity measures, graph clustering, topic ranking, and social network analysis. Section III describes the overall framework of SociRank and all of its stages. Section IV presents the setup and results of our experiments. Finally, we provide the conclusions and future work in Section V.

II. RELATED WORK

The main research areas applied in this paper include topic identification, topic ranking, social network analysis, keyword extraction, co-occurrence similarity measures, and graph clustering. Extensive work has been conducted in most of these areas.

A. Topic Identification

Much research has been carried out in the field of topic identification—referred to more formally as topic modeling. Two traditional methods for detecting topics are LDA [1] and PLSA [2], [3]. LDA is a generative probabilistic model that can be applied to different tasks, including topic identification. PLSA, similarly, is a statistical technique, which can also be applied to topic modeling. In these approaches, however, temporal information is lost, which is paramount in identifying prevalent topics and is an important characteristic of social media data. Furthermore, LDA and PLSA only discover topics from text corpora; they do not rank them based on popularity or prevalence.

Wartena and Brussee [4] implemented a method to detect topics by clustering keywords. Their method entails the clustering of keywords—based on different similarity measures—using the induced k-bisecting clustering algorithm [5]. Although they do not use graphs, they do observe that a distance measure based on the Jensen–Shannon divergence (or information radius [6]) of probability distributions performs well.
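For illustration, the Jensen–Shannon divergence between two discrete distributions (e.g., the co-occurrence profiles of two keywords over a shared vocabulary) can be computed directly from its definition; the distributions below are toy numbers, and base-2 logarithms keep the value in [0, 1].

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2), assuming q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions,
    at most 1 (with base-2 logs) for distributions with disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy co-occurrence profiles of two keywords over a shared vocabulary.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.2, 0.3, 0.4]
print(js_divergence(p, q))
print(js_divergence(p, p))  # -> 0.0
```

Unlike KL divergence, this measure is symmetric and always finite, which is why it works well as a distance between keyword distributions.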

More recently, research has been conducted in identifying topics and events from social media data, taking into account temporal information. Cataldi et al. [7] proposed a topic detection technique that retrieves real-time emerging topics from Twitter. Their method uses the set of terms from tweets and models their life cycle according to a novel aging theory. Additionally, they take into account social relationships—more specifically, the authority of the users in the network—to determine the importance of the topics. Zhao et al. [8] carried out similar work by developing a Twitter-LDA model designed to identify topics in tweets. Their work, however, only considers the personal interests of users, and not prevalent topics at a global scale.

Another trending area of related research is the detection of “bursty” topics (i.e., topics or events that occur in short, sudden episodes). Diao et al. [9] proposed a method that uses a state machine to detect bursty topics in microblogs. Their method also determines whether user posts are personal or refer to a particular trending topic. Yin et al. [10] also developed a model that detects topics from social media data, distinguishing between temporal and stable topics. These methods, however, only use data from microblogs and do not attempt to integrate them with real news. Additionally, the detected topics are not ranked by popularity or prevalence.


B. Topic Ranking

Another major concept incorporated into this paper is topic ranking. There are several means by which this task can be accomplished; traditionally, it is done by estimating how frequently and recently a topic has been reported by the mass media.

Wang et al. [11] proposed a method that takes into account the users’ interest in a topic by estimating the number of times they read stories related to that particular topic. They refer to this factor as the UA. They also used an aging theory developed by Chen et al. [12] to create, grow, and destroy a topic. The life cycles of the topics are tracked by using an energy function. The energy of a topic increases when it becomes popular and it diminishes over time unless it remains popular. We employ variants of the concepts of MF and UA to meet our needs, as these concepts are both logical and effective.

Other works have made use of Twitter to discover news-related content that might be considered important. Sankaranarayanan et al. [13] developed a system called TwitterStand, which identifies tweets that correspond to breaking news. They accomplish this by utilizing a clustering approach for tweet mining. Phelan et al. [14] developed a recommendation system that generates a ranked list of news stories. News stories are ranked based on the co-occurrence of popular terms within the users’ RSS and Twitter feeds. Both of these systems aim to identify emerging topics, but give no insight into their popularity over time. Moreover, the work by Phelan et al. [14] only produces a personalized ranking (i.e., news articles tailored specifically to the content of a single user), rather than providing an overall ranking based on a sample of all users. Nevertheless, these works provide us with a basis for extending the premise of UA.

Research has also been carried out in topic discovery and ranking from other domains. Shubhankar et al. [15] developed an algorithm that detects and ranks topics in a corpus of research papers. They used closed frequent keyword-sets to form topics and a modification of the PageRank [16] algorithm to rank them. Their work, however, does not integrate or collaborate with other data sources, as accomplished by SociRank.

C. Social Network Analysis

In the case of UA, Wang et al. [11] estimated this factor by using anonymous website visitor data. Their method counts the number of times a site was visited during a particular period of time, which represents the UA of the topic to which the site is related. Our belief, on the other hand, is that, although website usage statistics provide initial proof of attention, additional data are needed to corroborate it. We use social media, specifically Twitter, as a means to estimate UA. When a user tweets about a particular topic, it signifies that the user is interested in the topic and that it has captured her attention more so than visiting a website related to it. In summary, visiting a website might be the initial stimulus, but taking the additional step of discussing a topic via social media signifies genuine attention.1

Additionally, we believe that the relationship between social media users who discuss the same topics also plays a key role in topic relevance. Kwan et al. [17] proposed a measure referred to as reciprocity, which attempts to detect the interaction between social media users and perceive their engagement in relation to a particular topic. Higher reciprocity means greater interaction between users, and thus topics with higher reciprocity should be considered more important because of their underlying community structure. We can inherently identify the power and influence of a well-structured community as opposed to a decentralized and unstructured one. Our method applies this logic to support the idea that higher reciprocity signifies greater importance.
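As a rough sketch of this idea (not the exact measure of Kwan et al. [17]), graph reciprocity, i.e., the fraction of directed interactions that are answered, can be computed with networkx on a hypothetical reply graph:

```python
import networkx as nx

# Hypothetical reply/mention graph for one topic:
# an edge u -> v means user u replied to or mentioned user v.
G = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c")])

# Overall reciprocity: the fraction of directed edges whose reverse
# edge also exists (1.0 = every interaction answered, 0.0 = one-way).
print(nx.reciprocity(G))  # 4 of the 5 edges are reciprocated -> 0.8
```

A topic whose discussion graph scores near 1.0 is backed by genuine back-and-forth conversation rather than one-way broadcasting, matching the intuition above.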

D. Keyword Extraction

Concerning the field of keyword or informative term extraction, many unsupervised and supervised methods have been proposed. Unsupervised methods for keyword extraction rely solely on implicit information found in individual texts or in a text corpus. Supervised methods, on the other hand, make use of training datasets that have already been classified.

Among the unsupervised methods, there are those that employ statistical measures of term informativeness or relevance, such as term specificity [18], TFIDF [19], word frequency [20], n-grams [21], and word co-occurrence [22]. Other unsupervised approaches are graph-based, where a text is converted into a graph whose nodes represent text units (e.g., words, phrases, and sentences) and whose edges represent the relationships between these units. The graph is then recursively iterated and relevance scores are assigned to each node using different approaches. A popular example of a graph-based keyword extraction method is TextRank, proposed by Mihalcea and Tarau [23]; it utilizes the premise of the popular PageRank [16] algorithm.

There has also been much work on keyword extraction using supervised and hybrid approaches. Two traditional supervised frameworks are KEA [24] and GenEx [25], which use machine learning algorithms for the effective extraction of keywords. Other innovative approaches for keyword extraction have been proposed in recent years, including the application of neural networks [26]–[28] and conditional random fields [29]. Hybrid methods (i.e., methods that make use of unsupervised and supervised components) have been proposed as well, such as HybridRank [30], which makes use of collaboration between the two approaches.

Due to its simple implementation, we use TextRank [23] to extract keywords from the news media sources. Furthermore, TextRank does not require training or any document corpus for its operation.
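A minimal TextRank-style sketch is shown below: build a word co-occurrence graph over a sliding window and score nodes with PageRank. The window size, the simple length filter, and the absence of POS filtering and phrase collapsing are all simplifications of the published algorithm.

```python
import re
import networkx as nx

def textrank_keywords(text, window=2, top_k=5):
    """Simplified TextRank-style keyword extraction: words co-occurring
    within a sliding window become connected nodes, and PageRank ranks them."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3]
    G = nx.Graph()
    for i, w in enumerate(words):
        for u in words[i + 1 : i + 1 + window]:
            if u != w:
                G.add_edge(w, u)
    scores = nx.pagerank(G)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(textrank_keywords(
    "news topics in social media rank news topics by social media attention"))
```

Words that co-occur with many distinct neighbors accumulate score, so frequently connected terms surface first, mirroring how PageRank rewards well-linked pages.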

E. Co-Occurrence Similarity

Matsuo and Ishizuka [22] suggested that the co-occurrence relationship of frequent word pairs from a single document

1The works by Sankaranarayanan et al. [13] and Phelan et al. [14] support this premise.


may provide statistical information to aid in the identification of the document’s keywords. They proposed that if the probability distribution of co-occurrence between a term x and all other terms in a document is biased to a particular subset of frequent terms, then term x is likely to be a keyword. Even though our intention is not to employ co-occurrence for keyword extraction, this hypothesis emphasizes the importance of co-occurrence relationships.

Chen et al. [31] proposed a novel co-occurrence similarity measure in which they measure the association of terms using snippets returned by Web searches. They refer to this measure as co-occurrence double checking (CODC). Bollegala et al. [32] proposed a method that uses page counts and text snippets from Web searches to measure the similarity between words or entities. They compared their method with CODC [31], as well as with variants of several other co-occurrence similarity measures, such as the overlap (Simpson) [33], Dice [34], point-wise mutual information (PMI) [35], Jaccard [36], and cosine similarities.

Since we are more interested in establishing the importance of the word-pair co-occurrence distribution in the actual corpus of tweets, we did not employ Bollegala’s or Chen’s semantic similarity methods. In this paper, we tested other similarity measures, and found that the Dice similarity measure provided the best results.
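Concretely, for a pair of terms the Dice coefficient over document frequencies reduces to a one-liner; the counts below are made up for illustration.

```python
def dice(cooc, df_x, df_y):
    """Dice coefficient for a term pair: twice the number of documents
    (here, tweets) containing both terms, over the sum of the two
    individual document frequencies."""
    return 2.0 * cooc / (df_x + df_y) if (df_x + df_y) else 0.0

# Made-up counts: "election" appears in 40 tweets, "vote" in 20, both in 15.
print(dice(15, 40, 20))  # -> 0.5
```

The score ranges from 0 (the terms never co-occur) to 1 (the terms always appear together), making it directly usable as an edge weight in a key term graph.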

F. Graph Clustering

The main purpose of graph clustering in this paper is to identify and separate TCs, as done in Wartena and Brussee’s work [4]. Iwasaka and Tanaka-Ishii [37] also proposed a method that clusters a co-occurrence graph based on a graph measure known as transitivity. The basic idea of transitivity is that in a relationship between three elements, if the relationship holds between the first and second elements and between the second and third elements, it also holds between the first and third elements. They suggested that each output cluster is expected to have no ambiguity, and that this is only achieved when the edges of a graph (representing co-occurrence relations) are transitive.

Matsuo et al. [38] employed a different approach to achieve the clustering of co-occurrence graphs. They used Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness measure of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between different clusters must go along one of these edges. Consequently, the edges connecting different clusters will have high edge betweenness, and removing them iteratively will yield well-defined clusters.

Newman realized, however, that the main disadvantage of the algorithm was its high computational demand, and thus proposed a new method to identify clusters based on modularity [40]. Modularity is a measure designed to estimate the strength of division of a network into clusters. Networks that possess a high modularity value have dense connections between the nodes within each cluster, but sparse connections between nodes in different clusters. Newman’s new algorithm calculated modularity as it progressed, making it simple to find the optimal clustering structure.

Given their effectiveness, the concepts of betweenness and transitivity are both applied in our graph clustering algorithm.

III. SOCIRANK FRAMEWORK

The goal of our method—SociRank—is to identify, consolidate and rank the most prevalent topics discussed in both news media and social media during a specific period of time. The system framework can be visualized in Fig. 1. To achieve its goal, the system must undergo four main stages.

1) Preprocessing: Key terms are extracted and filtered from news and social data corresponding to a particular period of time.

2) Key Term Graph Construction: A graph is constructed from the previously extracted key term set, whose vertices represent the key terms and whose edges represent the co-occurrence similarity between them. The graph, after processing and pruning, contains slightly joint clusters of topics popular in both news media and social media.

3) Graph Clustering: The graph is clustered in order to obtain well-defined and disjoint TCs.

4) Content Selection and Ranking: The TCs from the graph are selected and ranked using the three relevance factors (MF, UA, and UI).

Initially, news and tweets data are crawled from the Internet and stored in a database. News articles are obtained from specific news websites via their RSS feeds and tweets are crawled from the Twitter public timeline [41]. A user then requests an output of the top k ranked news topics for a specified period of time between date d1 (start) and date d2 (end).

A. Preprocessing

In the preprocessing stage, the system first queries all news articles and tweets from the database that fall within dates d1 and d2. Additionally, two sets of terms are created: one for the news articles and one for the tweets, as explained below.

1) News Term Extraction: The set of terms from the news data source consists of keywords extracted from all the queried articles. Due to its simple implementation and effectiveness, we implement a variant of the popular TextRank algorithm [23] to extract the top k keywords from each news article.2 The selected keywords are then lemmatized using the WordNet lemmatizer in order to consider different inflected forms of a word as a single item. After lemmatization, all unique terms are added to set N. It is worth pointing out that, since N is a set, it does not contain duplicate terms.
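The bookkeeping for set N can be sketched as follows. This is only a structural sketch: lowercasing stands in for the WordNet lemmatization step, and the keyword lists and stop words are made up.

```python
def news_term_set(article_keywords, stop_words):
    """Structural sketch of building term set N: drop stop words and
    short terms, normalize, and deduplicate across all queried articles.
    (The paper lemmatizes with WordNet; lowercasing stands in here.)"""
    N = set()
    for keywords in article_keywords:
        for kw in keywords:
            kw = kw.lower()                    # stand-in for lemmatization
            if len(kw) >= 3 and kw not in stop_words:
                N.add(kw)                      # N is a set: no duplicates
    return N

# Hypothetical per-article keyword lists from TextRank.
articles = [["Election", "vote", "poll"], ["election", "turnout", "to"]]
print(sorted(news_term_set(articles, {"the", "and"})))
# -> ['election', 'poll', 'turnout', 'vote']
```

Note how "Election" and "election" collapse into one term and the two-character "to" is filtered out, matching the length and deduplication rules described above.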

2) Tweets Term Extraction: For the tweets data source, the set of terms is not the tweets’ keywords, but all unique and relevant terms. First, the language of each queried tweet is identified, disregarding any tweet that is not in English. From the remaining tweets, all terms that appear in a stop word

2Terms that appear in a predefined stop word list or that are less than three characters in length are not taken into consideration.


This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

DAVIS et al.: SOCIRANK: IDENTIFYING AND RANKING PREVALENT NEWS TOPICS USING SOCIAL MEDIA FACTORS 5

Fig. 1. SociRank framework.

list or that are less than three characters in length are eliminated. The part of speech (POS) of each term in the tweets is then identified using a POS tagger [42]. This POS tagger is especially useful because it can identify Twitter-specific POSs, such as hashtags, mentions, and emoticon symbols.

Hashtags are of great interest to us because of their potential to hold the topical focus of a tweet. However, hashtags usually contain several words joined together, which must be segmented in order to be useful. To solve this problem, we make use of the Viterbi segmentation algorithm [43]. The segmented terms are then tagged as "hashtag."
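To illustrate the idea, Viterbi-style segmentation can be sketched as a dynamic program that chooses the word boundaries maximizing the product of word probabilities. This is a minimal sketch, not the authors' implementation; the toy `WORD_PROB` table and the `segment` helper are our own illustrative assumptions.

```python
import math

# Toy unigram model (an assumption for illustration only).
WORD_PROB = {'news': 0.03, 'break': 0.02, 'breaking': 0.02}

def segment(text, probs):
    """Return the most probable segmentation of `text` into known words."""
    best = {0: (0.0, [])}  # prefix length -> (log-probability, word list)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if start in best and word in probs:
                score = best[start][0] + math.log(probs[word])
                if end not in best or score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    # Fall back to the unsegmented text if no full segmentation exists.
    return best.get(len(text), (None, [text]))[1]
```

For example, `segment('breakingnews', WORD_PROB)` yields `['breaking', 'news']`, since splitting at "break" leaves the unknown remainder "ingnews".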

To eliminate terms that are not relevant, only terms tagged as hashtag, noun, adjective, or verb are selected. The terms are then lemmatized and added to set T, which represents all unique terms that appear in tweets from dates d1 to d2.

B. Key Term Graph Construction

In this component, a graph G is constructed, whose clustered nodes represent the most prevalent news topics in both news and social media. The vertices in G are unique terms selected from N and T, and the edges are represented by a relationship between these terms. In the following sections, we define a method for selecting the terms and establishing a relationship between them. After the terms and relationships are identified, the graph is pruned by filtering out unimportant vertices and edges.

1) Term Document Frequency: First, the document frequency of each term in N and T is calculated accordingly. In the case of term set N, the document frequency of each term n is equal to the number of news articles (from dates d1 to d2) in which n has been selected as a keyword; it is represented as df(n). The document frequency of each term t in set T is calculated in a similar fashion. In this case, however, it is the number of tweets in which t appears; it is represented as df(t). For simplification purposes, we will henceforth refer to the document frequency as "occurrence." Thus, df(n) is the occurrence of term n and df(t) is the occurrence of term t.

2) Relevant Key Term Identification: Let us recall that set N represents the keywords present in the news and set T represents all relevant terms present in the tweets (from dates d1 to d2). We are primarily interested in the important news-related terms, as these signal the presence of a news-related topic. Additionally, part of our objective is to extract the topics that are prevalent in both news and social media. To achieve this, a new set I is formed

I = N ∩ T. (1)

This intersection of N and T eliminates terms from T that are not relevant to the news and terms from N that are not mentioned in the social media.

Set I, however, still contains many potentially unimportant terms. To solve this problem, terms in I are ranked based on their prevalence in both sources. In this case, prevalence is interpreted as the occurrence of a term, which in turn is the term's document frequency. The prevalence of a term is thus a combination of its occurrence in both N and T. The prevalence p of each term i in I is calculated such that half of its weight is based on the occurrence of the term in the news media, and the other half on its occurrence in social media

∀i ∈ I : p(i) = (df(n) × |T|/|N| + df(t)) / (2|T|) (2)

where |T| is the total number of tweets selected between dates d1 and d2, and |N| is the total number of news articles selected in the same time period.
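The prevalence computation above can be sketched as follows; the variable names (`df_news`, `df_tweets`, `n_articles`, `n_tweets`) are our own, chosen for illustration.

```python
def prevalence(df_news, df_tweets, n_articles, n_tweets):
    """Eq. (2): p(i) = (df(n) * |T|/|N| + df(t)) / (2|T|) for each i in I.

    df_news / df_tweets: dicts mapping a term to its document frequency in
    the news articles and in the tweets, respectively.
    """
    I = set(df_news) & set(df_tweets)  # Eq. (1): I = N ∩ T
    return {i: (df_news[i] * n_tweets / n_articles + df_tweets[i])
               / (2 * n_tweets)
            for i in I}
```

Note that the expression is algebraically equal to (df(n)/|N| + df(t)/|T|) / 2, which makes the half-and-half weighting explicit.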


The terms in set I are then ranked by their prevalence value, and only those in the top πth percentile are selected. Using a π value of 75 presented the best results in our experiments. We define the newly filtered set Itop using set-builder notation

Itop = { i ∈ I : (|Pi| / |I|) × 100 > π } (3)

where Pi = { j ∈ I : p(j) < p(i) } (4)

where |Pi| is the number of elements in subset Pi, which in turn represents the terms in I with a lower prevalence value than that of term i, and |I| is the total number of elements in set I. Itop now represents the subset of top key terms from date d1 to date d2, taking into account their prevalence in both news and social media.
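The percentile filter of Eqs. (3) and (4) can be sketched directly from the set-builder notation (a simple, unoptimized version for illustration):

```python
def top_percentile(p, pi=75):
    """Keep term i if more than pi% of the terms in I have a strictly
    lower prevalence value. p: dict mapping term -> prevalence."""
    total = len(p)
    return {i for i in p
            if 100 * sum(1 for j in p if p[j] < p[i]) / total > pi}
```

With π = 50 and four terms of distinct prevalence, only the top term survives, since the second-best beats exactly 50% of the terms and the inequality is strict.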

3) Key Term Similarity Estimation: Next, we must identify a relationship between the previously selected key terms in order to add the graph edges. The relationship used is the term co-occurrence in the tweet term set T. The intuition behind the co-occurrence is that terms that co-occur frequently are related to the same topic and may be used to summarize and represent it when grouped.

We define co-occurrence as two terms occurring in the same tweet. If term i ∈ Itop and term j ∈ Itop both appear in the same tweet, their co-occurrence is set to 1. For each additional tweet in which i and j appear together, their co-occurrence is incremented by 1. Itop is iterated through and the co-occurrence for each term pair {i, j} is found, defined as co(i, j). The term-pair co-occurrence is then used to estimate the similarity between terms.

As explained in our related work, several similarity measures were tested in our experiments, namely overlap [33], Dice [34], PMI [35], Jaccard [36], and cosine. Dice similarity, a variant of cosine similarity and Jaccard similarity, provided the best results in our experiments.

The Dice similarity between terms i and j is calculated as follows:

dice_QS(i, j) = 0 if co(i, j) ≤ ϑ; 2 × co(i, j) / (dftop(i) + dftop(j)) otherwise (5)

where dftop(i) is the number of tweets that contain term i ∈ Itop, dftop(j) is the number of tweets that contain term j ∈ Itop, and co(i, j) is the number of tweets in which terms i and j co-occur in Itop. ϑ is a threshold used to discard quotients of similarity that fall below it. Given the scale of and noise in social media data, it is possible that a pair of terms co-occurs purely by chance. In order to reduce the adverse effects of these co-occurrences, the quotient of similarity QS of two terms is set to zero if their co-occurrence value is less than ϑ. In our experiments, we set ϑ to 5, though this can be adjusted as needed.

For elaboration, we also show how the variant of the Jaccard similarity measure is computed (as exhibited in [32])

jacc_QS(i, j) = 0 if co(i, j) ≤ ϑ; co(i, j) / (dftop(i) + dftop(j) − co(i, j)) otherwise. (6)

Fig. 2. PDF of a typical set of QS values in Qtop.

Finally, the variant of the cosine similarity measure described by Chen et al. [31] is defined by the following equation:

cosine_QS(i, j) = 0 if co(i, j) ≤ ϑ; co(i, j) / √(dftop(i) × dftop(j)) otherwise. (7)

All of the previously described similarity measures produce a value between 0 and 1. Additionally, all QS values less than 0.01 are disregarded in order to reduce the effects of co-occurrences that are considered insignificant.
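The three thresholded measures of Eqs. (5)-(7) can be sketched as follows, with ϑ = 5 as in the experiments:

```python
import math

def dice_qs(co, df_i, df_j, theta=5):
    """Eq. (5): thresholded Dice similarity."""
    return 0.0 if co <= theta else 2 * co / (df_i + df_j)

def jacc_qs(co, df_i, df_j, theta=5):
    """Eq. (6): thresholded Jaccard similarity."""
    return 0.0 if co <= theta else co / (df_i + df_j - co)

def cosine_qs(co, df_i, df_j, theta=5):
    """Eq. (7): thresholded cosine similarity."""
    return 0.0 if co <= theta else co / math.sqrt(df_i * df_j)
```

For two terms each occurring in 20 tweets and co-occurring in 10, Dice and cosine both give 0.5 while Jaccard gives 1/3; a co-occurrence of 3 is zeroed out by the threshold.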

The vertices of the graph are now defined as the key terms that belong to set Itop, and the edges that connect them are defined by the co-occurrence of the terms in the tweet dataset. Using the terms' occurrence and co-occurrence values in the tweets, the relationship between vertices is further normalized by using a coefficient of similarity to represent an edge. We henceforth refer to the QS values of all term-pair combinations in Itop as set Qtop.

4) Outlier Detection: Even though many potentially unimportant terms have been excluded thus far, there are still too many terms (vertices) and co-occurrences (edges) in the graph. We wish to capture only the most significant term co-occurrences, that is, those with sufficiently high QS values. To identify significant edges in the graph, irregular co-occurrence values (outliers) must be differentiated from regular ones. Fig. 2 illustrates the probability density function (PDF) of a typical set of QS values in Qtop. This particular distribution has 8444 values.

It can be observed that most QS values lie near or below the mean, with those that are several standard deviations away from the mean being the most interesting ones. These values are considered outliers (i.e., they fall outside of the overall pattern of the rest of the data). We have tested several outlier detection methods and found that using the interquartile range (IQR) works well. The IQR of a given set of values is the difference between the third (Q3) and first (Q1) quartiles, as computed

IQR = Q3 − Q1. (8)

In other words, the IQR indicates how dispersed the middle 50% of the data are. As Hubert and Van der Veeken [44] described, the IQR can be used to aid in the detection of outliers by multiplying it by a coefficient of 1.5 and adding it to the third quartile or subtracting it from the first quartile.


Fig. 3. Ambiguous TCs in graph G.

In data that are normally distributed, the IQR would simply be multiplied by 1.5 in the equation. However, as can be seen in Fig. 2, our data are right-skewed, and thus an additional factor c must be included in the calculation to compensate for it. Even though Hubert and Van der Veeken [44] provided a complex method to compute this coefficient, for our needs we simply test several constant values and choose the one that yields the best results.

The set of outliers O is defined as

O = { q ∈ Qtop : q < Q1 − 1.5c × IQR or q > Q3 + 1.5c × IQR } (9)

where c is the coefficient by which 1.5 is multiplied. In our experiments, setting c = 4 provided the best results. We are only interested in outliers above the mean, so only the second condition of the previous formula is applied.
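The upper-tail filter can be sketched as below; the nearest-rank quartile rule is a simplification we assume for illustration, not necessarily the authors' exact quartile estimator.

```python
def upper_outliers(qs, c=4.0):
    """Keep only QS values above Q3 + 1.5*c*IQR (Eq. (9), upper tail only).

    qs: dict mapping a term pair -> QS value. Quartiles are taken with a
    simple nearest-rank rule.
    """
    vals = sorted(qs.values())
    n = len(vals)
    q1, q3 = vals[n // 4], vals[(3 * n) // 4]
    cutoff = q3 + 1.5 * c * (q3 - q1)
    return {pair: v for pair, v in qs.items() if v > cutoff}
```

Applied to a distribution dominated by small QS values, only the rare large values survive, mirroring the reduction from 8444 values to 301 described above.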

In the particular QS distribution of Qtop mentioned before, after applying the aforementioned outlier-detection method, only 301 values remain (of the original 8444). These 301 values represent the most significant term-pair combinations in Qtop. We henceforth refer to these pair combinations as Qsig; their individual terms are similarly referred to as Isig. The vertices in graph G are now the terms from Isig; its edges are the term pairs from Qsig, with their QS values being the edge weights. Fig. 3 shows an example of two clearly identifiable clusters in G that represent unique topics.

In some cases, as shown in Fig. 3, two or more TCs (subgraphs) appear to be joined because they share similar terms. In the figure, a topic related to the Amanda Knox murder trial is represented by the terms amanda, knox, murder, italy, conviction, extradition, meredith, and sollecito. Another topic, related to U.S. Governor Chris Christie's scandal, is represented by the terms chris, christie, govenor, lawyer, evidence, wildstien, lane, and authority. Although the term evidence is connected to the Chris Christie scandal, it is also connected to the term knox (and is also related to that topic). The term evidence, however, is connected to more vertices related to the Chris Christie scandal, and may be perceived as more important to this topic.

As seen in the previous example, some terms can belong to two or more topics, which leaves us with two options: 1) allowing terms to belong to multiple topics (overlapping clusters) or 2) allowing terms to belong to a single topic (nonoverlapping clusters). Since our goal is to clearly identify unambiguous TCs, we employ the second option. Furthermore, we hypothesize that for a specific moment in time, and given that the terms in Isig are highly informational and specific, they will be the most representative of a single topic. We must therefore find a way to eliminate the edges that cause some topics to be ambiguous and increase the cluster quality of G.

C. Graph Clustering

Once graph G has been constructed and its most significant terms (vertices) and term-pair co-occurrence values (edges) have been selected, the next goal is to identify and separate well-defined TCs (subgraphs) in the graph. Before explaining the graph clustering algorithm, the concepts of betweenness and transitivity must first be understood.

1) Betweenness: Matsuo et al. [38] proposed an efficient approach to achieve the clustering of co-occurrence graphs. They use a graph clustering algorithm called Newman clustering [39] to efficiently identify word clusters. The core idea behind Newman clustering is the concept of edge betweenness. The betweenness value of an edge is the number of shortest paths between pairs of nodes that run along it. If a network contains clusters that are loosely connected by a few intercluster edges, then all shortest paths between the different clusters must go along these edges. Consequently, the edges connecting the clusters will have high edge betweenness. Removing these edges iteratively should thus yield well-defined clusters.

As outlined by Brandes [45], the betweenness measure of an edge e is calculated as follows:

betweenness(e) = Σi,j∈V σ(i, j|e) / σ(i, j) (10)

where V is the set of vertices, σ(i, j) is the number of shortest paths between vertex i and vertex j, and σ(i, j|e) is the number of those paths that pass through edge e. By convention, 0/0 = 0.

2) Transitivity: Iwasaka and Tanaka-Ishii [37] developed a method that accomplishes the clustering of a co-occurrence graph based on a concept known as transitivity. Transitivity is a property of a relation among three elements such that if the relation holds between the first and second elements, and between the second and third elements, then it also holds between the first and third elements. The authors indicate that each output cluster is expected to have no ambiguity, and that this is only achieved when a graph's edges (representing co-occurrence relations) are transitive. Thus, a graph with higher global transitivity is considered to have better cluster quality than one with lower global transitivity. The transitivity of a graph G is defined as

transitivity(G) = #triangles / #triads (11)


Algorithm 1 Improve the Cluster Quality of a Graph
1: Input: Graph G
2: Output: Cluster-quality-improved G
3: B = {} ⊳ empty set
4: repeat
5:   for all (edge e ∈ G) do
6:     Calculate betweenness(e) and append to B
7:   end for
8:   if first iteration of loop then
9:     bavg = avg(B)
10:  end if
11:  bmax = max(B)
12:  trans0 = transitivity(G) ⊳ previous transitivity
13:  Remove edge with bmax from G
14:  trans1 = transitivity(G) ⊳ posterior transitivity
15:  Clear set B
16: until (trans1 < trans0 or bmax < bavg)
17: Add edge with bmax to G

where #triangles is the number of complete triangles (i.e., complete size-three subgraphs) in G and #triads is the number of triads (i.e., edge pairs connected to a shared vertex).
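The transitivity ratio of Eq. (11) can be sketched as a direct triad count. Note that each triangle is counted once per apex, which matches the usual global-transitivity convention (3 × triangles in the numerator):

```python
def transitivity(adj):
    """Eq. (11): ratio of closed triads to all triads.

    adj: dict mapping each node to its set of neighbours (undirected).
    """
    closed = triads = 0
    for v, nbrs in adj.items():
        for i in nbrs:
            for j in nbrs:
                if i < j:              # each neighbour pair once per apex v
                    triads += 1
                    if j in adj[i]:    # the pair is also directly connected
                        closed += 1
    return closed / triads if triads else 0.0
```

A triangle has transitivity 1.0 and a three-node path has transitivity 0.0, which matches the intuition that ambiguity-free clusters are fully transitive.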

3) Graph Clustering Algorithm: We apply the concepts of betweenness and transitivity in our graph clustering algorithm, which disambiguates potential topics. The process is outlined in Algorithm 1.

First, the betweenness values of all edges in graph G are calculated in lines 5–7. Then, the initial average betweenness of graph G is calculated in line 9; we wish for all edges to approach this betweenness. To achieve this, edges with high betweenness values are iteratively removed in order to separate clusters in the graph (line 13). It is worth pointing out that set B, which keeps track of all betweenness values in the graph, is emptied at the end of each iteration.

The edge-removing process is stopped when removing additional edges yields no benefit to the clustering quality of the graph. This occurs: 1) when the transitivity of G after removing an edge is lower than it was before or 2) when the maximum edge betweenness is less than the average, the latter indicating that the maximum betweenness is considered normal when compared to the average. Once the process has been stopped, the last-removed edge is added back to G.
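Algorithm 1 can be sketched as follows for an unweighted, undirected view of G. The edge betweenness of Eq. (10) is computed here with a Brandes-style accumulation (the paper cites Brandes [45]); the data layout and helper names are our own assumptions.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness; adj maps node -> set of neighbours."""
    bet = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:                            # BFS shortest-path counting
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):               # accumulate dependencies
            for v in preds[w]:
                contrib = sigma[v] / sigma[w] * (1 + delta[w])
                edge = (min(v, w), max(v, w))
                bet[edge] = bet.get(edge, 0.0) + contrib
                delta[v] += contrib
    return {e: b / 2 for e, b in bet.items()}   # each pair counted twice

def transitivity(adj):
    """Eq. (11): closed triads over all triads."""
    closed = triads = 0
    for v, nbrs in adj.items():
        for i in nbrs:
            for j in nbrs:
                if i < j:
                    triads += 1
                    closed += j in adj[i]
    return closed / triads if triads else 0.0

def improve_cluster_quality(adj):
    """Algorithm 1: remove the max-betweenness edge until transitivity
    drops or the max betweenness falls below the initial average; then
    restore the last-removed edge."""
    b_avg = None
    while True:
        b = edge_betweenness(adj)
        if b_avg is None:                       # line 9: initial average
            b_avg = sum(b.values()) / len(b)
        (u, v), b_max = max(b.items(), key=lambda kv: kv[1])
        t_before = transitivity(adj)
        adj[u].discard(v)                       # line 13: remove edge
        adj[v].discard(u)
        if transitivity(adj) < t_before or b_max < b_avg:
            adj[u].add(v)                       # line 17: add edge back
            adj[v].add(u)
            return adj
```

On two triangles joined by a single bridge, the bridge has the highest betweenness and its removal raises transitivity, so the sketch cuts exactly that edge and leaves both triangles intact.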

Fig. 4 shows two clearly distinct and unambiguous TCs after the graph clustering algorithm has been run. It can be observed that the edge that connected the terms knox and evidence (in addition to other edges) has been removed. Although the accuracy of the algorithm is not 100% (i.e., a few topics may remain ambiguous), its performance is satisfactory for our needs.

4) Time Complexity Analysis of Graph Clustering: In this section, we provide a brief time complexity analysis of the graph clustering algorithm described above. Our initial assumption is that, after performing the outlier detection step (Section III-B4) to obtain a reduced set of edges, graph G may be considered a sparse graph. In other words, the number

Fig. 4. Unambiguous TCs in graph G after running the cluster-quality-improvement algorithm.

of edges in G is approximately the same as the number of vertices, or |E| ≈ |V|.

First, we analyze the time complexity for calculating the betweenness of all edges in G (10). Given that G is sparse, Johnson's algorithm [46] may be used for calculating the shortest paths of all vertex pairs. This algorithm runs in O(|V|² log |V| + |V||E|) time, performing faster than the Floyd–Warshall algorithm [47], which solves the problem in O(|V|³). The shortest paths of all vertex pairs are first computed and stored in a data structure in order to avoid calculating them for every edge (lines 5–7).

Next, the average and maximum values for set B (lines 9 and 11, respectively) are calculated in O(|E|) time, since this set refers to the edges in G. The transitivity of G (11) is then calculated in O(|V|²) time, as this is the worst-case scenario for finding all triangles and triads in a graph. Finally, the large loop between lines 4 and 16 will iterate q times, where q is a number much smaller than the number of edges in G, or q ≪ |E|. This assumption comes from the fact that the number of edges in G is reduced by 1 after each iteration.

Overall, we have the following time complexity forAlgorithm 1:

O(

q((

|V|2 log |V| + |V||E|)

+ 2|E| + 2|V|2))

(12)

which can be reduced by selecting the largest factors, obtaining a final complexity of O(q(|V|² log |V| + |V||E|)).

D. Content Selection and Ranking

Now that the prevalent news TCs that fall within dates d1 and d2 have been identified, relevant content from the two media sources that is related to these topics must be selected and finally ranked. Related items from the news media will represent the MF of the topic. Similarly, related items from social media (Twitter) will represent the UA; more specifically, the number of unique Twitter users related to the selected tweets. Selecting the appropriate items (i.e., tweets and news articles) related to the key terms of a topic is not an easy task, as many other items unrelated to the desired topic also contain similar key terms.


1) Node Weighting: The first step before selecting appropriate content is to weight the nodes of each topic accordingly. These weights can be used to estimate which terms are more important to the topic and provide insight into which of them must be present in an item for it to be considered topic-relevant.

The weight of each node i in TC is calculated using (13), which utilizes the number of edges connected to a given node and their corresponding weights

∀i ∈ TC : nw(i) = (|Ei| / |ETC|)^λ × (Σe∈Ei we / Σf∈ETC wf)^(1−λ) (13)

where Ei is the set of edges connected to node i, ETC is the set of edges in TC, w is the weight of an edge e or f, and λ is a number between 0 and 1. In our experiments, λ is set to 0.5, as we wish to allot equal importance to the number of edges and their weights. The value of nw falls between 0 and 1, which provides a reasonable scale to compare node weights.
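The node weight of Eq. (13) can be sketched as below; representing a TC as a dict of weighted edges is our own choice for illustration.

```python
def node_weight(node, tc_edges, lam=0.5):
    """Eq. (13): nw(i) = (|Ei|/|ETC|)^lam * (sum of incident edge weights /
    sum of all TC edge weights)^(1-lam).

    tc_edges: dict {(u, v): weight} listing the undirected edges of one TC.
    """
    incident = [w for (u, v), w in tc_edges.items() if node in (u, v)]
    degree_part = len(incident) / len(tc_edges)
    weight_part = sum(incident) / sum(tc_edges.values())
    return degree_part ** lam * weight_part ** (1 - lam)
```

For the path a–b–c with both edge weights 0.5, the central node b gets weight 1.0 and each endpoint gets 0.5, reflecting b's larger share of edges and edge weight.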

We have compared the performance of our node weighting method with that of PageRank [16], and our method provided equal or better results. For elaboration, we provide a modified version of the PageRank formula below, using our symbology and undirected edges

∀i ∈ TC : nwpr(i) = (1 − d) + d Σj∈Vi (wij / Σk∈Vj wjk) · nwpr(j) (14)

where Vi and Vj are the sets of nodes that are connected to nodes i and j, respectively, and wij represents the weight of the edge between nodes i and j. The constant d represents the damping factor, which is usually set around 0.85.

Now that the nodes of each TC are weighted according to their estimated importance to that cluster, the weights are used to select items from the two data sources. Given that the way in which items are selected from tweets and news is different, we describe separately how node weights are used to select those items in the following two sections.

2) User Attention Estimation: To calculate the UA measure of a TC, the tweets related to that topic are first selected and then the number of unique users who created those tweets is counted. To ensure that the tweets are genuinely related to TC, the weight of each node in TC is utilized.

A threshold τ is first selected, which is used to specify the sum of node combinations that will be acceptable for tweet inclusion. For instance, say there is a TC with nodes i1, i2, i3, and i4, with respective weights 0.1, 0.2, 0.5, and 0.6. If τ is set to 0.4, then tweets only containing terms i1 and i2 or a combination of the two would not be considered in relation to TC. This is because neither i1's weight nor i2's weight alone, which are 0.1 and 0.2, respectively, nor the sum of their weights, is greater than or equal to τ. In the given case, only tweets containing terms i3 and i4 or a combination of the two would be considered.

Although it is acceptable to set τ to a fixed value, which presented good results in our experiments, we set τ dynamically. This makes our method more robust by not limiting τ

Fig. 5. Boxplot of the summed node weights of valid combinations of a TC discovered on a particular date.

to a particular data set. The value of τ is based on where the largest increase in the summed weights occurs when making node combinations. Node combinations are made only between nodes that share an edge. Thus, if there is an edge [i1, i2] and an edge [i2, i3], but edge [i1, i3] does not exist, then [i1, i3] is not considered a valid combination. Only [i1, i2], [i2, i3], or [i1, i2, i3] would be considered valid combinations. With this in mind, we find the distribution of the summed weights of all valid combinations. An example of this distribution can be seen in Fig. 5.

As can be seen in the box plot, most of the values fall below the median and even below the first quartile. This indicates that there are a few larger values that influence the mean. Hence, we infer that smaller values are less important and that the threshold τ should lie somewhere within the first quartile. We further deduce that the consecutive values in the first quartile with the largest difference between them provide a fair estimate of where the threshold should lie. τ is thus set to the lesser of the two values with the largest difference.
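The dynamic choice of τ described above can be sketched as follows. This is a loose interpretation under stated assumptions: a nearest-rank first quartile, and at least two values kept so that a gap exists.

```python
def dynamic_tau(summed_weights):
    """Set tau to the lesser of the two consecutive first-quartile values
    with the largest gap between them (sketch).

    summed_weights: summed node weights of all valid combinations of a TC.
    """
    vals = sorted(summed_weights)
    q1 = vals[: max(2, len(vals) // 4)]   # values within the first quartile
    gap, tau = max((q1[k + 1] - q1[k], q1[k]) for k in range(len(q1) - 1))
    return tau
```

For a first quartile of [0.10, 0.12, 0.30], the largest gap (0.18) lies between 0.12 and 0.30, so τ is set to 0.12, the lesser of the pair.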

All valid combinations are found for a given TC, and tweets that contain terms from combinations whose summed weights are greater than τ are selected from the data sets. The number of unique users who created these tweets is then counted, which directly represents the UA of TC. Because our main objective is ranking, the number of unique users related to TC must be compared to that of all other TCs discovered. This is achieved by employing a simple method, as can be seen in the following equation:

∀TC ∈ G : UA(TC) = |UTC| / ΣTC∈G |UTC| (15)

where |UTC| is the number of unique users related to TC and G is the entire graph. In summary, the UA of a TC in graph G is equal to the total number of unique users related to that TC divided by the sum of the total unique users related to all TCs in G. This equation produces a value between 0 and 1.

3) Media Focus Estimation: To estimate the MF of a TC, the news articles that are related to TC are first selected. This presents a problem similar to the selection of tweets when calculating UA. The weighted nodes of TC are used to accurately select the articles that are genuinely related to its inherent topic. The only difference now is that instead of comparing node combinations with tweet content, they are compared to the top k keywords selected from each article. Comparing the term combinations with the actual content of the article is


unnecessary, as the reason for selecting the top k keywords in the first place was to gain insight into the most important terms of each article. Furthermore, the terms that make up the TCs are drawn from this set of keywords.

In summary, we must now select articles whose top k keywords are made up of valid node combinations (the sum of their node weights must be greater than or equal to τ). After selecting these articles, the MF value of each TC is calculated

∀TC ∈ G : MF(TC) = |ATC| / ΣTC∈G |ATC| (16)

where |ATC| is the number of news articles related to TC and G is the entire graph. The MF of a TC in graph G is therefore equal to the total number of news articles related to that TC (ATC) divided by the sum of the total number of articles related to all TCs in G. This equation also produces a value between 0 and 1.

4) User Interaction Estimation: The next step is to measure the degree of interaction between the users responsible for the social media content creation related to a specific topic. The database is queried for "followed" relationships for each of these users, and a social network graph is constructed from these relationships. In Twitter, there are two types of relationships: 1) "follow" and 2) "followed." In the first type, user u1 can follow user u2, in which case the direction of the relationship flows from u1 to u2. In the second case, user u1 can be followed by user u2; the direction of the relationship now flows from u2 to u1.

In this paper, we are solely interested in the followed relationships of a particular user (i.e., the second case), as these represent the degree of interest other users have in the content created by the user at hand. Based on the followed relationships that are queried, an undirected graph STC is constructed, whose set of nodes UTC represents the users and whose set of edges RTC represents the users' followed relationships. The complex relationships of the users in STC are used to measure the level of interaction that potentially exists between them. Before calculating the level of UI, the concept of reciprocity is first presented.

a) Reciprocity: Kwan et al. [17] proposed a measure called reciprocity, which attempts to detect the interaction between social-media users and perceive user engagement in relation to a particular topic. Reciprocity is defined as the ratio of all edges in a social graph to the possible number of edges it can have. Reciprocity is a useful indicator of the degree of mutuality and reciprocal exchange in a network, which in turn relates to social cohesion. Higher reciprocity means greater interaction between users, and our intuition is that topics with higher reciprocity are more relevant, as this signifies a community structure. In other words, it is easier to identify the power and influence of a well-structured community than that of a decentralized and unstructured one.

A variant of the reciprocity measure is employed in our method, and is defined as follows:

reciprocity(STC) = 2 × |RTC| / (|UTC| − 1) (17)

where |UTC| is the total number of nodes in the social graph, |RTC| is its total number of edges, and reciprocity(STC) is the reciprocity of social graph STC representing TC. Equation (17) is the simplified version of the formula, as the original one divides the total number of edges by all possible edges and then multiplies this figure by the total number of nodes for scaling purposes.

b) User interaction: The value of reciprocity(STC) is directly interpreted as the UI of the topic, which is calculated using the social-user relationships of the topic. Following our previous convention, the value of UI is normalized by using

∀TC ∈ G : UI(TC) = reciprocity(STC) / ΣTC∈G reciprocity(STC). (18)

5) Overall Ranking: Now that the MF, UA, and UI measures have been defined, their values are combined into a single metric r. This metric represents the overall ranking of a topic. Equation (19) is employed to calculate the overall ranking of a TC

∀TC ∈ G : r(TC) = (MF(TC))^α (UA(TC))^β (UI(TC))^γ (19)

where α + β + γ = 1. In our experiments, all three exponents are set to the same value (1/3), though each factor in the equation can be weighted differently depending on the application at hand. The TCs are then sorted in descending order by their r value.
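The weighted geometric combination of Eq. (19) can be sketched as below, with the three normalized factor scores passed in as dicts (our own data layout, for illustration):

```python
def rank_topics(mf, ua, ui, alpha=1/3, beta=1/3, gamma=1/3):
    """Eq. (19): r(TC) = MF^alpha * UA^beta * UI^gamma, with inputs already
    normalized over all TCs per Eqs. (15), (16), and (18). Returns the TC
    identifiers sorted by r in descending order."""
    r = {tc: mf[tc] ** alpha * ua[tc] ** beta * ui[tc] ** gamma
         for tc in mf}
    return sorted(r, key=r.get, reverse=True)
```

Because the factors are multiplied, a topic scoring near zero on any one factor is ranked low regardless of the others, which is the intended effect of combining MF, UA, and UI jointly rather than additively.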

IV. EXPERIMENTS AND RESULTS

The testing dataset consists of tweets crawled from the Twitter public timeline and news articles crawled from popular news websites during the period between November 1, 2013 and February 28, 2014. The news websites crawled were cnn.com, bbc.com, cbsnews.com, reuters.com, abcnews.com, and usatoday.com. Over the specified period of time, a total of 105 856 news articles and 175 044 074 multilingual tweets were collected. After non-English tweets were discarded, 71 731 730 tweets remained. The dataset was divided into two partitions.

1) Data from January and February 2014 were used as the testing dataset, on which experiments were performed for the overall method evaluation.

2) Data from November and December 2013 were used as the control dataset, where experiments were performed to establish adequate thresholds and select the measures that presented the best results.

A. Method Evaluation

The evaluation of topic ranking is quite challenging, as the interpretation of the results is generally subjective. However, in an attempt to show that the ranked topics are indeed those that users would prefer, a method for ranking popular news topics must be established.

To retrieve the most popular topics, Google's news aggregation service [48] was utilized. For each day from November 1, 2013 to February 28, 2014, the top 10 news stories displayed on this site were collected at the end of the day.


TABLE I
SOME STATISTICS RELEVANT TO THE TESTING DATASET

Next, 20 master’s and doctoral students were asked to view the titles of the top 10 news stories from each day and select the ones they considered relevant. Each participant was required to select a minimum of two articles per day and a maximum of all 10.

The participants’ results were then partitioned into 12 date ranges: 1) November 1, 2013–November 10, 2013; 2) November 11, 2013–November 20, 2013; 3) November 21, 2013–November 30, 2013; 4) December 1, 2013–December 10, 2013; 5) December 11, 2013–December 20, 2013; 6) December 21, 2013–December 31, 2013; 7) January 1, 2014–January 10, 2014; 8) January 11, 2014–January 20, 2014; 9) January 21, 2014–January 31, 2014; 10) February 1, 2014–February 10, 2014; 11) February 11, 2014–February 20, 2014; and 12) February 21, 2014–February 28, 2014. As explained earlier, the first six date ranges were utilized for the controlled experiments and the last six for the method evaluation.

The stories for each range were manually grouped into topics, where each topic was given a score for its corresponding time range. The score for each topic is calculated by multiplying the number of times the participants voted for the stories in the topic by the total number of stories in it. For instance, if in the time range January 1, 2014–January 10, 2014 a topic contained five stories (which of course occurred on five different days within that time range), and the total votes for those stories was 15, then the score of that topic would be 75.
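The scoring rule above amounts to a single multiplication; a minimal sketch (the function name is ours):

```python
def topic_score(total_votes, num_stories):
    # Score = (total participant votes for the topic's stories)
    #       x (number of stories grouped into the topic)
    return total_votes * num_stories

# Worked example from the text: 5 stories with 15 total votes
score = topic_score(15, 5)  # 75
```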

The topics were finally ranked in descending order of their score, resulting in a ranked topic list for each time range. We will hereafter refer to this ranked list of topics as voted topics. For each of the aforementioned time ranges, we set the start (d1) and end (d2) dates accordingly and executed SociRank.

Some statistics relevant to the news topics extracted for the testing dataset (i.e., from January and February 2014) are shown in Table I. The second column in the table indicates the total number of topics extracted per time period. The third, fourth, and fifth columns show the average number of tweets, news stories, and Twitter users associated with each topic, respectively.

1) Topic Selection Evaluation: First, we evaluate the topic-selection completeness of SociRank and the media-focus-only approach3 (hereafter referred to as MF). To represent the MF method, we simply manipulate (19) by setting β and γ to 0

3The results for UA and UI alone are not shown in our experiments, as these have no topic-specific significance without the MF component. In other words, without the element of MF, any identified topics would only reflect what is popular among users, without necessarily being related to the media (news).

Fig. 6. Percentage of overlap between all voted topics and all topics selected by SociRank and MF.

and α to 1. To evaluate completeness, all of the topics (per time range) selected by SociRank and by the MF approach were compared with those in the voted topics.

Fig. 6 shows the percentage of topics selected by SociRank and by MF that overlap with the voted topics. It can be seen that SociRank clearly outperforms MF in terms of overlap with the voted topics (i.e., those topics that users selected as the most important). This indicates that SociRank is better at discovering prevalent news topics that users find interesting when compared with a method that only utilizes data from the news media.

2) Topic Ranking Evaluation: Next, we evaluate the ranking of topics using the SociRank and MF ranking formulas, selecting only the top k topics from each approach. Again, we compare the ranked topics for each time range with the voted topics. We evaluated the top 10, 20, 30, and 40 topics for both methods and calculated the percentage of topics that overlapped with the top 10, 20, 30, and 40 voted topics, respectively. For instance, if we are comparing the top 10 topics using SociRank with the voted topics, and five topics overlap, then the overlap percentage would be 5/10.
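The overlap metric described above can be sketched as (names are illustrative):

```python
def overlap_at_k(ranked_topics, voted_topics, k):
    """Fraction of the top-k automatically ranked topics that also
    appear among the top-k voted topics."""
    return len(set(ranked_topics[:k]) & set(voted_topics[:k])) / k

# Example from the text: 5 of the top 10 topics overlap -> 0.5
```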

Fig. 7 shows the percentage of overlap (for each of the time ranges) between the top 10 voted topics and the top 10 topics using the SociRank and MF approaches. In the figure, the green line represents the MF overlap percentage plus one standard deviation. If the SociRank bar surpasses this line, it indicates that the difference between the two overlap percentages is large enough for the improvement to be considered significant. However, as can be seen in the figure, this does not occur: there is no significant difference between the two methods in the top 10 ranked list.

In the top 20, 30, and 40 ranked lists, however, the results favored SociRank, as can be seen in Figs. 8–10, respectively. In each of these lists, the overlap percentage of SociRank significantly surpasses that of the MF approach. Summarizing our findings, Fig. 11 presents the average overlap percentages of the top 10, 20, 30, and 40 ranked lists for both methods. The figure clearly illustrates that, with the exception of


Fig. 7. Percentage of overlap between top 10 voted topics and top 10 topics selected by SociRank and MF.

Fig. 8. Percentage of overlap between top 20 voted topics and top 20 topics selected by SociRank and MF.

the top 10 list, SociRank outperformed the MF method by a significant margin.

B. Controlled Experiments

In this section, we show experiments performed on some of the variables and components used in our method, namely, the co-occurrence measure, the IQR coefficient, and node weighting. For the controlled experiments, the control dataset was divided into six partitions, with each partition representing ten days’ worth of news and tweets. The time periods tested were those corresponding to November and December 2013.

1) Co-Occurrence Measure: In Section III-B3, it was explained that five similarity measures were tested: 1) Dice; 2) cosine; 3) Jaccard; 4) PMI; and 5) overlap. To determine which similarity measure performs best, all parameters in SociRank were controlled, with the exception of the similarity measure.

Fig. 9. Percentage of overlap between top 30 voted topics and top 30 topics selected by SociRank and MF.

Fig. 10. Percentage of overlap between top 40 voted topics and top 40 topics selected by SociRank and MF.

In order to measure the effectiveness of each similarity measure, the number of correctly identified topics that each measure produced was counted. Fig. 12 shows the number of correctly identified topics for each similarity measure, grouped by time period. As can be seen in the figure, the Dice similarity measure produced the greatest number of correctly identified topics. The closest competitor to Dice was the cosine similarity measure, but even this measure fell noticeably short of producing similar numbers. The other measures performed poorly on our dataset, and in some instances were unable to identify any topics at all. The Dice similarity measure was therefore selected as the most fit for our needs. Depending on the dataset used, however, other similarity measures may potentially produce similar or better results than Dice.
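For reference, the compared measures are standard set-overlap statistics over term frequencies. A minimal sketch over document-frequency counts (PMI is omitted for brevity; the paper's exact formulations appear in Section III-B3, which is not reproduced in this excerpt):

```python
import math

def dice(df_a, df_b, df_ab):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    return 2 * df_ab / (df_a + df_b)

def cosine(df_a, df_b, df_ab):
    # Cosine over co-occurrence counts: |A ∩ B| / sqrt(|A| * |B|)
    return df_ab / math.sqrt(df_a * df_b)

def jaccard(df_a, df_b, df_ab):
    # Jaccard index: |A ∩ B| / |A ∪ B|
    return df_ab / (df_a + df_b - df_ab)

def overlap(df_a, df_b, df_ab):
    # Overlap (Szymkiewicz-Simpson) coefficient: |A ∩ B| / min(|A|, |B|)
    return df_ab / min(df_a, df_b)
```

All four lie in [0, 1] and agree when the two terms have equal frequencies, but they differ in how strongly they penalize frequency imbalance, which is what makes their topic-identification behavior diverge in practice.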

2) IQR Coefficient: To estimate the most appropriate value for the coefficient c in (9), several values were tested, starting from c = 1 and taking increments of 1. The previously listed


Fig. 11. Average percentage of overlap between top k voted topics and top k topics selected by SociRank and MF.

Fig. 12. Evaluation of different co-occurrence similarity measures used as edge weights in graph G (Section III-B3).

time periods were utilized, and all other parameters besides c were controlled. For each value of c, the number of correctly identified topics was counted, and the value that yielded the highest number was considered the most effective.

Fig. 13 shows the number of correctly identified topics for different values of c. It can be seen that c = 4 yielded the best results. When c was in the range between 1 and 4, the number of correctly identified topics increased with each increment of 1, but setting c = 5 decreased the count significantly. Thus, c = 4 was subsequently selected as the optimum value in our experiments.
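Since (9) is not reproduced in this excerpt, the sketch below assumes the standard IQR outlier rule, with the upper cutoff at Q3 + c·IQR; this is an assumption about the form of (9), not a quotation of it:

```python
import statistics

def iqr_upper_cutoff(values, c=4):
    """Upper outlier threshold Q3 + c * IQR.

    c = 4 was the best-performing coefficient in the paper's tuning.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # Q1, median, Q3
    return q3 + c * (q3 - q1)
```

Larger c admits fewer outliers; the experiments suggest c below 4 discards too much, while c = 5 is too permissive for this dataset.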

3) Node Weighting: The node-weighting formula (13) presented in Section III-D1 takes into account both the number of edges connected to a node and their weights. A more naive approach might be to utilize only the number of edges connected to a node as a representation of its weight.

Fig. 13. Evaluation of different values for IQR coefficient c in the outlier detection formula (9).

We compared both of these approaches to the PageRank weighting algorithm [16] to ascertain which approach provided the best results, as shown in Fig. 14. In the figure, weighting the nodes simply by the number of edges is referred to as “edge,” and weighting by both the number of edges and their weights (13) is referred to as “edge-weight.”

Recalling from Section III-D1, node weighting is used to correctly select related content from Twitter. We therefore use this same measure to evaluate the effectiveness of each node-weighting approach. The bars in Fig. 14 represent the number of correctly selected tweets by each node-weighting approach, and the line graph represents the precision of each approach. We calculate the precision using the following formula:

precision = R+ / (R+ + R−)    (20)

where R+ is the number of correct records retrieved and R− is the number of incorrect records retrieved.

In our experiments, 30 random topics from the aforementioned time periods were selected (five from each period), and each node-weighting approach was used to retrieve tweets related to those topics. Only 10% of the tweets were used. As can be seen in Fig. 14, the precision with which correct tweets were selected was the same or differed very little in most cases, regardless of the approach. Although the differences in precision were negligible, however, the same cannot be said about the number of correct tweets selected. Using the edge-weight approach (13) yielded a far greater number of correctly selected tweets than the other two approaches. Selecting a higher number of correctly related tweets at a similar or greater precision than the other two approaches is preferable, so the nodes in the TCs are ultimately weighted using our approach.
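A minimal sketch of the precision measure (20) and the two compared weighting schemes. Since (13) is not reproduced in this excerpt, summing incident edge weights is our assumption for the edge-weight scheme, and the graph layout is illustrative:

```python
def precision(num_correct, num_incorrect):
    # Eq. (20): R+ / (R+ + R-)
    return num_correct / (num_correct + num_incorrect)

# Adjacency map: node -> {neighbor: co-occurrence edge weight}
graph = {"term_a": {"term_b": 0.6, "term_c": 0.3}}

def weight_edge_only(graph, node):
    # Naive "edge" scheme: just the count of incident edges
    return len(graph[node])

def weight_edge_weight(graph, node):
    # "Edge-weight" scheme, sketched as the sum of incident edge
    # weights, so both degree and edge strength contribute
    return sum(graph[node].values())
```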

C. Comparison of Ranked Topic Lists

Lastly, we provide an example of a ranked list of topics produced by the MF approach and one produced by SociRank.


Fig. 14. Evaluation of different node-weighting approaches (Section III-D1).

TABLE II
COMPARISON OF RANKED TOPIC LISTS PRODUCED BY THE MF AND SOCIRANK METHODS

Table II shows the top 10 topics obtained by the MF and SociRank approaches for the February 1, 2014–February 10, 2014 time period. In the table, the cells with the new topics are shaded in green and the cells with the topics that were removed from the MF list are shaded in red. As can be seen, using all three factors produced a list with three new topics that did not appear in the MF list. Additionally, many of the topics in the SociRank list either moved up or down in position as compared with the MF list.

This example shows that SociRank produces a very different ranked list of news topics, which may signify that relying only on high-frequency news topics provided by the media does not necessarily give insight into what users are interested in or consider important. Utilizing the UA and UI factors can also uncover popular topics that were previously not considered significant by the mass media, such as the three topics shaded in green in Table II. This information can prompt media sources to give continuity or more importance to certain news stories that were perhaps thought of as unpopular.

Taking all results into consideration emphasizes the point that MF alone is a substandard estimator of what users find interesting or consider important, and should therefore not be used in this way. SociRank, on the other hand, proves to be more capable of performing this task, and so we conclude that the information provided by SociRank can prove vital in commerce-based areas where the interest of users is paramount.

V. CONCLUSION

In this paper, we proposed an unsupervised method, SociRank, which identifies news topics prevalent in both social media and the news media, and then ranks them by taking into account their MF, UA, and UI as relevance factors. The temporal prevalence of a particular topic in the news media is considered the MF of the topic, which gives us insight into its mass media popularity. The temporal prevalence of the topic in social media, specifically Twitter, indicates user interest, and is considered its UA. Finally, the interaction between the social media users who mention the topic indicates the strength of the community discussing it, and is considered the UI. To the best of our knowledge, no other work has attempted to employ either the interests of social media users or their social relationships to aid in the ranking of topics.

Consolidated, filtered, and ranked news topics from both professional news providers and individuals have several benefits. One of their main uses is increasing the quality and variety


of news recommender systems, as well as discovering hidden, popular topics. Our system can aid news providers by providing feedback on topics that have been discontinued by the mass media but are still being discussed by the general population. SociRank can also be extended and adapted to topics other than news, such as science, technology, sports, and other trends.

We have performed extensive experiments to test the performance of SociRank, including controlled experiments for its different components. SociRank has been compared to media-focus-only ranking by utilizing results obtained from a manual voting method as the ground truth. In the voting method, 20 individuals were asked to rank topics from specified time periods based on their perceived importance. The evaluation provides evidence that our method is capable of effectively selecting prevalent news topics and ranking them based on the three previously mentioned measures of importance. Our results present a clear distinction between ranking topics by MF only and ranking them by including UA and UI. This distinction provides a basis for the importance of this paper, and clearly demonstrates the shortcomings of relying solely on the mass media for topic ranking.

As future work, we intend to perform experiments and expand SociRank on different areas and datasets. Furthermore, we plan to include other forms of UA, such as search engine click-through rates, which can also be integrated into our method to provide even more insight into the true interests of users. Additional experiments will also be performed on different stages of the methodology. For example, a fuzzy clustering approach could be employed in order to obtain overlapping TCs (Section III-C). Lastly, we intend to develop a personalized version of SociRank, where topics are presented differently to each individual user.

REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Jan. 2003.

[2] T. Hofmann, “Probabilistic latent semantic analysis,” in Proc. 15th Conf. Uncertainty Artif. Intell., 1999, pp. 289–296.

[3] T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. 22nd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, Berkeley, CA, USA, 1999, pp. 50–57.

[4] C. Wartena and R. Brussee, “Topic detection by clustering keywords,” in Proc. 19th Int. Workshop Database Expert Syst. Appl. (DEXA), Turin, Italy, 2008, pp. 54–58.

[5] F. Archetti, P. Campanelli, E. Fersini, and E. Messina, “A hierarchical document clustering environment based on the induced bisecting k-means,” in Proc. 7th Int. Conf. Flexible Query Answering Syst., Milan, Italy, 2006, pp. 257–269. [Online]. Available: http://dx.doi.org/10.1007/11766254_22

[6] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.

[7] M. Cataldi, L. Di Caro, and C. Schifanella, “Emerging topic detection on Twitter based on temporal and social terms evaluation,” in Proc. 10th Int. Workshop Multimedia Data Min. (MDMKDD), Washington, DC, USA, 2010, Art. no. 4. [Online]. Available: http://doi.acm.org/10.1145/1814245.1814249

[8] W. X. Zhao et al., “Comparing Twitter and traditional media using topic models,” in Advances in Information Retrieval. Heidelberg, Germany: Springer, 2011, pp. 338–349.

[9] Q. Diao, J. Jiang, F. Zhu, and E.-P. Lim, “Finding bursty topics from microblogs,” in Proc. 50th Annu. Meeting Assoc. Comput. Linguist. Long Papers, vol. 1, 2012, pp. 536–544.

[10] H. Yin, B. Cui, H. Lu, Y. Huang, and J. Yao, “A unified model for stable and temporal topic detection from social media data,” in Proc. IEEE 29th Int. Conf. Data Eng. (ICDE), Brisbane, QLD, Australia, 2013, pp. 661–672.

[11] C. Wang, M. Zhang, L. Ru, and S. Ma, “Automatic online news topic ranking using media focus and user attention based on aging theory,” in Proc. 17th Conf. Inf. Knowl. Manag., Napa County, CA, USA, 2008, pp. 1033–1042.

[12] C. C. Chen, Y.-T. Chen, Y. Sun, and M. C. Chen, “Life cycle modeling of news events using aging theory,” in Machine Learning: ECML 2003. Heidelberg, Germany: Springer, 2003, pp. 47–59.

[13] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling, “TwitterStand: News in tweets,” in Proc. 17th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Seattle, WA, USA, 2009, pp. 42–51.

[14] O. Phelan, K. McCarthy, and B. Smyth, “Using Twitter to recommend real-time topical news,” in Proc. 3rd Conf. Recommender Syst., New York, NY, USA, 2009, pp. 385–388.

[15] K. Shubhankar, A. P. Singh, and V. Pudi, “An efficient algorithm for topic ranking and modeling topic evolution,” in Database Expert Syst. Appl., Toulouse, France, 2011, pp. 320–330.

[16] S. Brin and L. Page, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Comput. Netw., vol. 56, no. 18, pp. 3825–3833, 2012.

[17] E. Kwan, P.-L. Hsu, J.-H. Liang, and Y.-S. Chen, “Event identification for social streams using keyword-based evolving graph sequences,” in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Min., Niagara Falls, ON, Canada, 2013, pp. 450–457.

[18] K. Kireyev, “Semantic-based estimation of term informativeness,” in Proc. Human Language Technol. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguist., 2009, pp. 530–538.

[19] G. Salton, C.-S. Yang, and C. T. Yu, “A theory of term importance in automatic text analysis,” J. Amer. Soc. Inf. Sci., vol. 26, no. 1, pp. 33–44, 1975.

[20] H. P. Luhn, “A statistical approach to mechanized encoding and searching of literary information,” IBM J. Res. Develop., vol. 1, no. 4, pp. 309–317, 1957.

[21] J. D. Cohen, “Highlights: Language- and domain-independent automatic indexing terms for abstracting,” J. Amer. Soc. Inf. Sci., vol. 46, no. 3, pp. 162–174, 1995.

[22] Y. Matsuo and M. Ishizuka, “Keyword extraction from a single document using word co-occurrence statistical information,” Int. J. Artif. Intell. Tools, vol. 13, no. 1, pp. 157–169, 2004.

[23] R. Mihalcea and P. Tarau, “TextRank: Bringing order into texts,” in Proc. EMNLP, vol. 4, Barcelona, Spain, 2004.

[24] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning, “KEA: Practical automatic keyphrase extraction,” in Proc. 4th ACM Conf. Digit. Libr., Berkeley, CA, USA, 1999, pp. 254–255.

[25] P. D. Turney, “Learning algorithms for keyphrase extraction,” Inf. Retrieval, vol. 2, no. 4, pp. 303–336, 2000.

[26] J. Wang, H. Peng, and J.-S. Hu, “Automatic keyphrases extraction from document using neural network,” in Advances in Machine Learning and Cybernetics. Heidelberg, Germany: Springer, 2006, pp. 633–641.

[27] T. Jo, M. Lee, and T. M. Gatton, “Keyword extraction from documents using a neural network model,” in Proc. Int. Conf. Hybrid Inf. Technol. (ICHIT), vol. 2, 2006, pp. 194–197.

[28] K. Sarkar, M. Nasipuri, and S. Ghose, “A new approach to keyphrase extraction using neural networks,” Int. J. Comput. Sci. Issues, vol. 7, no. 3, pp. 16–25, Mar. 2010.

[29] C. Zhang, “Automatic keyword extraction from documents using conditional random fields,” J. Comput. Inf. Syst., vol. 4, no. 3, pp. 1169–1180, 2008.

[30] G. Figueroa and Y.-S. Chen, “Collaborative ranking between supervised and unsupervised approaches for keyphrase extraction,” in Proc. Conf. Comput. Linguist. Speech Process. (ROCLING), 2014, pp. 110–124.

[31] H.-H. Chen, M.-S. Lin, and Y.-C. Wei, “Novel association measures using Web search with double checking,” in Proc. 21st Int. Conf. Comput. Linguist. 44th Annu. Meeting Assoc. Comput. Linguist., 2006, pp. 1009–1016.

[32] D. Bollegala, Y. Matsuo, and M. Ishizuka, “Measuring semantic similarity between words using Web search engines,” in Proc. WWW, Banff, AB, Canada, 2007, pp. 757–766.

[33] D. Szymkiewicz, “Etude comparative de la distribution florale,” Rev. Forest, vol. 1, 1926.


[34] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.

[35] K. W. Church and P. Hanks, “Word association norms, mutual information, and lexicography,” Comput. Linguist., vol. 16, no. 1, pp. 22–29, Mar. 1990. [Online]. Available: http://dl.acm.org/citation.cfm?id=89086.89095

[36] P. Jaccard, Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Lausanne, Switzerland: Impr. Corbaz, 1901.

[37] H. Iwasaka and K. Tanaka-Ishii, “Clustering co-occurrence graph based on transitivity,” presented at the 5th Workshop Very Large Corpora, 1997, pp. 91–100.

[38] Y. Matsuo, T. Sakaki, K. Uchiyama, and M. Ishizuka, “Graph-based word clustering using a Web search engine,” in Proc. Conf. Empir. Methods Nat. Lang. Process., 2006, pp. 542–550.

[39] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proc. Nat. Acad. Sci., vol. 99, no. 12, pp. 7821–7826, 2002.

[40] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, no. 6, 2004, Art. no. 066133.

[41] Twitter. [Online]. Available: http://www.twitter.com, accessed Feb. 2014.

[42] K. Gimpel et al., “Part-of-speech tagging for Twitter: Annotation, features, and experiments,” in Proc. 49th Annu. Meeting Assoc. Comput. Linguist. Human Lang. Technol. Short Papers, vol. 2, Portland, OR, USA, 2011, pp. 42–47.

[43] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inf. Theory, vol. IT-13, no. 2, pp. 260–269, Apr. 1967.

[44] M. Hubert and S. Van der Veeken, “Outlier detection for skewed data,” J. Chemometr., vol. 22, nos. 3–4, pp. 235–246, 2008.

[45] U. Brandes, “On variants of shortest-path betweenness centrality and their generic computation,” Soc. Netw., vol. 30, no. 2, pp. 136–145, 2008.

[46] D. B. Johnson, “Efficient algorithms for shortest paths in sparse networks,” J. ACM, vol. 24, no. 1, pp. 1–13, Jan. 1977. [Online]. Available: http://doi.acm.org/10.1145/321992.321993

[47] R. W. Floyd, “Algorithm 97: Shortest path,” Commun. ACM, vol. 5, no. 6, p. 345, Jun. 1962. [Online]. Available: http://doi.acm.org/10.1145/367766.368168

[48] Google News. [Online]. Available: http://news.google.com, accessed Feb. 2014.

Derek Davis was born in Belize City, Belize, in 1984. He received the B.Sc. degree in information technologies from the University of Belize, Belmopan, Belize, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan.

He was a Systems Analyst/Database Administrator for a Belizean crude oil drilling company. He is currently a Systems Analyst in the banking and finance industry. His research interests include keyword extraction, summarization, and natural language processing.

Gerardo Figueroa was born in Tegucigalpa, Honduras, in 1984. He received the B.Sc. degree in computational systems engineering from Universidad Tecnologica Centroamericana, Tegucigalpa, and the M.Sc. degree in information systems and applications from National Tsing Hua University, Hsinchu, Taiwan, where he is currently pursuing the Ph.D. degree in information systems and applications.

From 2011 to 2015, he was a Teaching Assistant with the Institute of Information Systems and Applications and the Department of Computer Science, National Tsing Hua University. His current research interests include keyword extraction and summarization, natural language processing, and machine learning.

Yi-Shin Chen received the B.A. and M.B.A. degrees in information management from National Central University, Taoyuan, Taiwan, in 1996 and 1997, respectively, and the Ph.D. degree in computer science from the University of Southern California, Los Angeles, CA, USA, in 2002.

She joined the Department of Computer Science, National Tsing Hua University, as an Assistant Professor in 2004. She is passionate about increasing the benefits to society through her research efforts. Her goal is to create a social media interface that explores and visualizes Web data easily. Her current research interests include Web intelligence and integration.