ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags

Unsupervised semantic clustering of Twitter hashtags

Introduction

Methodology

Future work

Universitat Rovira i Virgili. ITAKA research group - Intelligent Technologies for Advanced Knowledge Acquisition

Dept. of Computer Science and Mathematics. Avda. Països Catalans, 26. 43007 Tarragona, Spain.

{carlos.vicient, antonio.moreno}@urv.cat

Using this matrix, the similarity between concepts is calculated using the Wu-Palmer [2] distance.
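As a rough illustration, the Wu-Palmer measure scores two concepts a and b as 2*depth(c) / (depth(a) + depth(b)), where c is their least common subsumer in the taxonomy. The sketch below applies it to a tiny hand-made hierarchy. The `PARENT` table is hypothetical and is not the real WordNet structure, and counting the root as depth 1 is an assumed convention.

```python
# Toy taxonomy: child -> parent. Hypothetical edges for illustration only;
# this is NOT the actual WordNet hierarchy.
PARENT = {
    "entity": None,
    "organization": "entity",
    "medical_institution": "organization",
    "clinic": "medical_institution",
    "illness": "entity",
    "tumor": "illness",
    "cancer": "tumor",
}


def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(path_to_root(concept))


def least_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_of_b = set(path_to_root(b))
    for node in path_to_root(a):  # walks upwards, so the first hit is the deepest
        if node in ancestors_of_b:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(c) / (depth(a) + depth(b))."""
    c = least_common_subsumer(a, b)
    return 2.0 * depth(c) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wu_palmer("clinic", "medical_institution"))
    print(wu_palmer("clinic", "cancer"))
```

Concepts that only meet at the root get a low score, while a concept and its direct parent score close to 1.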

21st European Conference on Artificial Intelligence. ECAI 2014. Prague. August 18th-22nd.

Micro-blogging services such as Twitter constitute one of the most successful kinds of applications in the current Social Web.

There is growing interest in the design and development of tools that allow users to analyse large unstructured repositories of user-tagged data, for tasks such as data visualisation, semantic information retrieval and information extraction, hashtag recommendation, etc.

Our research is focused on the automated clustering of the hashtags present in a set of tweets, which may lead to a straightforward discovery of their main topics.

Every day more than 500 million tweets are sent.

[1] Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). MIT Press.

Table 1. Results of Oncology tweet set

Hashtags are unstructured and unlimited, which leads to problems such as synonymy, polysemy, lexically similar hashtags, acronyms, combinations of several words, or simply invented expressions. Current hashtag clustering methods are mostly based on a syntactic analysis of their co-occurrence.

Goal: The aim of this stage is to fill the gap between hashtags and concepts in order to be able to compare two different hashtags at the semantic level.

This unsupervised, domain-independent methodology allows a semantic clustering of a set of hashtags and the identification of its most relevant topics, filtering out the large proportion of noise inherent to these sets.


Oncology dataset: 5000 tweets (www.symplur.com), 1086 hashtags (930 annotated).

536 relevant medical hashtags classified in 16 manually labelled classes.

Automatic results (minK=5, maxK=200, t1=0.70 and t2=10): 13 of the 16 target manual classes were identified; 15 relevant clusters (with a total of 266 hashtags).

Fig.1. Semantic annotation process

1) Analysis of the full content of the tweets.

2) Improve the treatment of polysemic hashtags.

3) Replace the bottom-up analysis of the hierarchical tree with a top-down procedure, so that general clusters are considered before the specific ones.

Carlos Vicient, Antonio Moreno

Semantic Annotation

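[Fig. 1 diagram: fragment of the WordNet [1] hierarchy with the nodes entity, illness, tumor, cancer, sign of zodiac, cancer (crab), medical institution, clinic and organization; candidate terms shown include Clinic, Organization, Companies, Hospital, Award, Care and Winner.]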

Semantic Similarity

Goal: Establish a mechanism to compare two different hashtags. Build a similarity matrix that contains all the hashtags.
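A minimal sketch of how such a matrix might be assembled once each hashtag is annotated with a set of concepts. The poster does not say how several concepts per hashtag are combined, so taking the maximum over all concept pairs is an assumption. The scores in `TOY_SCORES` are placeholders, not values from the poster, and `concept_sim` merely stands in for a real concept-level measure such as Wu-Palmer.

```python
from itertools import product

# Placeholder concept-to-concept scores (hypothetical, for illustration only).
TOY_SCORES = {
    frozenset(["clinic", "medical institution"]): 0.7,
    frozenset(["clinic", "center"]): 0.3,
    frozenset(["organization", "medical institution"]): 0.5,
    frozenset(["organization", "center"]): 0.2,
}


def concept_sim(a, b):
    """Stand-in for a concept-level similarity such as Wu-Palmer."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset([a, b]), 0.0)


def hashtag_similarity(concepts_a, concepts_b, sim=concept_sim):
    """Assumed aggregation: best score over all pairs of concepts."""
    return max(
        (sim(a, b) for a, b in product(concepts_a, concepts_b)),
        default=0.0,  # a hashtag without annotations matches nothing
    )


def similarity_matrix(annotations, sim=concept_sim):
    """Return (sorted hashtags, square matrix of hashtag similarities)."""
    tags = sorted(annotations)
    matrix = [
        [hashtag_similarity(annotations[t], annotations[u], sim) for u in tags]
        for t in tags
    ]
    return tags, matrix


if __name__ == "__main__":
    annotations = {
        "#Mayo": {"clinic", "organization"},
        "#AustinCancerCenter": {"center", "medical institution"},
    }
    tags, matrix = similarity_matrix(annotations)
    for tag, row in zip(tags, matrix):
        print(tag, row)
```

Other aggregations (for example the average over pairs) would fit the same interface. Only `hashtag_similarity` would change.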

Clustering & filtering

Evaluation

Goal: Group hashtags in clusters and select the most relevant ones. A bottom-up analysis from maxK clusters to minK clusters is performed.
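One way to picture the bottom-up analysis is to build a hierarchical clustering and then visit its partitions from maxK clusters down to minK clusters. The sketch below assumes average-linkage agglomerative clustering over the similarity matrix. The poster does not state the linkage criterion, and the helper names `agglomerate` and `partitions_between` are hypothetical.

```python
def agglomerate(sim):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Returns a dict mapping k -> partition with k clusters (lists of indices).
    """
    clusters = [[i] for i in range(len(sim))]
    partitions = {len(clusters): [c[:] for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average similarity between members of the two clusters.
                total = sum(sim[a][b] for a in clusters[i] for b in clusters[j])
                link = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or link > best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the most similar pair
        del clusters[j]
        partitions[len(clusters)] = [c[:] for c in clusters]
    return partitions


def partitions_between(sim, min_k, max_k):
    """Partitions visited bottom-up: from max_k clusters down to min_k."""
    parts = agglomerate(sim)
    max_k = min(max_k, len(sim))  # cannot have more clusters than items
    return [parts[k] for k in range(max_k, min_k - 1, -1) if k in parts]


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
    for partition in partitions_between(sim, min_k=2, max_k=3):
        print(len(partition), partition)
```

Each visited partition could then be passed through the filtering step to decide which clusters are kept.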

Fig.2. Filtering process (t1=0.6, t2=3)

Filtering thresholds:

T1: establishes the minimum inter-cluster homogeneity.

T2: establishes the minimum number of elements per cluster.
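The two thresholds can be sketched as a simple filter over candidate clusters. How homogeneity is measured is not given in the poster; the code below assumes it is the mean pairwise similarity among the members of a cluster, and the function names are hypothetical.

```python
def homogeneity(cluster, sim):
    """Assumed measure: mean pairwise similarity among cluster members."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    if not pairs:
        return 1.0  # a single element is trivially homogeneous
    return sum(sim[a][b] for a, b in pairs) / len(pairs)


def filter_clusters(clusters, sim, t1, t2):
    """Keep clusters with homogeneity >= t1 and at least t2 elements."""
    return [
        c for c in clusters
        if len(c) >= t2 and homogeneity(c, sim) >= t1
    ]


if __name__ == "__main__":
    # Items 0-2 are mutually similar, items 3-4 are similar to each other.
    sim = [
        [1.0, 0.8, 0.8, 0.1, 0.1],
        [0.8, 1.0, 0.8, 0.1, 0.1],
        [0.8, 0.8, 1.0, 0.1, 0.1],
        [0.1, 0.1, 0.1, 1.0, 0.9],
        [0.1, 0.1, 0.1, 0.9, 1.0],
    ]
    candidates = [[0, 1, 2], [3, 4], [0, 1, 2, 3]]
    # Thresholds taken from the Fig. 2 caption (t1=0.6, t2=3).
    print(filter_clusters(candidates, sim, t1=0.6, t2=3))
```

In this toy run the pair [3, 4] is dropped for being too small and [0, 1, 2, 3] for being too heterogeneous.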

Id     Centroid       Size  Prec.  Rec.  Manual class
1      Woody_Plant    11    64%    41%   Substances
2      Day            10    50%    24%   Temporal
3      Therapy        20    75%    37%   Medical tests
4      Medicine       17    76%    23%   Medications
5      Cancer         46    80%    62%   Cancer
6      Court          14    43%    43%   Hospitals
7      Biotechnology  10    60%    15%   Biological
8      Health         23    43%    45%   Health Care
9      Medicine       43    60%    63%   Medical Fields
10     System         10    70%    21%   Body Parts
11     Area           11    73%    18%   Geographical Locations
12     Teaching       10    40%    14%   Academic, Research
13     Person         16    38%    12%   Medical Jobs
14     Center         10    40%    12%   Body Parts
15     Doctor         15    60%    18%   Medical Jobs
16-31  -              -     -      -     Noise
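For reading the Prec. and Rec. columns, the usual set-based definitions can be sketched as follows. The poster does not spell out the computation, so this assumes precision is the share of a cluster's hashtags that belong to the matched manual class, and recall is the share of that class recovered by the cluster. The hashtags in the example are made up.

```python
def precision_recall(cluster, manual_class):
    """Set-based precision and recall of one cluster against one manual class."""
    cluster, manual_class = set(cluster), set(manual_class)
    hits = len(cluster & manual_class)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(manual_class) if manual_class else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical hashtags, not taken from the Oncology dataset.
    cluster = {"#chemo", "#radiotherapy", "#biopsy", "#monday"}
    manual_class = {
        "#chemo", "#radiotherapy", "#biopsy",
        "#mri", "#petscan", "#bloodtest",
    }
    precision, recall = precision_recall(cluster, manual_class)
    print(f"precision={precision:.0%} recall={recall:.0%}")
```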

[Bar chart: #tweets (y-axis) against #hashtags (x-axis, 1 to 12).]

[2] Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico, USA, 133-138.

Hashtag1: “#Mayo” { clinic, organization }

Hashtag2: “#AustinCancerCenter” { center, medical institution }

0.4 0.6 0.8 0.9

*c = LeastCommonSubsumer(a,b)