normalising medical concepts in social media texts by...

Normalising Medical Concepts in Social Media Texts by LearningSemantic Representation

Nut Limsopatham and Nigel CollierLanguage Technology Lab

Department of Theoretical and Applied LinguisticsUniversity of Cambridge

Cambridge, UK{nl347,nhc30}@cam.ac.uk

Abstract

Automatically recognising medical con-cepts mentioned in social media messages(e.g. tweets) enables several applicationsfor enhancing health quality of people ina community, e.g. real-time monitoring ofinfectious diseases in population. How-ever, the discrepancy between the type oflanguage used in social media and med-ical ontologies poses a major challenge.Existing studies deal with this challengeby employing techniques, such as lexi-cal term matching and statistical machinetranslation. In this work, we handle themedical concept normalisation at the se-mantic level. We investigate the use ofneural networks to learn the transition be-tween layman’s language used in socialmedia messages and formal medical lan-guage used in the descriptions of medi-cal concepts in a standard ontology. Weevaluate our approaches using three differ-ent datasets, where social media texts areextracted from Twitter messages and blogposts. Our experimental results show thatour proposed approaches significantly andconsistently outperform existing effectivebaselines, which achieved state-of-the-artperformance on several medical conceptnormalisation tasks, by up to 44%.

1 Introduction

Existing studies (O’Connor et al., 2014; Lim-sopatham and Collier, 2015a; Limsopatham andCollier, 2015b) have shown that data from socialmedia (e.g. Twitter1 and Facebook2) can be lever-aged to improve the understanding of patients’ ex-

1http://twitter.com2http://facebook.com

perience in healthcare, such as the spread of infec-tious diseases and side-effects of drugs. However,the lexical and grammatical variability of the lan-guage used in social media poses a key challengefor extracting information (Baldwin et al., 2013;O’Connor et al., 2014). In particular, the frequentuse of informal language, non-standard grammarand abbreviation forms, as well as typos in socialmedia messages has to be taken into account byeffective information extraction systems.

The task of medical concept normalisation forsocial media text, which aims to map a variablelength social media message to a medical con-cept in some external coding system, is facedwith a similar challenge (Limsopatham and Col-lier, 2015b). Traditional approaches, e.g. (Ris-tad and Yianilos, 1998; Aronson, 2001; Lu etal., 2011; McCallum et al., 2012), used prox-imity matching or heuristic string matching rulesbased on dictionary lookup when mapping textsto medical concepts. For example, Ristad andYianilos (1998) incorporated edit-distance whenmapping similar texts. The MetaMap system ofAronson (2001) applied a rule-based approach us-ing pre-defined variants of terms when mappingtexts to medical concepts in the UMLS Metathe-saurus3. However, as shown in Table 1, exist-ing string matching techniques may not be able tomap the social media message “moon face and 30lbs in 6 weeks” to the medical concept ‘WeightGain’, or map “head spinning a little” to ‘Dizzi-ness’, as no words in the social media messagesand the description of the medical concepts corre-spond. Recent studies, e.g. (Leaman et al., 2013;Leaman and Lu, 2014; Limsopatham and Collier,2015a), applied machine learning techniques totake into account relationships between differentwords (e.g. synonyms) when performing normal-

3https://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html

Social media message Description of corresponding medical conceptlose my appetite Loss of appetitei don’t hunger or thirst Loss of Appetitehungry Hungermoon face and 30 lbs in 6 weeks Weight Gaingained 7 lbs Weight Gainlose the 10 lbs Body Weight Decreasedfeeling dizzy ... Dizzinesshead spinning a little Dizzinessterrible headache!! Headache

Table 1: Examples of social media messages and their related medical concepts.

isation. For instance, the DNorm system of Lea-man et al. (2013), which achieved state-of-the-artperformance on several medical concept normal-isation tasks for medical articles (Dogan et al.,2014) and patient records (Suominen et al., 2013),used a pairwise learning-to-rank technique to learnthe similarity between different terms when per-forming concept normalisation. Limsopatham andCollier (2015a) leveraged translations between theinformal language used in social media and theformal language used in the description of medicalconcepts in an ontology. However, we argue thateffective concept normalisation requires a systemto take into account the semantics of social me-dia messages and medical concepts. For example,to be able to map from the social media message“i don’t hunger or thirst” to the medical concept‘Loss of Appetite’, a normalisation system has totake into account the semantics of the whole mes-sage; otherwise, “i don’t hunger or thirst” may bemapped to the medical concept ‘Hunger’, becausethey contain the term “hunger” in common.

In this work, we go beyond string matching. Wepropose to learn and exploit the semantic simi-larity between texts from social media messagesand medical concepts using deep neural networks.In particular, we investigate the use of techniquesfrom two families of deep neural networks, i.e. aconvolutional neural network (CNN) and a recur-rent neural network (RNN), to learn the mappingbetween social media texts and medical concepts.We evaluate our approaches using three differentdatasets that contain messages from Twitter andblog posts. Our experimental results show that ourproposed approaches significantly outperform ex-isting strong baselines (e.g. DNorm) across all ofthe three datasets. The performance improvementis by up to 44%.

The main contributions of this paper are three-fold:

1. We propose two novel approaches based onCNN and RNN for medical concept normali-sation.

2. We introduce two datasets with the gold-standard mappings between medical con-cepts and social media texts extracted fromtweets and blog posts, respectively.

3. We thoroughly evaluate our proposed ap-proaches using these two datasets and an ex-isting dataset of tweets related to the topicof adverse drug reactions (ADRs) (Lim-sopatham and Collier, 2015a).

The remainder of this paper is organised as fol-lows. In Section 2, we discuss related work andposition our paper in the literature. Section 3 in-troduces our neural network approaches for med-ical concept normalisation. We describe our ex-perimental setup and empirically evaluate our pro-posed approaches in Sections 4 and 5, respec-tively. We provide further analysis and discussionof our approaches in Section 6. Finally, Section 7provides concluding remarks.

2 Related Work

Existing techniques for concept normalisation aremostly based on string matching (e.g. (Tsuruokaet al., 2007; Ristad and Yianilos, 1998; Lu etal., 2011; McCallum et al., 2012). For exam-ple, McCallum et al. (2012) used conditional ran-dom field to learn edit distance between phrases.In the medical domain, Tsuruoka et al. (2007)learned mappings between phrases in medicaldocuments and medical concepts by using stringmatching features, such as character bigrams and

common tokens. Meanwhile, Metke-Jimenez andKarimi (2015), and O’Connor et al. (2014) usedterm weighting techniques, such as TF-IDF andBM25 (Robertson and Zaragoza, 2009) to retrieverelevant concepts. We tackle the concept normali-sation task in a different manner. In particular, weuse deep neural networks to capture the similarityand/or dependency between terms and effectivelyrepresent a given social media message in a low di-mensional vector representation, before mappingit to a medical concept.

Another research area related to this work isthe exploitation of word embeddings (i.e. dis-tributed vector representation of words). It hasbeen empirically shown that word embeddingscan capture semantic and syntactic similarities be-tween words (Turian et al., 2010; Mikolov et al.,2013b; Pennington et al., 2014; Levy and Gold-berg, 2014). The cosine similarity between vectorsof words has a positive correlation with the seman-tic similarity between them (Mikolov et al., 2013b;Pennington et al., 2014). Importantly, word em-beddings have been effectively used for severalNLP tasks, such as named entity recognition (Pas-sos et al., 2014), machine translation (Mikolov etal., 2013a) and part-of-speech tagging (Turian etal., 2010). In the context of concept normalisa-tion, Limsopatham and Collier (2015a) showedthat effective performance could be achieved bymapping the processed social media messages andmedical concepts using the similarity of their em-beddings. In this work, we use word embeddingsas inputs of deep neural networks, which wouldallow an effective representation of words whenlearning the concept normalisation.

Neural networks, such as convolutional neu-ral networks (CNN) and recurrent neural net-works (RNN), have been effectively applied toNLP tasks, such as NER, sentiment classifica-tions and machine translation (Collobert et al.,2011; Kim, 2014; Bahdanau et al., 2014). Forexample, Collobert et al. (2011) effectively useda multilayer neural network for chunking, part-of-speech tagging, NER and semantic role labelling.Kim (2014) effectively used CNN with pre-builtword embeddings when performing sentence clas-sifications. Kalchbrenner et al. (2014) learned rep-resentation of sentences by using CNN. Mean-while, Bahdanau et al. (2014) used RNN to encodea sentence written in one language (e.g. French)into a fixed length vector before decoding it to

Figure 1: Our CNN architecture for medical con-cept normalisation.

a sentence in another language (e.g. English) fortranslation. Socher et al. used recursive neural net-works to model sentences for different tasks, in-cluding paraphrase detection (Socher et al., 2011)and sentence classification (Socher et al., 2013).In this paper, we investigate only the use of CNNand RNN for medical concept normalisation, asrecursive neural networks require parse trees ofinput sentences while grammatical rules are typ-ically ignored in social media messages.

3 Neural Networks for ConceptNormalisation

Next, we introduce our medical concept normali-sation approaches based on CNN and RNN in Sec-tions 3.1 and 3.2, respectively.

3.1 CNN for Concept Normalisation

Our first approach uses CNN to learn the seman-tic representation of a social media message be-fore mapping it to an appropriate medical concept.We use a CNN architecture with a single convo-lutional and pooling layer, as shown in Figure 1.Specifically, we firstly represent a given social me-dia message of length l words (padded where nec-essary) using a sentence matrix S ∈ Rd×l:

S =

| | | |x1 x2 x3 ... xl

| | | |

(1)

where each column of S is the d-dimensional vec-tor (i.e. embedding) xi ∈ Rd of each word inthe social media message, which can be retrievedfrom pre-built word embeddings. This allows themodel to take into account semantic features fromthe embeddings of each word.

Figure 2: Our RNN architecture for medical con-cept normalisation.

We then apply a convolution operation using afilter w ∈ Rd×h to a window of h words. In par-ticular, the filter w is convolved over the sequenceof words in the sentence matrix S to create a fea-ture matrix C. Each feature ci in C is extractedfrom a window of words xi:i+h−1, as follow:

ci = f(w · xi:i+h−1 + b) (2)

where f is an activation function, such as sigmoidor tanh, and b ∈ R is a bias. Note that multi-ple filters (e.g. using different size h of window ofwords) can be used to extract multiple features.

This convolution operation enables the learningof dependencies between words from their seman-tic representation (i.e. word embeddings). In orderto capture the most important features, max pool-ing (Collobert et al., 2011) is applied to take themaximum value of each row in the matrix C:

cmax =

max(C1,:)...

max(Cd,:)

(3)

Finally, the fixed sized vector cmax forms afully connected layer, which is used as inputs ofsoftmax for multi-class classification. Indeed, thevector cmax provides a sentence representationthat captures an extensional semantic informationof the social media message for softmax to map toan appropriate medical concept.

3.2 RNN for Concept NormalisationOur second approach uses RNN to capture the se-mantics of sequences of words in a social mediamessage during normalisation. This approach isdifferent from the CNN approach (introduced Sec-tion 3.1) in that instead of using the convolutional

TwADR-S TwADR-L AskAPatient|Q| 201 1,436 8,662|VQ| 488 995 2,872|C| 58 2,200 1,036|VC | 98 2,394 1,200|Q 7→ C|avg 3.4655 0.6428 8.3610|Q 7→ C|SD 5.6264 3.3168 39.2009|Q 7→ C|min 1 0 1|Q 7→ C|max 35 58 1,073

Table 2: Statistics of the datasets used in the exper-iments. |Q|: Number of queries. |VQ|: Vocabularysize of queries. |C|: Number of target concepts.|VC |: Vocabulary size of definition of target con-cepts. |Q 7→ C|avg and |Q 7→ C|SD: Averagenumber of queries mapped to each target concept,and its standard deviation (SD). |Q 7→ C|min and|Q 7→ C|max: Mininum and maximum number ofqueries mapped to a given target concept, respec-tively.

layer to learn the representation of social mediamessages (i.e. the vector representation at the fullyconnected layer), our RNN approach deploys a re-current layer, as shown in Figure 2. Similar tothe CNN approach, we initially represent a socialmedia message of length l words using a sentencematrix S ∈ Rd×l, as in Equation (1).

Then, the recurrent layer processes the vector xi

of each word in the social media message sequen-tially and produces a hidden state output hi ∈ Rk,where k ∈ Z and k > 0. Importantly, whenprocessing each input vector xi, the hidden stateoutput hi−1 from the previous word is also recur-sively taken into account:

hi = f (hi−1,xi) (4)

where f is a recurrent unit, such as long short-term memory (LSTM) (Hochreiter and Schmidhu-ber, 1997) and gated recurrent unit (GRU) (Cho etal., 2014).

Finally, the hidden state output hl, which is theoutput from processing the last word of the socialmedia message, is used as an input of the soft-max for identifying the appropriate concept, in thesame manner as the vector at the fully connectedlayer of the CNN approach in Section 3.1.

4 Experimental Setup

4.1 Datasets

To evaluate our proposed approaches, we use threedifferent datasets (namely, TwADR-S, TwADR-L

and AskAPatient)4, where the task is to map asocial media phrase to a relevant medical con-cept. In these datasets, a given social mediaphrase is mapped to only one medical concept.Table 2 shows statistics for the three datasets.In particular, TwADR-S is the dataset providedby Limsopatham and Collier (2015a), which con-tains 201 Twitter phrases and their correspondingSNOMED-CT5 concept. The total number of tar-get concepts is 58, while on average a medicalconcept can be mapped by 3.47 queries with thestandard deviation of 5.63.

The TwADR-L dataset is our new dataset thatwe constructed from a collection of three monthsof tweets (between July and November 2015),downloaded using the Twitter Streaming API6 byfiltering using the name of a pre-defined set ofdrugs, which have been used in the literature forADR profiling (e.g. cognitive enhancers) (Benderet al., 2007). These tweets were sampled and thenannotated by undergraduate-level linguists. Thiscollection contains 1,436 Twitter phrases that canbe mapped to one of 2,220 medical concepts fromthe SIDER 4 database of drug profiles7. Note that1,947 from the 2,220 concepts are not relevant toany of the Twitter phrases.

For the AskAPatient dataset, we extracted thegold-standard mappings of social media messagesand medical concepts from the ADR annotationcollection of Karimi et al. (2015). Our AskA-Patient dataset contains 8,662 phrases8, each ofwhich can be mapped to one of the 1,036 med-ical concepts from SNOMED-CT and AMT (theAustralian Medicines Terminology). We expectthis dataset to be less difficult than TwADR-S andTwADR-L, as the nature of blog posts is less in-formal and ambiguous than Twitter messages.

For each of the datasets, we randomly divideit into ten equally folds, so that our approachesand the baselines would be trained on the samesets of data. We evaluate our approaches based onthe accuracy performance, averaged across the tenfolds. The significant difference between the per-formance of our approaches and the baselines ismeasured using the paired t-test (p < 0.05).

4TwADR-L and AskAPatient datasets are available onZenodo.org (DOI:http://dx.doi.org/10.5281/zenodo.55013).

5http://www.ihtsdo.org/snomed-ct.6https://dev.twitter.com/streaming/

overview7http://sideeffects.embl.de/8From blog posts on http://www.askapatient.

com website.

4.2 Pre-trained Word Embeddings

As our CNN (Section 3.1) and RNN (Section 3.2)approaches require word vectors as inputs, weinvestigate the use of two different pre-trainedword embeddings. The first word embeddings(denoted, GNews) are the publicly available 300-dimension embeddings (vocabulary size of 3M)that were induced from 100 billion words fromGoogle News using word2vec (Mikolov et al.,2013b)9, which has been shown to be effective forseveral tasks (Baroni et al., 2014; Kim, 2014). Thesecond word embeddings (denoted, BMC) inducedfrom 854M words of medical articles downloadedfrom BioMed Central10 by using the skip-grammodel from word2vec (with default parameters).The BMC embeddings also have 300 dimension.For the words that do not existing in any embed-dings, we use a vector of random values sampledfrom [−0.25, 0.25].

As an alternative, we also use randomly gener-ated embeddings (denoted, Rand) with 300 dimen-sions, where a vector representation of each wordis randomly sampled from [−0.25, 0.25]. This al-lows the investigation of the effectiveness of ourapproaches when the semantic information frompre-built embeddings is not available.

4.3 Parameters of Our CNN and RNNApproaches

For our CNN approach, we use the filter w withthe window size h of 3, 4 and 5, each of whichwith 100 feature maps, which have shown to be ef-fective for modelling sentences in sentiment anal-ysis (Kim, 2014). For the RNN, we use gated re-current unit (GRU) (Cho et al., 2014) and set thesize k of the output vector of each recurrent unit to100.

In addition, for both CNN and RNN, we use rec-tifier linear unit (ReLU) (Nair and Hinton, 2010)as activation functions. We also apply L2 regular-isation of the weight vectors. We train the mod-els over a mini-batch of size 50 to minimise thenegative log-likelihood of correct predictions. Thestochastic gradient descent with back-propagationis performed using Adadelta update rule (Zeiler,2012). We initially set the number of epochs fortraining both CNN and RNN approaches to be100, and allow the models to update the input

9https://code.google.com/p/word2vec/10http://www.biomedcentral.com/about/

datamining

embeddings in the sentence matrix S. Later, inSections 6.2 and 6.3, we discuss the performanceachieved as we vary the number of epochs used fortraining the models, and the performance achieveswhen we allow and do not allow the models to up-date the input embeddings, respectively.

4.4 BaselinesWe consider five different baselines as follows:

1. TF-IDF: A traditional term matching-basedapproach, using the TF-IDF score.

2. BM25: A traditional term matching-basedapproach, using the BM25 score, which hasshown to be effective for several text retrievaltasks (Robertson and Zaragoza, 2009)

3. EmbSim: The cosine similarity between theword vector representation of a social mediaphrase and the description of a medical con-cept. If the phrase (or the description) con-tains several words, we represent it by addingup the values of the same dimension of theembedding of each word.

4. DNorm: A machine learning-based ap-proach that exploits the relationships betweenwords (e.g. synonyms) learned from train-ing data (Leaman and Lu, 2014). This ap-proach achieved state-of-the-art performancefor several medical concept normalisationtasks (Suominen et al., 2013; Dogan et al.,2014). Note that we customise the open-source version11 of DNorm to enable themapping to a specific set of the target con-cepts for each dataset.

5. P-MT: The concept normalisation approachthat translates social media texts to a formalmedical text before mapping to appropriatemedical concepts using the cosine similarityof their embeddings (Limsopatham and Col-lier, 2015a). We use the variant where thetop-5 translations are used to map the med-ical concepts by taking the ranked positioninto account. We calculate the cosine sim-ilarity using either the GNews or the BMCembeddings.

6. LogisticRegression: A variant of our pro-posed approaches where we concatenate em-beddings of terms (padded where necessary)

11http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#DNorm

in each social media phrase into a fixed-sizesentence vector, before using this vector asinput features for a multi-class logistic re-gression classifier.

Another possible baseline is a word-sense dis-ambiguation system, such as IMS (Zhong and Ng,2010). Nevertheless, the results from our initialexperiments using IMS showed that it could notperform effectively on the three datasets. This isbecause the performance of IMS depends heavilyon the contexts (i.e. words surrounding the inputphrase); however, such contexts are not availablein any of the three datasets. Therefore, we do notreport the performance of IMS in this paper.

Note that for the baselines that require trainingdata (i.e. DNorm and P-MT) and our two proposedapproaches, apart from the training data provideswith each fold of the datasets, we also train themusing the descriptions of the target medical con-cepts, as these data are also used by the non-supervised baselines (i.e. TF-IDF, BM25 and Em-bSim).

5 Experimental Results

In this section, we compare the performance ofour CNN and RNN approaches for medical con-cept normalisation against the six baselines, intro-duced in Section 4.4. Table 3 compares the perfor-mances of our proposed approaches with the base-lines in terms of accuracy on the three datasets (i.e.TwADR-S, TwADR-L, AskAPatient). Overall, asexpected, the accuracy performance achieved byour approaches and the baselines on the AskA-Patient dataset is higher than the TwADR-L andTwADR-S. This is due to nature use of languagein Twitter, which is more ambiguous and infor-mal than blog posts. When comparing amongthe existing baseline approaches, we observe thatDNorm and P-MT are the most effective base-lines. In particular, DNorm outperforms the otherbaselines for the TwADR-S (accuracy 0.2983) andAskAPatient (accuracy 0.7339) datasets, whileP-MT with GNews embeddings is the most ef-fective baseline for the TwADR-L dataset (ac-curacy 0.3371). In addition, term matching-based approaches, i.e. TF-IDF (accuracy 0.1638,0.2293 and 0.5547, respectively) and BM25 (ac-curacy 0.1638, 0.2300 and 0.5546), achieve al-most similar performances, which are also com-parable to the performances of EmbSim baselines.When comparing the effectiveness of different

Approach Word EmbeddingsAccuracy

TwADR-S TwADR-L AskAPatientTF-IDF - 0.1638 0.2293 0.5547BM25 - 0.1638 0.2300 0.5546EmbSim GNews 0.2494 0.2326 0.5422EmbSim BMC 0.1348 0.2057 0.5141DNorm - 0.2983 0.3099 0.7339P-MT GNews 0.2346 0.3371 0.7235P-MT BMC 0.1248 0.3114 0.7126LogisticRegression GNews 0.3186 0.3409 0.7767LogisticRegression BMC 0.3036 0.3548 0.7752CNN Rand 0.3229• 0.4267∗◦• 0.8013∗◦•

CNN GNews 0.4174∗◦• 0.4478∗◦• 0.8141∗◦•CNN BMC 0.3921∗◦• 0.4415∗◦• 0.8139∗◦•

RNN Rand 0.2936• 0.3791∗◦• 0.7991∗◦•

RNN GNews 0.3529∗◦• 0.3882∗◦• 0.7998∗◦•RNN BMC 0.3331• 0.3847∗◦• 0.7996∗◦•

Table 3: The accuracy performance of our proposed approaches and the baselines. Significant differences(p < 0.05, paired t-test) compared to the DNorm, P-MT with GNews embeddings, and P-MT with BMCembeddings, are denoted ∗, ◦ and •, respectively.

pre-trained embeddings used in EmbSim and P-MT, we observe that GNews is more effective thanBMC for both approaches, across all of the threedatasets.

Next, we discuss the performance of our CNNand RNN approaches. From Table 3, we ob-serve that both CNN and RNN markedly outper-form all of the existing baselines for all of thethree datasets. When compared with DNorm andP-MT with GNews baselines, which are the mosteffective existing baselines, we observe that bothCNN and RNN significantly (p < 0.05, pairedt-test) outperform the two baselines for all ofthe three datasets. Indeed, for the TwADR-Ldataset, CNN with GNews (accuracy 0.4478) out-performs DNorm (accuracy 0.3099) by 44%. Inaddition, the choice of embeddings has a markedimpact on the achieved performance. In particu-lar, the GNews embeddings benefit both CNN andRNN more than the BMC embeddings, which isin line with the previous finding that GNews ismore useful than BMC for the EmbSim and P-MT baselines. On the other than, the randomlygenerated embeddings (i.e. Rand) are less useful.These results show that the semantics captured inword embeddings are useful for both CNN andRNN approaches for medical concept normalisa-tion. However, for both CNN and RNN, the choiceof embeddings that are employed has less impact

on the performance for the AskAPatient dataset,which has greater number of training data.

Furthermore, we observe that the LogisticRe-gression baseline, a variant of our proposed ap-proach that uses the multi-class logistic regressioninstead of neural networks for identifying rele-vance concepts, also outperforms the all of the ex-isting baselines. However, it performs worse thanboth CNN and RNN approaches. This shows thatwhile logistic regression can exploit the semanticsof embeddings of individual terms in social mediatexts (at the word level), it cannot learn the seman-tics of the whole phrase as effectively as CNN andRNN.

6 Analysis & Discussions

In this section, we further analyse the performanceachieved by our proposed approaches. As the per-formance achieved by our CNN approach is betterthan that of our RNN approach, we discuss onlyour CNN approach in this section.

6.1 Failure Analysis

We first discuss the results achieved by the base-lines and our CNN approach. As expected, weobserve that all approaches perform very well forthe social media phrases that lexically match withthe definition of the medical concepts, e.g. the so-cial media phrase “attention deficit disorder” is

(a) TwADR-S (b) TwADR-L (c) AskAPatient

Figure 3: The accuracy performance achieved by training with different numbers of epochs for the threedatasets.

mapped to the medical concept ‘Attention DeficitDisorder’. However, for a more complex phrases,such as “appetite on 10”, “my appetite way up”,“suppressed appetite”, the baselines, includingDNorm and P-MT, cannot effectively incorporatethe modifiers of the word “appetite” in differentphrases. For example, “appetite on 10”, “my ap-petite way up” should be mapped to ‘IncreasedAppetite’, while “suppressed appetite” should bemapped to ‘Loss of Appetite’. On the other hand,for social media phrases that do not have any termsin common with the definition of any medical con-cepts, all of the baselines performs poorly for mostof the cases. For instance, even though DNormcan learn that the term “focusing” has some rela-tionship with “concentration”, it maps any phrasescontaining “focusing” to the ‘Attention Concentra-tion Difficulty’ concept, including phrases, such as“focusing monster”, which should be mapped to‘Consciousness Abnormal’. Our CNN approachcould deal with most of these cases effectively,as it considers the semantic representation of thewhole phrase during normalisation.

6.2 Impact of Number of Training Epochs

Next, we discuss the normalisation performanceas we vary, between 1 and 200, the number ofepochs used for training our CNN model. Fig-ures 3(a), 3(b) and 3(c) show the performance interms of accuracy achieved during training andtesting for the TwADR-S, TWADR-L and AskAP-atient datasets, respectively. We observe that train-ing can be effectively achieved at around 60 - 70epochs for the TwADR-S and TwADR-L datasets,and around 40 epochs for the AskAPatient dataset,before the performance becomes stable. We noticea gap between the performance achieved duringtraining and testing, especially for the TwADR-S

DatasetAccuracy

CNN with CNN withupdated emb. fixed emb.

TwADR-S 0.4174 0.4369TwADR-L 0.4478 0.4590AskAPatient 0.8141• 0.7869

Table 4: The accuracy performance of our CNNapproach with the GNews embeddings, when al-lowing (updated emb.) and not allowing (fixedemb.) the model to update the input word embed-dings. Significant difference (p < 0.05, paired t-test) between the performance achieved by the twovariants, on each dataset, is denoted •.

and TwADR-L datasets; however, this gap shouldbe narrower if more training data are available.

6.3 Impact of Fixed Embeddings

In this section, we compare the performance ofour CNN with GNews embeddings when we al-low (updated emb.) and when we do not allow(fixed emb.) the input embeddings to be updated.Table 4 reports the accuracy performance of thetwo variants for the three datasets. We observe thatfor TwADR-S and TwADR-L datasets, which aresmaller datasets (dataset size of 201 and 1,436, re-spectively), a better performance can be achievedif the model is not allowed to update the embed-dings of the input phrases. In contrast, for theAskAPatient dataset (dataset size of 8,662), allow-ing the model to update the embeddings resultsin a significantly (paired t-test, p < 0.05) betterperformance. We observe the same trends of per-formance when using BMC embeddings. Theseresults suggest that for small datasets, we shouldleverage semantics from pre-built word embed-dings and do not allow the model to update the

embeddings. Meanwhile, for a larger dataset, fur-ther performance improvement can be achieved byallowing the model to update the embeddings.

7 Conclusions

We have motivated the importance of semanticswhen normalising medical concepts in social me-dia messages. In particular, as social media mes-sages are typically ambiguous, we argue that ef-fective concept normalisation should deal withthem at the semantic level. To do so, we intro-duced two neural network-based approaches formedical concept normalisation, which are basedon convolutional and recurrent neural network ar-chitectures. Our experimental results evaluated onthree different social media datasets showed thatboth of our approaches markedly and significantlyoutperformed several strong baselines, includingan existing approach that achieved state-of-the-artperformance on several medical concept normal-isation tasks. From the analysis of the results,we found that while some existing approaches cancapture synonyms of words, they could not lever-age the semantic meaning of the social media mes-sage. Our approaches overcomes this by learn-ing the semantic representation of the social mediamessage before passing it to a classifier to matchan appropriate concept.

Acknowledgements

The authors wish to thank funding support fromthe EPSRC (grant number EP/M005089/1).

ReferencesAlan R Aronson. 2001. Effective mapping of biomedi-

cal text to the umls metathesaurus: the metamap pro-gram. In AMIA, pages 17–21.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-gio. 2014. Neural machine translation by jointlylearning to align and translate. arXiv preprintarXiv:1409.0473.

Timothy Baldwin, Paul Cook, Marco Lui, AndrewMacKinlay, and Li Wang. 2013. How noisy so-cial media text, how diffrnt social media sources. InIJCNLP, pages 356–364.

Marco Baroni, Georgiana Dinu, and GermanKruszewski. 2014. Don’t count, predict! asystematic comparison of context-counting vs.context-predicting semantic vectors. In ACL, pages238–247.

Andreas Bender, Josef Scheiber, Meir Glick, John WDavies, Kamal Azzaoui, Jacques Hamon, LaszloUrban, Steven Whitebread, and Jeremy L Jenk-ins. 2007. Analysis of pharmacology data and theprediction of adverse drug reactions and off-targeteffects from chemical structure. ChemMedChem,2(6):861–873.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bah-danau, and Yoshua Bengio. 2014. On the propertiesof neural machine translation: Encoder-decoder ap-proaches. arXiv preprint arXiv:1409.1259.

Ronan Collobert, Jason Weston, Leon Bottou, MichaelKarlen, Koray Kavukcuoglu, and Pavel Kuksa.2011. Natural language processing (almost) fromscratch. J. Mach. Learn. Res., 12:2493–2537.

Rezarta Islamaj Dogan, Robert Leaman, and ZhiyongLu. 2014. Ncbi disease corpus: a resource for dis-ease name recognition and concept normalization.Journal of biomedical informatics, 47:1–10.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long short-term memory. Neural computation,9(8):1735–1780.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blun-som. 2014. A convolutional neural network formodelling sentences. In ACL, pages 655–665.

Sarvnaz Karimi, Alejandro Metke-Jimenez, MadonnaKemp, and Chen Wang. 2015. Cadec: A corpus ofadverse drug event annotations. Journal of biomed-ical informatics, 55:73–81.

Yoon Kim. 2014. Convolutional neural networks forsentence classification. In EMNLP, pages 1746–1751.

Robert Leaman and Zhiyong Lu. 2014. Automateddisease normalization with low rank approxima-tions. In ACL, pages 24–28.

Robert Leaman, Rezarta Islamaj Dogan, and Zhiy-ong Lu. 2013. Dnorm: disease name normaliza-tion with pairwise learning to rank. Bioinformatics,29(22):2909–2917.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In ACL, pages 302–308.

Nut Limsopatham and Nigel Collier. 2015a. Adaptingphrase-based machine translation to normalise med-ical terms in social media messages. In EMNLP,pages 1675–1680.

Nut Limsopatham and Nigel Collier. 2015b. Towardsthe semantic interpretation of personal health mes-sages from social media. In Proceedings of the ACMFirst International Workshop on Understanding theCity with Urban Informatics, UCUI ’15, pages 27–30, New York, NY, USA. ACM.

Zhiyong Lu, Hung-Yu Kao, Chih-Hsuan Wei, Min-lie Huang, Jingchen Liu, Cheng-Ju Kuo, Chun-Nan Hsu, Richard TH Tsai, Hong-Jie Dai, NaoakiOkazaki, et al. 2011. The gene normalization taskin biocreative iii. BMC bioinformatics, 12(Suppl8):S2.

Andrew McCallum, Kedar Bellare, and FernandoPereira. 2012. A conditional random field fordiscriminatively-trained finite-state string edit dis-tance. arXiv preprint arXiv:1207.1406.

Alejandro Metke-Jimenez and Sarvnaz Karimi. 2015.Concept extraction to identify adverse drug reac-tions in medical forums: A comparison of algo-rithms. arXiv preprint arXiv:1504.06936.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever.2013a. Exploiting similarities among lan-guages for machine translation. arXiv preprintarXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013b. Distributed representa-tions of words and phrases and their compositional-ity. In NIPS, pages 3111–3119.

Vinod Nair and Geoffrey E Hinton. 2010. Rectifiedlinear units improve restricted boltzmann machines.In ICML, pages 807–814.

Karen O’Connor, Pranoti Pimpalkhute, Azadeh Nik-farjam, Rachel Ginn, Karen L Smith, and GracielaGonzalez. 2014. Pharmacovigilance on twitter?mining tweets for adverse drug reactions. In AMIA,volume 2014, pages 924–933.

Alexandre Passos, Vineet Kumar, and Andrew Mc-Callum. 2014. Lexicon infused phrase embed-dings for named entity resolution. arXiv preprintarXiv:1404.5367.

Jeffrey Pennington, Richard Socher, and Christopher DManning. 2014. GloVe: Global vectors for wordrepresentation. In EMNLP, pages 1532–1543.

Eric Sven Ristad and Peter N Yianilos. 1998. Learningstring-edit distance. Pattern Analysis and MachineIntelligence, IEEE Transactions on, 20(5):522–532.

Stephen Robertson and Hugo Zaragoza. 2009. Theprobabilistic relevance framework: BM25 and be-yond. Now Publishers Inc.

Richard Socher, Eric H Huang, Jeffrey Pennin, Christo-pher D Manning, and Andrew Y Ng. 2011. Dy-namic pooling and unfolding recursive autoencodersfor paraphrase detection. In NIPS, pages 801–809.

Richard Socher, Alex Perelygin, Jean Y Wu, JasonChuang, Christopher D Manning, Andrew Y Ng,and Christopher Potts. 2013. Recursive deep mod-els for semantic compositionality over a sentimenttreebank. In EMNLP, pages 1631–1642.

Hanna Suominen, Sanna Salantera, Sumithra Velupil-lai, Wendy W Chapman, Guergana Savova, NoemieElhadad, Sameer Pradhan, Brett R South, Danielle LMowery, Gareth JF Jones, et al. 2013. Overviewof the share/clef ehealth evaluation lab 2013.In Information Access Evaluation. Multilinguality,Multimodality, and Visualization, pages 212–231.Springer.

Yoshimasa Tsuruoka, John McNaught, Sophia Ana-niadou, et al. 2007. Learning string simi-larity measures for gene/protein name dictionarylook-up using logistic regression. Bioinformatics,23(20):2768–2774.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.Word representations: a simple and general methodfor semi-supervised learning. In ACL, pages 384–394.

Matthew D Zeiler. 2012. ADADELTA: anadaptive learning rate method. arXiv preprintarXiv:1212.5701.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense:A wide-coverage word sense disambiguation systemfor free text. In ACL, pages 78–83.

normalising medical concepts in social media texts by...

Documents