

Research Article

Unsupervised Medical Entity Recognition and Linking in Chinese Online Medical Text

Jing Xu, Liang Gan, Mian Cheng, and Quanyuan Wu

School of Computer, National University of Defense Technology, Changsha, China

Correspondence should be addressed to Jing Xu; jingxunudteducn

Received 28 September 2017; Accepted 2 January 2018; Published 18 April 2018

Academic Editor: Robertas Damaševičius

Copyright © 2018 Jing Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Online medical text is full of references to medical entities (MEs), which are valuable in many applications, including medical knowledge base (KB) construction, decision support systems, and the treatment of diseases. However, the diverse and ambiguous nature of their surface forms makes ME identification difficult. Many existing solutions have focused on supervised approaches, which are often task-dependent. In other words, applying them to different kinds of corpora or identifying new entity categories requires major effort in data annotation and feature definition. In this paper, we propose unMERL, an unsupervised framework for recognizing and linking medical entities mentioned in Chinese online medical text. For ME recognition, unMERL first exploits a knowledge-driven approach to extract candidate entities from free text. Then, the categories of the candidate entities are determined using a distributed semantic-based approach. For ME linking, we propose a collaborative inference approach that takes full advantage of heterogeneous entity knowledge and unstructured information in the KB. Experimental results on real corpora demonstrate significant benefits compared to recent approaches with respect to both ME recognition and linking.

1. Introduction

In recent years, owing to the rapid development of technology and people's growing concern for their health, many medical websites have emerged. They provide not only diverse medical information, including health knowledge and medical news, but also online consultation services about diseases. Well-known Chinese medical websites include Family-doctor (http://www.familydoctor.com.cn/), Muzhi-doctor (http://muzhi.baidu.com/), and Qiuyi (http://www.qiuyi.cn/), which produce a large amount of medical question and answer (Q&A) data from real patients and doctors every day. These data contain many real individual cases with high medical value, motivating many medical applications, such as disease prevention and self-treatment.

Medical Q&A data, as unstructured text, contain many diverse and ambiguous references to medical entities. Diversity means that an entity is referred to in multiple ways, including aliases and abbreviations. Ambiguity means that different entities share the same surface form. For example, “传染病” (epidemic) could refer to either a disease or a film. This makes ME identification difficult, and entity recognition alone is limited in its ability to mine the data effectively. To fully mine and exploit useful medical knowledge, ME recognition and linking is a good solution. Specifically, it first detects and classifies the ME mentions in text and then resolves their meanings by linking the mentions to the correct entities in a given KB. For example, given a text such as “MF, 即骨髓纤维化, 症状为脾肿大⋯” (the symptom of MF, namely, myelofibrosis, is splenomegaly), ME recognition determines that the strings “MF” and “骨髓纤维化” (myelofibrosis) are diseases and that “脾肿大” (splenomegaly) is a symptom. ME linking performs the next step, inferring that “MF” and “骨髓纤维化” (myelofibrosis) actually refer to the entity at URL “http://baike.baidu.com/item/骨髓纤维化” and that “脾肿大” (splenomegaly) refers to the entity at URL “http://baike.baidu.com/item/脾大”.

Hindawi, Journal of Healthcare Engineering, Volume 2018, Article ID 2548537, 13 pages, https://doi.org/10.1155/2018/2548537

Medical entity recognition (MER) is a well-known problem that has been studied for decades. Medical entity linking (MEL) is a newer research issue that has attracted much attention because of its importance in many applications, such as understanding medical text, KB construction, and Q&A systems. However, existing works on this topic mainly focus on well-formed English text, such as electronic patient records and medical reports. Few studies have focused on Chinese online medical Q&A text. The research challenges can be summarized as follows: (1) Online medical Q&A text is characterized by unreliable tokenization, abbreviations, and misspellings, which makes it difficult to recognize the correct entity boundaries. (2) It is generally brief and lacks rich context information, which limits the context that can be leveraged to assist the linking. (3) Compared to English, Chinese has more complicated syntax rules, so it is difficult to reuse solutions designed for English.

In this paper, we design unMERL, an unsupervised framework that recognizes and links ME mentions in Chinese online medical text. To the best of our knowledge, this is the first paper that describes such a comprehensive framework for Chinese medical text. The main contributions of this work are as follows:

(1) unMERL uses a knowledge-driven approach to detect ME boundaries, which combines offline and online processes and thereby significantly improves recognition performance. In addition, a strategy that exploits the dependency relationships between words captures nested and combined medical entities well.

(2) unMERL uses an improved classifier based on text feature computation and semantic signature similarity, which can efficiently classify medical entities and further filter out nonmedical entities.

(3) The linking approach jointly considers name similarity, entity popularity, category consistency, context similarity, and the semantic correlation between entities, which helps to distinguish and determine the candidate entities. In addition, to address the incompleteness of the KB, we introduce an incremental evidence mining process, thereby significantly improving the linking performance.

(4) We extensively evaluate unMERL on the ME recognition and linking tasks over real datasets. The experimental results show that unMERL achieves significantly higher performance than current mainstream methods.

The remainder of this paper is organized as follows. Section 2 discusses related works. Section 3 presents our framework in detail. Section 4 describes our experiments, results, and discussion, and Section 5 concludes this paper.

2. Related Works

Entity recognition has been widely studied in the medical domain. Early works on this topic relied on heuristic rules and lexical resources [1–4]. Based on the name characteristics of medical entities, researchers encoded and mapped the terms in clinical text to lexical resources. Widely applied systems include MedLEE [1], EDGAR [2], and MetaMap [3]. The most well-known medical lexicons include MetaThesaurus [5], MeSH (Medical Subject Headings) [6], and SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms) [7]. The Chinese version of SNOMED-CT was published in 1997. Rule-based and lexicon-based systems depend on name regularity and lexicon size, which restricts them to extracting a limited set of normative entities. By incorporating the dependency relationships between words and an online detection process with a search engine, our approach addresses these problems.

More recently, Zhang and Elhadad proposed an unsupervised approach to biomedical named entity recognition that leverages terminologies, syntactic knowledge, and corpus statistics [8]. In addition, bootstrapping algorithms have attracted much attention in the context of medical entity recognition [9, 10]. Bootstrapping is an unsupervised machine learning approach that starts from small sets of seeds or rules and iteratively labels the corpus with them by pattern matching [11]. However, it relies on the quality of the seeds and the normalization of the corpus, so incorrect seed categories and irregular context information easily cause semantic deviation.

In recent years, many researchers have focused on statistical machine learning approaches in the medical field. The ME recognition problem is transformed into a sequence annotation or classification problem. The lexical, syntactic, and semantic features of words are used to train various learning models, such as HMM (hidden Markov model) [12, 13], MEM (maximum entropy model) [14, 15], CRF (conditional random field) [13, 16–19], and (structured) SVM (support vector machine) [19]. In addition, to alleviate the limitations of a single model, some researchers proposed cascading methods [15, 20], which combine multiple models, including CRF, (structured) SVM, and MEM. However, supervised machine learning approaches rely on a large training corpus that must be annotated by humans. Besides, it is difficult for a feature set to cover all entity types. As a result, these approaches are usually task-dependent. To solve this problem, we propose an unsupervised approach that leverages syntactic knowledge, corpus statistics, and lexical resources for ME recognition.

Medical entity linking is a newer problem. Some effective approaches for English corpora have been proposed. Glavas exploited semantic textual similarity for linking entity mentions in clinical text [21]. Zheng et al. proposed a collective inference approach that leverages semantic information and structures in an ontology to solve the entity linking problem for biomedical literature [22]. Wang et al. proposed a graph-based linking approach that first constructs graphs for mentions, the KB, and candidates and then exploits information entropy and a similarity algorithm to link biomedical entities [23]. These approaches depend on the context and the KB. Therefore, noise and missing information in the context reduce the linking accuracy. In addition, graph-based approaches have a high computational cost, and the incompleteness of the KB also impairs linking performance.

Our linking approach jointly considers multiple kinds of entity knowledge, which makes it more accurate in distinguishing and determining the candidate entities at a lower computational cost. Moreover, our solution adds a step that extracts the relevant context to mitigate the noise problem. To optimize the local KB, we also introduce an incremental evidence mining process with a third KB. Entity linking in the Chinese medical domain has been studied less than entity linking in the English medical domain. To our knowledge, our linking approach is the first solution for Chinese online medical text.

3. UnMERL

The framework of unMERL is shown in Figure 1. unMERL consists of two modules. The ME recognition module consumes an input corpus and performs entity boundary detection and entity classification; its output is a set of medical entities and their categories. For each recognized medical entity, the ME linking module generates candidates from the KB and then obtains the target object by ranking them.

3.1. Medical Entity Recognition. The ME recognition module aims to detect and classify all ME mentions in the input corpus. Named entity recognition (NER) [24] involves two main steps: detecting entity boundaries and classifying the entities into predefined categories. Accordingly, our ME recognition module is implemented as two separate processes in sequence: boundary detection and entity classification.

3.1.1. Boundary Detection. This step detects the boundaries of medical entities and collects candidates for entity classification. unMERL uses a knowledge-driven method that maps the input text to concepts in the lexical resources. Our approach differs from existing dictionary-based approaches in the following ways: (1) The entity candidates are identified based on the dependency relationships between words. This strategy captures combined and nested entities well and reduces the computational cost of subsequent processing by downsizing the candidate set. (2) A search engine is included as a lexical resource, which removes the restriction to the limited terms in a dictionary and performs well in detecting variant and rare entity names. The detection process is divided into two stages: candidate entity generation and medical entity detection.

(1) Candidate Entity Generation. Through corpus analysis, we find that a long medical entity is usually segmented into several fragments by a common natural language processing tool. The POS tag of each fragment is among those listed in Table 1. Moreover, these fragments generally have an attributive dependency relationship. For example, the text “骨髓纤维化简称髓纤，是一种骨髓增生性疾病，武汉协和医院有很好的治疗效果” (myelofibrosis, or MF in brief, is a myeloproliferative disease for which Wuhan Concorde Hospital has a very good therapeutic effect), parsed by the HanLP dependency parser (http://hanlp.linrunsoft.com/), is shown in Figure 2. The dependency labels are described in Table 1. Based on the hypothesis that entities should be noun phrases (NPs), we extract native NPs from the automatically parsed dependency trees as candidate entities. A native NP is a single noun (without attributive modifiers) or a maximal noun phrase whose words carry the POS tags in Table 1 and are connected by the dependency “ATT”. The candidate entities extracted from the above text are shown in the third row of Table 1.
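The native-NP rule can be illustrated with a minimal sketch. It assumes that the parser output is already available as a list of tokens with a POS tag, a 0-based head index, and a dependency label; the `Token` type, the function names, and the five-token parse in the usage note are hand-written for illustration and are not actual HanLP output. The sketch treats any token whose POS tag starts with "n" as a phrase head and extends the phrase leftward over contiguous "ATT" modifiers with an allowed POS tag, which is one plausible reading of the definition rather than the authors' implementation.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str
    pos: str      # POS tag, e.g. "n", "nz", "v"
    head: int     # 0-based index of the governing token, -1 for the root
    deprel: str   # dependency label, e.g. "ATT", "SBV"

# POS constraints from Table 1: exact tags, plus the wildcard tags gb* and n*.
EXACT_TAGS = {"f", "m", "b", "rr", "v"}
PREFIX_TAGS = ("gb", "n")

def allowed(pos: str) -> bool:
    return pos in EXACT_TAGS or pos.startswith(PREFIX_TAGS)

def native_nps(tokens: List[Token]) -> List[str]:
    """Collect maximal noun phrases built from contiguous ATT modifiers."""
    phrases, used = [], set()
    # Scan right to left so that each head noun absorbs its modifiers first.
    for i in range(len(tokens) - 1, -1, -1):
        if i in used or not tokens[i].pos.startswith("n"):
            continue
        start = i
        while (start > 0
               and tokens[start - 1].deprel == "ATT"
               and start <= tokens[start - 1].head <= i  # attaches inside the phrase
               and allowed(tokens[start - 1].pos)):
            start -= 1
        used.update(range(start, i + 1))
        phrases.append("".join(t.word for t in tokens[start:i + 1]))
    return phrases[::-1]  # restore left-to-right order
```

For a hand-written parse of “骨髓 纤维化 有 治疗 效果”, in which “骨髓” modifies “纤维化” and “治疗” modifies “效果” via ATT, `native_nps` returns “骨髓纤维化” and “治疗效果” as candidates.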

However, not all noun phrases are medical entities. To remove the nonmedical NPs, we employ a knowledge-driven method that discovers the concepts in the lexical resources that are referred to in the text. Here, we use Chinese SNOMED-CT [25], the medical KB of Baidu Baike (https://baike.baidu.com/science/medical), and the Sogou medical dictionaries (http://pinyin.sogou.com/dict/cate/index/132?rf=dictindex) as the offline lexical resources (LRs). To mitigate the limited coverage of these resources, we also use Baidu Search (https://www.baidu.com/) as an online lexical resource to help recognize medical entities.

(2) LR Description. As mentioned before, Chinese SNOMED-CT, translated from SNOMED-CT (English), is a clinical medicine standard that contains more than 140,000 clinical terms covering most aspects of clinical information. To correct erroneous terms in the translated version, we add the medical KB of Baidu Baike and the Sogou medical dictionaries. Baidu Baike contains more than 25,000 medical terms edited by authoritative organizations and experts. The Sogou medical dictionaries, as the lexicon resource of Sogou's input method, collect data from multiple medical websites. Baidu Search is the largest Chinese search engine, through which we can obtain information that the standardized LRs do not cover, such as emerging, rare, and variant medical entities.

Figure 1: Architecture of unMERL. (Diagram components: corpus; ME recognition, with boundary detection [candidate entity generation, offline detection, online detection] and entity classification [seed term collection, signature generation, category decision]; ME linking [candidate entity generation, candidate entity ranking by collaborative inference]; lexical resources, search engine, and knowledge base; results. Arrows denote control flow and data flow.)

(3) LR Preprocessing. Considering the heterogeneity and redundancy of the above offline LRs, we extract and fuse their medical terms to build a dictionary. In particular, we select specific categories of interest, which are also the target categories of our entity classification. Table 2 presents the statistics of the self-built dictionary. In addition, to improve retrieval efficiency, we build indices on the first phonetic letter of each term.

Given a candidate entity, the results returned by Baidu Search contain not only the target medical term but also noise that degrades entity recognition. Therefore, we need to process the search results to obtain a clean medical term. Based on the common knowledge that correct results outnumber incorrect ones, the method relies on corpus statistics and “LCS” (a function that returns the longest common substring), as shown in (1):

$$t_m = \mathrm{maxOccur}(r_1, r_2, \ldots, r_j),$$

$$r_j = \begin{cases} \mathrm{LCS}(t_k, s_k), & k \in K, \ \text{if } \mathrm{Len}(\mathrm{LCS}(t_k, s_k)) > 1, \\ \mathrm{LCS}(t_k, t_i), & k, i \in K, \ k \ne i. \end{cases} \tag{1}$$

Given the search result set $S = \{(t_k, s_k)\}_{k=1}^{K}$ ($t_k$ represents a title and $s_k$ represents a summary), we first obtain the kernel term of each result by using the “LCS” function. However, not all summaries contain the kernel terms in the titles. Therefore, we add the step $\mathrm{LCS}(t_k, t_i)$ for the search results without a common substring. Finally, from the kernel term set, we select the most frequent term $t_m$ as the correct medical term.
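The kernel-term selection in (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the search results are taken to be already available as (title, summary) pairs, the function names `lcs` and `kernel_term` are hypothetical, and when a title shares no usable substring with its summary the sketch keeps the longest substring it shares with any other title, which is one possible reading of the second case of (1).

```python
from collections import Counter

def lcs(a: str, b: str) -> str:
    """Longest common substring of a and b (row-by-row dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

def kernel_term(results):
    """results: list of (title, summary) pairs; returns the most frequent kernel term."""
    kernels = []
    for k, (title, summary) in enumerate(results):
        r = lcs(title, summary)
        if len(r) <= 1:
            # Assumed fallback: compare this title with the other titles instead.
            others = (lcs(title, t) for i, (t, _) in enumerate(results) if i != k)
            r = max(others, key=len, default="")
        if r:
            kernels.append(r)
    return Counter(kernels).most_common(1)[0][0] if kernels else None
```

With three made-up results, two of which mention “骨髓纤维化” in both title and summary, the majority vote selects “骨髓纤维化” even if the third result contributes a different fragment.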

In addition, because the search engine cannot by itself filter out the nonmedical entities in the candidate set, we establish a medical keyword set, which includes “医” (medicine), “药” (drug), “病” (disease), and “症” (symptom). If a search result contains one or more keywords from this set, we identify the candidate as a medical entity. Otherwise, it is removed as a nonmedical entity.
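The keyword filter amounts to a simple membership test. The sketch below assumes that the search results for one candidate are available as a list of text snippets and that a single matching snippet is enough to accept the candidate; the function name `is_medical` is hypothetical.

```python
# Keyword set quoted from the text above.
MEDICAL_KEYWORDS = ("医", "药", "病", "症")

def is_medical(search_results):
    """True if any result snippet contains at least one medical keyword."""
    return any(kw in text for text in search_results for kw in MEDICAL_KEYWORDS)
```

A candidate whose snippets never mention any of the four characters would be discarded as nonmedical under this reading.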

(4) Medical Entity Detection. Once the medical terms are acquired from the offline and online LRs, the medical entities can be detected in the candidate set. Because the LRs have different characteristics (the offline LRs have high accuracy but limited coverage, whereas the online LR has high coverage but lower accuracy), we divide the detection into an offline and an online process. Given the candidate set, unMERL first performs offline detection with the self-built dictionary. For the candidates it outputs as nonmedical, unMERL then performs online detection with the Baidu search engine. Here, we use string matching and a text distance constraint to implement the detection process:

$$\mathrm{Sim}_r(t_{cm}, t_m) = \frac{\mathrm{Len}(\mathrm{LCS}(t_{cm}, t_m))}{\min(\mathrm{Len}(t_{cm}), \mathrm{Len}(t_m))}, \tag{2}$$

$$\mathrm{Dis}(t_{cm}, t_m) = \left| \mathrm{Loc}(d_{t_{cm}}, t_{cm}) - \mathrm{Loc}(d_{t_{cm}}, t_m) \right|. \tag{3}$$

Table 1: Constraints on POS tags, description of dependency labels, and candidate entities of the sentence in Figure 2.

| Notation | Description |
|---|---|
| POS tags | f (preposition of locality), m (measure word), b (distinguishing word), rr (personal pronoun), v (verbal word), gb* (word related to biology), or n* (noun) |
| Dependency labels | HED (head), SBV (subject-verb), VOB (verb-object), ATT (attribution), COO (coordination), RAD (right adjunct) |
| Candidate list | 骨髓纤维化 (myelofibrosis), 髓纤 (MF), 骨髓增生性疾病 (myeloproliferative disease), 武汉协和医院 (Wuhan Concorde Hospital), 治疗效果 (therapeutic effect) |

Figure 2: An example sentence with dependency parsing and POS tagging. (POS tag sequence: Root, n, nz, v, n, n, wp, v, m, q, n, v, nz, wp, ni, v, a, u, v, n; the dependency arcs are labeled HED, ATT, SBV, VOB, COO, and RAD.)

Table 2: Statistics of the user-defined dictionary.

| Category | Term number |
|---|---|
| Body | 1802 |
| Disease | 48120 |
| Symptom | 3698 |
| Medicine | 42047 |
| Treatment | 7403 |
| Check | 768 |


Given a candidate $t_{cm}$ and a medical term $t_m$ in the LRs, we define their name similarity as the ratio of the length of their longest common substring to the length of the shorter term, as in (2). This similarity computation can capture nested entities. For example, for the candidate “胸肌内膜炎” (endomysitis), if $t_m$ is “胸肌” (chest muscle), we can regard “胸肌” (chest muscle) as a nested entity. In addition, because some terms and their fragments occur together in a text, we use the text distance constraint to improve the detector's accuracy. For example, suppose a text contains both “头孢” (cephalosporin) and “头孢拉定” (cefradine) and the compared medical term is “头孢拉定” (cefradine). For the candidate “头孢” (cephalosporin), if the text distance constraint is not used, the output medical entity is “头孢拉定” (cefradine), which is obviously an incorrect surface form for this candidate. In (3), $d_{t_{cm}}$ denotes the text containing the candidate entity, and the function “Loc” computes the location of its second parameter within its first parameter. Using these two equations, the detection process is given in Algorithm 1.
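As a worked instance of (2), using only the two examples from the paragraph above and counting lengths in characters, both pairs reach the maximum similarity, which illustrates why name similarity alone cannot separate “头孢” from “头孢拉定” and why the text distance in (3) is needed:

```latex
\[
\mathrm{Sim}_r(\text{胸肌内膜炎}, \text{胸肌})
  = \frac{\mathrm{Len}(\text{胸肌})}{\min(5, 2)} = \frac{2}{2} = 1,
\qquad
\mathrm{Sim}_r(\text{头孢}, \text{头孢拉定})
  = \frac{\mathrm{Len}(\text{头孢})}{\min(2, 4)} = \frac{2}{2} = 1 .
\]
```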

In Algorithm 1, the input includes the candidate entity set, the medical term set from the offline and online LRs, and the input text. The output is a set of medical entities. Given a candidate, we first compute its name similarity with each medical term in MT. If the two are identical, the candidate is regarded as a medical entity. If not, we select the medical terms whose similarity exceeds the predetermined threshold θ for the text distance calculation. For each such medical term, in order of name similarity, if the text distance between the medical term and the candidate is below the threshold δ, the medical term is output as the correct expression of the candidate. In addition, to handle misspelled ME names, we add the “Diff” function, which counts the number of characters that differ between a candidate and the medical term with the highest name similarity. If this number is less than the threshold ϵ (set in our experiments to half the number of characters in the medical term), we output the medical term as the correct expression of the candidate. For example, for the candidate “头孢拉丁” (cefradine), the compared medical term “头孢拉定” (cefradine) meets this condition. Therefore, we output this medical term instead of the candidate.
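The procedure described above can be sketched in Python. This is a hedged illustration rather than the authors' implementation: candidates are assumed to arrive as (surface form, character offset) pairs, the default values of `theta` and `delta` are placeholders chosen for the example (the text here does not state them), `dis` measures the distance to the nearest occurrence of the term, and `diff` counts position-wise mismatches plus the length difference. All function names are hypothetical, and strings are assumed to be non-empty.

```python
from difflib import SequenceMatcher

def lcs_len(a, b):
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def sim(c, m):
    """Name similarity, Eq. (2)."""
    return lcs_len(c, m) / min(len(c), len(m))

def dis(text, offset, m):
    """Distance from the candidate offset to the nearest occurrence of m, or None."""
    positions = [i for i in range(len(text)) if text.startswith(m, i)]
    return min((abs(offset - p) for p in positions), default=None)

def diff(c, m):
    """Number of differing characters (assumed: mismatches plus length difference)."""
    return sum(x != y for x, y in zip(c, m)) + abs(len(c) - len(m))

def detect(candidates, terms, text, theta=0.5, delta=2):
    entities = []
    for surface, offset in candidates:
        if surface in terms:                      # exact match with a medical term
            entities.append(surface)
            continue
        scored = [(sim(surface, m), m) for m in terms]
        ranked = [m for s, m in sorted(scored, reverse=True) if s > theta]
        for rank, m in enumerate(ranked):
            d = dis(text, offset, m)
            near = d is not None and d < delta
            # Diff is applied only to the most similar term; epsilon = len(m) / 2.
            misspelt = rank == 0 and diff(surface, m) < len(m) / 2
            if near or misspelt:
                entities.append(m)
                break
    return entities
```

On the made-up text “头孢过敏者改用头孢拉丁和青霉素” with the terms “头孢拉定” and “青霉素”, the misspelled candidate “头孢拉丁” is corrected to “头孢拉定”, “青霉素” matches exactly, and the bare candidate “头孢” is left for the online stage because neither the distance nor the misspelling condition holds.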

3.1.2. Entity Classification. Entity categories are additional information that characterizes the mentioned entities. They are essential ingredients in many medical applications, such as medical dictionaries, medical KBs, and medical service systems. Our classification approach is partly inspired by the use of seed knowledge and context signature similarity in [8]. It differs from the classification approach in [8] in four ways: (1) When collecting seed terms, we use the framework information in the terminology instead of the category tags, which reduces classification errors. Meanwhile, we classify some ME mentions based on text feature computation, thereby avoiding the constraints of dissimilar or missing context. (2) Signature vector computation is refined through word embedding, which measures semantic similarity better than the TF-IDF method. (3) The filtering threshold is generated automatically by averaging the signature similarity of seed terms, thereby reducing labor costs and increasing filtering accuracy. (4) The seed set is scaled up continually to improve coverage. The classification is implemented in three steps: seed term collection, signature generation, and category decision.

(1) Seed Term Collection. This step collects seed terms for the entity categories, from which the signature vectors of the categories are generated in the subsequent step. Here, we use Baidu Baike to gather the seed terms automatically. Through in-depth analysis, we find that medical entities of the same class have similar framework information in Baidu Baike, which identifies the entity category more accurately than the category tag alone. Therefore, we design a seed collection approach based on text feature computation. We define T = {s, a, d, c} as the set of text features in a Baidu Baike entry page, consisting of the subtitle “s”, the attribute names “a” of the infobox, the directory names “d” of the content, and the category tags “c”. The approach is implemented as follows: (1) From the self-built dictionary, we randomly select 50 terms from each category and extract and fuse their text features as the category signature. In particular, we use exact string matching to obtain unambiguous Baidu Baike entries. (2) For each candidate, we also crawl the feature information from Baidu Baike. Then, we calculate its string similarity with all category signatures by using (4) and assign the candidate to the category with the highest similarity. The symbols $W_c$ and $W_{cm}$ denote the word sets of a category signature and of the feature information for a candidate, respectively. Finally, the classified candidate entities are used as the seeds for the category signature computation in the next step.

Input: candidate set C, medical term set MT, text set T
Output: medical entity set ME

for c_i ∈ C, mt_j ∈ MT, t_z ∈ T do
    if c_i == mt_j then
        ME ← c_i
    end if
    set M = ∅
    if Sim(c_i, mt_j) > θ then
        M ← Rank(mt_j)
    end if
    for m_k ∈ M do
        if t_z contains m_k and Dis(c_i, m_k) < δ or Diff(c_i, m_k) < ϵ then
            ME ← m_k
            break
        end if
    end for
end for
return ME

Algorithm 1: Medical entity detection.


$$\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|}. \tag{4}$$
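A small sketch of the feature-overlap similarity in (4) and of the assignment to the best-matching category follows. The feature words used in the usage note are invented for illustration and were not crawled from Baidu Baike; the function names are hypothetical.

```python
def sim_c(w_cm, w_c):
    """Share of the candidate's feature words that also occur in the category signature, Eq. (4)."""
    if not w_cm:
        return 0.0
    return len(w_cm & w_c) / len(w_cm)

def assign_category(candidate_features, category_signatures):
    """Return the category whose signature overlaps most with the candidate's features."""
    return max(category_signatures,
               key=lambda name: sim_c(candidate_features, category_signatures[name]))
```

For instance, with a made-up Disease signature {病因, 症状, 治疗, 预防} and a Medicine signature {成分, 用法用量, 不良反应, 禁忌}, a candidate with features {病因, 症状, 检查} scores 2/3 against Disease and 0 against Medicine, so it is assigned to Disease.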

(2) Signature Generation. This step transforms the medical terms (including candidates and seeds) and the categories into signature vectors. We use the phrase “term signature” to denote the vector of an ME mention or a seed term. Because the internal words of a term are descriptive, we use both the internal and the context words for signature generation. To capture the semantic similarity between words, we use a word embedding approach to calculate the vector of a word. Specifically, we use the Word2Vec model, a distributed representation model that expresses the words in text as vectors based on deep learning technology [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase “category signature” to denote the vector of an entity category, which is computed by averaging the signature vectors of all seed terms belonging to the same class, following (5).
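The averaging step can be sketched without any embedding library. The two-dimensional vectors in the usage note are toy values standing in for trained Word2Vec vectors, and the function names are hypothetical; words missing from the embedding table are simply skipped, which is an assumption of this sketch.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors, as in Eq. (5)."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def term_signature(words, embeddings):
    """Average the vectors of a term's internal and context words."""
    return mean_vector([embeddings[w] for w in words if w in embeddings])

def category_signature(seed_signatures):
    """Average the term signatures of all seeds of one category."""
    return mean_vector(seed_signatures)
```

With toy embeddings 骨髓 → [1, 0], 纤维化 → [0, 1], and 症状 → [1, 0], the term signature of (骨髓, 纤维化) is [0.5, 0.5], and averaging it with the signature [1, 0] gives the category signature [0.75, 0.25].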

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbols are described in Table 3. The similarity between vectors is the cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. The filtering threshold is computed automatically by averaging the signature similarity of seed terms, following (7). In particular, $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C_{|c|}^{2}$ is the combination function that counts the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add the classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

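$$v_c = \frac{1}{|S|} \sum_{m \in S} v_m, \tag{5}$$

$$\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^{2}} \times \sqrt{\sum_{i=1}^{I} v_{bi}^{2}}}, \tag{6}$$

$$F(v_i, v_j) = \frac{1}{C_{|c|}^{2}} \sum_{i,j=1,\, i \ne j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j). \tag{7}$$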
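The cosine similarity in (6), the automatic threshold in (7), and the resulting category decision can be sketched as below. The sketch reads (7) as the mean cosine similarity over all unordered pairs of seed signatures of one category, which is an interpretation; the function names and the example vectors in the usage note are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors, Eq. (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filtering_threshold(seed_signatures):
    """Mean pairwise similarity of one category's seeds, Eq. (7); needs >= 2 seeds."""
    pairs = list(combinations(seed_signatures, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def decide(signature, category_signatures, thresholds):
    """Return the best category whose threshold is exceeded, or None for a nonmedical term."""
    passed = [(cosine(signature, vec), name)
              for name, vec in category_signatures.items()
              if cosine(signature, vec) > thresholds[name]]
    return max(passed)[1] if passed else None
```

For example, with category signatures Disease = [1, 0] and Symptom = [0, 1] and both thresholds set to 0.5, a candidate signature [0.9, 0.1] is assigned to Disease, while [-1, 0] passes no threshold and is filtered out.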

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as the basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the general procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top-ranked candidate as the linked entity. Mentions without a linked entity are regarded as NIL.

321 Candidate Entity Generation In this stage our goal isto increase the probability of the candidate set containing atarget entity and to control its size To accomplish the firstgoal we use the fuzzy string matching algorithm to computethe name similarity between a mention and all entities in theKB in accordance with (8) The function ldquoMCCrdquo acquires themost common characters between two strings in order Itcan well process the abbreviations and acronyms besidesthe standard names The entities exceeding the similaritythreshold α are included in the candidate set Howeverthis algorithm may result in a large candidate set

Table 3: Symbol description in Algorithm 2.

Symbol                 Description
$M$                    A set containing each candidate entity $m_i$ and its signature vector $s_{m_i}$
$F$                    A threshold set filtering the nonmedical entities
$t_k$                  A seed signature set of the same class
$s_a$, $s_b$           Seed signature
$c_j$, $c_t$           Category name
$s_{c_j}$, $s_{c_t}$   Category signature of $c_j$ or $c_t$
$f_{c_j}$              Filtering threshold of $c_j$

Input: candidate set $M$, seed signature set $T$, category signature set $C$
Output: medical entity-category set $E$
for $(m_i, s_{m_i}) \in M$ do
    set $F = \emptyset$, $D = \emptyset$
    for $t_k \in T$, $(s_a, s_b) \in t_k$ do
        $F \leftarrow F(s_a, s_b)$
    end for
    for $s_{c_j} \in C$, $f_{c_j} \in F$ do
        if $\mathrm{Sim}_{\cos}(s_{m_i}, s_{c_j}) > f_{c_j}$ then
            $D \leftarrow (\mathrm{Sim}_{\cos}(s_{m_i}, s_{c_j}), c_j)$
        end if
    end for
    if $D \neq \emptyset$ then
        $c_t \leftarrow \arg\max(D)$
        $E \leftarrow (m_i, c_t)$
        $T \leftarrow m_i$
        update $s_{c_t} \in C$ by (5)
    end if
end for
return $E$

Algorithm 2: Medical entity classification.
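The control flow of Algorithm 2 can be sketched in Python as below. This is an illustrative reading under stated assumptions, not the authors' code: signatures are lists of floats, each category starts with at least two seeds, the category signature and the threshold are recomputed from the current seed set on every pass, and all names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity as in (6); 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise average as in (5)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def threshold(seed_sigs):
    """Average pairwise seed similarity as in (7); needs at least two seeds."""
    pairs = list(combinations(seed_sigs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def classify(candidates, seeds):
    """Assign each candidate to its most similar category, or drop it.

    candidates: dict mapping mention -> signature vector.
    seeds: dict mapping category -> list of seed signatures (mutated in place).
    """
    entity_category = {}
    for mention, sig in candidates.items():
        passed = {}
        for category, seed_sigs in seeds.items():
            sim = cosine(sig, mean_vector(seed_sigs))
            if sim > threshold(seed_sigs):
                passed[category] = sim
        if passed:
            best = max(passed, key=passed.get)
            entity_category[mention] = best
            # Growing the seed set changes the category signature and the
            # filtering threshold the next time they are recomputed.
            seeds[best].append(sig)
    return entity_category


# Hypothetical 2-dimensional signatures, for illustration only.
seeds = {
    "disease": [[1.0, 0.1], [0.9, 0.2]],
    "symptom": [[0.1, 1.0], [0.2, 0.9]],
}
candidates = {"candidate_a": [0.95, 0.15], "candidate_b": [-1.0, -1.0]}
result = classify(candidates, seeds)
```

Here `candidate_a` passes the disease threshold and is added to the disease seeds, while `candidate_b` passes no threshold and is filtered out as a nonmedical term.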


To reduce the computational cost of the subsequent processing, we introduce the condition of category consistency to control the size of the candidate set. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute their similarity with the category signatures acquired in the seed term collection step, following (4). The candidates below a predefined threshold β are removed from the candidate set. This strategy can still handle terms that have the same name but different meanings. For example, for an ME mention “传染病” (epidemic), the candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{\mathrm{MCC}(t_{me}, t_e)}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
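The name similarity of (8) can be illustrated with a short Python sketch. It rests on one assumption that the text does not state explicitly: “the most common characters between two strings, in order” is read here as the longest common subsequence, and the numerator of (8) as its length. The function names are hypothetical.

```python
def mcc_length(a, b):
    """Length of the longest common subsequence of a and b.

    This is an assumed reading of the MCC function: the characters that
    both strings share, kept in their original order.
    """
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def name_similarity(mention, entity):
    """Eq. (8): shared characters divided by the length of the shorter string."""
    shorter = min(len(mention), len(entity))
    return mcc_length(mention, entity) / shorter if shorter else 0.0


# The abbreviation is fully contained, in order, in the full name.
abbreviation_score = name_similarity("髓纤", "骨髓纤维化")
unrelated_score = name_similarity("失眠", "盗汗")
```

Under this reading, an abbreviation whose characters all appear in order in the full name scores 1, which matches the remark that the measure copes with abbreviations and acronyms. A candidate would be kept when its score exceeds the threshold α.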

3.2.2. Candidate Entity Ranking. This stage aims to select the linking entity from the candidate set by ranking the candidates with a confidence score. We propose a collaborative inference method, which jointly exploits the name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce the entity popularity to discriminate between the candidate entities. Here, we use the number of visits to the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Considering that the entity popularity is not the only decisive criterion, we apply a conversion that keeps it effective without letting it dominate the other measures. Given the visiting number $n$, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n|+1}} \tag{9}$$

in which $|n|$ denotes the number of digits of $n$. For instance, the above integer (15348, with $|n| = 5$) is translated into 0.515348.
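The conversion in (9) can be checked with a few lines of Python. The function name is hypothetical, and the sketch assumes the visit count has at most nine digits, since the result would reach 1 or more beyond that.

```python
def entity_popularity(visits):
    """Eq. (9): map a positive visit count into (0, 1).

    The first decimal digit encodes the number of digits of the count,
    and the remaining decimals repeat the count itself.
    """
    if visits <= 0:
        raise ValueError("visits must be a positive integer")
    digits = len(str(visits))
    return (digits * 10 ** digits + visits) / 10 ** (digits + 1)


example = entity_popularity(15348)  # 0.515348, the example given in the text
```

Because the digit count leads the result, a page with more digits in its visit count always ranks above one with fewer, and pages with the same digit count are ordered by the count itself.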

The existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To increase the descriptive ability of the context words of a mention, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, this approach extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute the string similarity between this context information and the description content of each candidate by using (4). In particular, the symbols $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate. Of note, before the similarity computation, we need to remove the stop words from the context and the description content.
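The dependency-based context extraction described above might be sketched as follows. The sketch assumes the parser output is already available as a list of (head, dependent) word pairs; the edge list, the stop word set, and the function name are hypothetical and only mirror the myelofibrosis example.

```python
def relevant_context(mention, dependency_edges, stop_words=frozenset()):
    """Collect the words that share a dependency edge with the mention.

    dependency_edges: iterable of (head, dependent) word pairs, assumed
    to come from a dependency parser run on the sentence.
    """
    context = []
    for head, dependent in dependency_edges:
        if head == mention:
            other = dependent
        elif dependent == mention:
            other = head
        else:
            continue
        # Skip self-loops, stop words, and words already collected.
        if other != mention and other not in stop_words and other not in context:
            context.append(other)
    return context


# Hypothetical parse, written by hand for illustration only.
edges = [
    ("骨髓纤维化", "髓纤"),
    ("骨髓增生性疾病", "骨髓纤维化"),
    ("骨髓增生性疾病", "的"),
    ("一种", "骨髓增生性疾病"),
]
context_words = relevant_context("骨髓纤维化", edges, stop_words={"的"})
```

Only words directly attached to the mention are kept, so unrelated words that merely fall inside a fixed window do not enter the context set.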

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For these mentions, we add semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of the entities cooccurring in text are also correlated and have overlapping context information. The specific method is as follows: (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) controlling whether the semantic correlation is computed. If the context similarity of every candidate is 0 or the same, λ = 1. If not, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set $L$ of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}$$

The symbols “P,” “I,” “D,” and “A” denote the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
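How the terms of (10) combine, and how λ is switched, can be sketched in Python as follows. The sketch takes the four component scores as given numbers, because $\mathrm{Sim}_c$ follows (4), which is defined earlier in the paper. The dictionary keys, candidate names, and all numeric values are hypothetical.

```python
def confidence_scores(candidates):
    """Combine the component scores of each candidate as in (10).

    candidates: dict mapping candidate name -> dict with the keys
    'name_sim', 'popularity', 'context_sim', and 'semantic_corr'.
    """
    context_values = {c["context_sim"] for c in candidates.values()}
    # lambda = 1 when context similarity cannot separate the candidates,
    # that is, when it is zero or identical for all of them.
    lam = 1 if len(context_values) == 1 else 0
    return {
        name: c["name_sim"] + c["popularity"] + c["context_sim"]
        + lam * c["semantic_corr"]
        for name, c in candidates.items()
    }


# Hypothetical scores for two candidates with the same surface name.
candidates = {
    "candidate_a": {"name_sim": 1.0, "popularity": 0.5,
                    "context_sim": 0.01, "semantic_corr": 0.13},
    "candidate_b": {"name_sim": 1.0, "popularity": 0.5,
                    "context_sim": 0.01, "semantic_corr": 0.02},
}
scores = confidence_scores(candidates)
linking_entity = max(scores, key=scores.get)
```

In this toy case the context similarity is identical for both candidates, so λ = 1 and the semantic correlation decides the ranking.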

To illustrate the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠、多梦、盗汗⋯” (NS⋯ the dysfunction is reflected in insomnia, dreaminess, and night sweats⋯), the recognized ME mentions are “NS,” “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system),” “NS (nephrotic syndrome),” and “NS (normal saline).” Their name similarity is 1, and their other scores are shown in Figure 3. It must be noted that we only present part of the value of the entity popularity to save space. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity, with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework, including “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus and manually recognize all medical entities, classify them into the six categories in Table 3, and link them to the KB. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we have given it this name to distinguish it), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking. The seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [23] as the comparative method. In addition, it is necessary to state that we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] and our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). In particular, $|S_{\mathrm{link}}|$ and $|S_{\mathrm{NIL}}|$ denote the numbers of ME mentions that are correctly linked to entities in the KB and correctly left unlinked (NIL) by the method, respectively. $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{\mathrm{link}}| + |S_{\mathrm{NIL}}|}{|T|} \times 100\% \tag{11}$$
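For concreteness, the measures above can be computed as in the following Python sketch. The function and parameter names are hypothetical, the counts in the usage lines are made up for illustration, and returning 0.0 for an empty denominator is a convention chosen here rather than one stated in the text.

```python
def precision_recall_f1(num_correct, num_returned, num_valid):
    """P, R, and F1 from raw counts.

    num_correct: correct objects acquired by the method.
    num_returned: all objects acquired by the method.
    num_valid: valid (gold) objects in the corpus.
    """
    precision = num_correct / num_returned if num_returned else 0.0
    recall = num_correct / num_valid if num_valid else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def linking_accuracy(correct_links, correct_nils, total_mentions):
    """Eq. (11): share of mentions handled correctly, as a percentage."""
    return (correct_links + correct_nils) / total_mentions * 100


# Made-up counts, for illustration only.
p, r, f1 = precision_recall_f1(num_correct=8, num_returned=10, num_valid=16)
accuracy = linking_accuracy(correct_links=60, correct_nils=20, total_mentions=100)
```

Accuracy differs from precision and recall in that it also credits the method for mentions it correctly leaves unlinked.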

Figure 3: Example of linking the ME mention “NS”. The diagram connects the ME mention “NS” with its candidate entities “NS (nervous system),” “NS (nephrotic syndrome),” and “NS (normal saline),” the target entity, the collaborator-linking entities (insomnia, dreaminess, and night sweats), and their description content. Legend: C, context similarity; P, entity popularity; S, semantic correlation. The scores shown are C = 0.01 for each candidate, P = 0.6179, 0.6189, and 0.6177, and S = 0.13, 0.02, and 0.00.

Table 4: Statistics of the corpus.

Corpus           MEs     MEs linking to KB    NIL
Family-doctor    2531    1524                 1007
Muzhi-doctor     1876    1096                 780
Qiuyi            2189    1201                 988

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3. This means that if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α is applied to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and online detection for all datasets. Recall improves noticeably after the online detection process. This indicates that online detection is effective in overcoming the coverage limitation of the dictionary-based method. However, the precision is limited, mainly because some irrelevant terms in the candidate set are not filtered out by the online detection process. Therefore, in the entity classification stage, we add the filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all medical entities have been extracted correctly from the text and that our task is to classify them into the predefined categories. Table 5 presents the classification results on each corpus. Overall, the method achieves 81.85% precision and 75.84% recall. The recall is lower because, when filtering the nonmedical entities, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom,” “treatment,” and “check” is somewhat low. One possible reason for this is that these entities are mostly classified based on the context signature similarity. However, the lack of identifying information in the context reduces the similarity score, thus impacting the classification performance.

Figure 4: Experimental results after offline detection and online detection on the corpus (%). Panels (a)–(c) plot precision, recall, and F1 (vertical axis from 50 to 100) after offline detection and after online detection.


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach on the symptom category only. This is because we only acquired the seeds of the symptom category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER’s precision is slightly low. One possible reason for this is that the symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs. Therefore, they are mainly recognized by the online detection method. However, the combined mentions produce diverse search results, from which it is difficult to get a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its description, reducing the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall. The F1 value of unMER increases by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows: (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities, thereby reducing the recall. In addition, the chunker utilizes a common NLP tool, which had poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, and this resulted in semantic deviation for the BM-NER approach, reducing its classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that for the body category, we do not have the features of Dic-CRF and hence do not present its results. On the three corpora, the F1 value of unMER is higher than that of the Dic-CRF approach by 15.01%, 13.68%, and 12.68%, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. However, Dic-CRF uses a medical dictionary for word segmentation; this can easily lead to incorrect segmentation, especially for the combined entities. In addition, the defined features have low coverage across the entity types, which is also a reason for the low recall. Moreover, the informal description of the online medical text also reduces the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER has good recognition performance on nested entities.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the hatched cylinder represents unMER, and the other cylinder represents bubble-bootstrapping). The chart plots precision, recall, and F1 for the Family-doctor, Muzhi-doctor, and Qiuyi corpora.

Table 5: Entity classification results on the corpus (%).

                   Family-doctor            Muzhi-doctor             Qiuyi
Entity category    P       R       F1       P       R       F1       P       R       F1
All                82.49   79.13   80.78    80.91   73.86   77.22    82.16   74.35   78.06
Body               85.17   81.20   83.14    83.62   80.13   81.84    83.53   80.21   81.84
Disease            80.45   82.16   81.30    81.96   82.67   82.31    79.26   81.68   80.45
Symptom            78.19   60.84   68.43    74.54   61.73   67.53    76.26   61.52   68.10
Medicine           82.31   79.62   80.94    80.63   75.84   78.16    84.57   78.39   81.36
Treatment          76.86   67.25   71.73    75.24   63.59   68.93    76.61   61.82   68.42
Check              75.14   65.53   70.01    74.50   67.26   70.70    73.48   63.72   68.25

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. One possible reason is that QCV relies on the similar relationships in the KB between the mentions within a specific window, so it essentially uses context similarity for linking. Therefore, the noise and the lack of information in the context reduce its linking performance. However, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct the misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL utilizes fuzzy string matching to generate candidate entities, which omits some target entities that are completely different in surface form, reducing the linking recall.

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment conducted with our recognized entities. Compared to the above linking results, both the precision and the recall decline somewhat. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the hatched cylinder represents unMER, and the other cylinder represents BM-NER). Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, and F1 for the categories check, body, disease, treatment, medicine, and symptom.

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the hatched cylinder represents unMER, and the other cylinder represents Dic-CRF). Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, and F1 for the categories body, disease, check, medicine, symptom, and treatment.


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described an unsupervised framework, namely, unMERL, for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental results show that unMERL consistently outperforms the compared approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings AMIA Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

                 unMERL
Corpus           P       R       F1      A
Family-doctor    82.64   73.26   77.67   83.23
Muzhi-doctor     83.37   74.41   78.64   83.48
Qiuyi            82.15   72.79   77.19   82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the hatched cylinder represents unMEL, and the other cylinder represents QCV). Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, accuracy, and F1 (vertical axis from 50 to 100) for QCV and unMEL.


[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD '04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.



much attention because of its importance in many applications, such as understanding medical text, KB construction, and Q&A systems. However, existing works on this topic mainly focus on well-formed English text, such as electronic patient records and medical reports. Few studies have focused on Chinese online medical Q&A text data. The research challenges can be summarized as follows: (1) The online medical Q&A text is characterized by unreliable tokenization, abbreviations, and misspellings. This makes it very difficult to recognize the correct entity boundary. (2) The text is generally brief and lacks rich context information. This limits the context that can be leveraged to assist the linking. (3) Compared to English, Chinese has more complicated syntax rules, so it is difficult to reuse solutions designed for the English language.

In this paper, we design an unsupervised framework, namely, unMERL, that recognizes and links ME mentions in Chinese online medical text. To the best of our knowledge, this is the first paper that describes such a comprehensive framework for Chinese medical text. The main contributions of this work are as follows:

(1) unMERL utilizes a knowledge-driven approach to detect the ME boundaries, which incorporates an offline and an online process, thereby significantly improving recognition performance. In addition, the strategy of exploiting the dependency relationships between words can capture nested and combined medical entities well.

(2) unMERL uses an improved classifier based on text feature computation and semantic signature similarity, which can efficiently classify medical entities and further filter nonmedical entities.

(3) The linking approach jointly considers the name similarity, entity popularity, category consistency, context similarity, and semantic correlation between entities, which can better distinguish and determine the candidate entities. In addition, to address the incompleteness of the KB, we introduce an incremental evidence mining process, thereby significantly improving the linking performance.

(4) We extensively evaluate unMERL on the ME recognition and linking task over real datasets. The experimental results show that unMERL achieves significantly higher performance than current mainstream methods.

The remainder of this paper is organized as follows. Section 2 discusses related works. Section 3 presents our framework in detail. Section 4 describes our experiments, results, and discussion, and Section 5 concludes this paper.

2 Related Works

Entity recognition has been widely studied in the medical domain. Early works on this topic relied on heuristic rules and lexical resources [1–4]. Based on the name characteristics of medical entities, the researchers encoded and mapped the terms in clinical text to the lexical resources. In particular, the widely applied systems include MedLEE [1], EDGAR [2], and MetaMap [3]. The most well-known medical lexicons include MetaThesaurus [5], MeSH (Medical Subject Headings) [6], and SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms) [7]. The Chinese version of SNOMED-CT was published in 1997. The rule-based and lexicon-based systems depended on name regularity and lexicon size, which restricted them to extracting a limited set of normative entities. By incorporating the dependency relationships between words and an online detection process with a search engine, our approach addresses these problems.

More recently, Zhang and Elhadad proposed an unsupervised approach to biomedical named entity recognition, leveraging terminologies, syntactic knowledge, and corpus statistics [8]. In addition, the bootstrapping algorithm has attracted much attention in the context of medical entity recognition [9, 10]. Bootstrapping is an unsupervised machine learning approach which starts from small sets of seeds or rules and iteratively labels the corpus with them by pattern matching [11]. However, it relies on the quality of the seeds and the normalization of the corpus, and it easily produces semantic drift when incorrect seed categories and irregular context information are involved.

In recent years, many researchers have focused on using statistical machine learning approaches in the medical field. The ME recognition problem is transformed into a sequence annotation or a classification problem. The lexical, syntactic, and semantic features of words are used for training various learning models, such as HMM (hidden Markov model) [12, 13], MEM (maximum entropy model) [14, 15], CRF (conditional random field) [13, 16–19], and (structured) SVM (support vector machine) [19]. In addition, to alleviate the limitations of a single model, some researchers proposed cascading methods [15, 20], which combine multiple models, including CRF, (structured) SVM, and MEM. However, the supervised machine learning-based approaches rely on a large training corpus, which needs to be annotated by humans. Besides, it is difficult for the feature set to cover all entity types. As a result, they are usually task-dependent. To solve this problem, we propose an unsupervised approach which leverages syntactic knowledge, corpus statistics, and lexical resources for ME recognition.

Medical entity linking is a newer problem. Some effective approaches for English corpora have been proposed. Glavas exploited semantic textual similarity for linking entity mentions in clinical text [21]. Zheng et al. proposed a collective inference approach, which leverages semantic information and structures in ontology to solve the entity linking problem for biomedical literature [22]. Wang et al. proposed a graph-based linking approach, which first constructs graphs for mentions, KB, and candidates and then exploits the information entropy and similarity algorithm to link biomedical entities [23]. These approaches depend on the context and the KB. Therefore, noise and missing information in the context reduce the accuracy of the linking. In addition, the graph-based approaches have a high computational cost, and the imperfection of the KB also impacts the performance of the linking.

2 Journal of Healthcare Engineering

Our linking approach jointly considers multiple kinds of entity knowledge, which makes it more accurate in distinguishing and determining the candidate entities, at a lower computational cost. Moreover, our solution adds a step that extracts the relevant context to reduce the noise problem. To optimize the local KB, we also introduce an incremental evidence mining process with a third-party KB. Entity linking in the Chinese medical domain has been studied less than entity linking in the English medical domain. To our knowledge, our linking approach is the first solution for Chinese online medical text.

3. UnMERL

The framework of unMERL is shown in Figure 1. unMERL consists of two modules. The ME recognition module consumes an input corpus and performs entity boundary detection and entity classification. Its output is a set of medical entities and their categories. For each recognized medical entity, the ME linking module generates the candidates from the KB and then acquires the target entity by ranking them.

3.1. Medical Entity Recognition. The ME recognition module aims to detect and classify all ME mentions in the input corpus. Named entity recognition (NER) [24] involves two main steps: detecting entity boundaries and classifying the entities into predefined categories. Accordingly, our ME recognition module is implemented as a sequence of two separate processes: boundary detection and entity classification.

3.1.1. Boundary Detection. This step detects the boundaries of medical entities, collecting candidates for entity classification. In our solution, unMERL exploits a knowledge-driven method, mapping the input text to concepts in the lexical resources. Compared to the existing dictionary-based approaches, our approach differs in the following ways: (1) The entity candidates are identified based on the dependency relationships between words. This strategy captures combined and nested entities well and reduces the computational cost of the subsequent process by downsizing the candidate set. (2) A search engine is included as a lexical resource, which removes the restriction to the limited terms in the dictionary and performs well in detecting variational and rare entity names. The detection process is roughly divided into two stages: candidate entity generation and medical entity detection.

(1) Candidate Entity Generation. Through corpus analysis, we find that a long medical entity is usually segmented into several fragments by a common natural language processing tool. The POS tag of each fragment is included in Table 1. Moreover, these fragments generally have an attributive dependency relationship. For example, the text “骨髓纤维化简称髓纤，是一种骨髓增生性疾病，武汉协和医院有很好的治疗效果” (myelofibrosis, or MF in brief, is a myeloproliferative disease, for which Wuhan Concorde Hospital has a very good therapeutic effect), parsed by the HanLP dependency parser (http://hanlp.linrunsoft.com/), is shown in Figure 2. The dependency labels are shown in Table 1. Based on the hypothesis that entities should be noun phrases (NPs), we extract native NPs from the automatically parsed dependency trees as candidate entities. A native NP is a single noun (without attributive modifiers) or a maximal noun phrase with the POS tags in Table 1 and the dependency “ATT”. The candidate entities extracted from the above text are shown in the third row of Table 1.

However, not all noun phrases are medical entities. In order to remove the nonmedical NPs, we employ a knowledge-driven method whose aim is to discover the concepts in the lexical resources that are referred to in the text. Here, we use Chinese SNOMED-CT [25], the medical KB of Baidu Baike (https://baike.baidu.com/science/medical), and the Sogou medical dictionaries (http://pinyin.sogou.com/dict/cate/index/132?rf=dictindex) as the offline lexical resources (LRs). In order to mitigate the limited coverage of the above resources, we also use Baidu Search (https://www.baidu.com/) as an online lexical resource to help recognize the medical entities.

(2) LR Description. As mentioned before, Chinese SNOMED-CT, translated from SNOMED-CT (English), is a standard of clinical medicine and contains more than 140,000 clinical terms covering most aspects of clinical information. To correct incorrect terms in the translated version, we add the medical KB of Baidu Baike and the Sogou medical dictionaries. Baidu Baike contains more than 25,000 medical terms edited by authoritative organizations and experts. The Sogou medical dictionaries, as the lexicon resource of Sogou's input method, collect data from multiple medical websites. Baidu Search is the largest Chinese search engine, with which we can obtain information that standardized LRs do not cover, such as emerging, rare, and variational medical entities.

Figure 1: Architecture of unMERL. The ME recognition module comprises boundary detection (candidate entity generation, offline detection, and online detection) and entity classification (seed term collection, signature generation, and category decision). The ME linking module comprises candidate entity generation and candidate entity ranking by collaborative inference. The corpus is the input, the lexical resources, search engine, and knowledge base are the supporting resources, and the results are the output. Arrows denote control flow and data flow.

(3) LR Preprocessing. Considering the heterogeneity and redundancy of the above offline LRs, we extract and fuse the medical terms from them to build a dictionary. In particular, we select specific categories of interest, which are also the target categories of our entity classification. Table 2 presents the statistics of the self-built dictionary. In addition, to improve retrieval efficiency, we build indices using the first phonetic letter of each term.

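$$t_m = \mathrm{maxOccur}(r_1, r_2, \ldots, r_j), \qquad r_j = \begin{cases} \mathrm{LCS}(t_k, s_k), & k \in K, \ \text{if } \mathrm{Len}(\mathrm{LCS}(t_k, s_k)) > 1, \\ \mathrm{LCS}(t_k, t_i), & k, i \in K, \ k \neq i. \end{cases} \tag{1}$$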

Given a candidate entity, the results returned by Baidu Search contain not only the target medical term but also noise that impacts the performance of entity recognition. Therefore, we need to process the search results to obtain a clean medical term. Based on the common knowledge that there are more correct results than incorrect results, the method is implemented based on corpus statistics and “LCS” (a function that returns the longest common substring), as shown in (1). Given the search result set $S = \{(t_k, s_k)\}_{k=1}^{K}$ ($t_k$ represents a title and $s_k$ represents a summary), we first get the kernel term from each result by using the “LCS” function. However, not all summaries contain the kernel terms in the titles. Therefore, we add the process $\mathrm{LCS}(t_k, t_i)$ for the search results without a common substring. Finally, from the kernel term set, we select the most frequent term $t_m$ as the correct medical term.
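The kernel-term selection in (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lcs` and `kernel_term` are hypothetical, search results are assumed to arrive as (title, summary) pairs, and the fallback comparison between titles is also restricted to substrings longer than one character, which (1) does not state explicitly.

```python
from collections import Counter
from difflib import SequenceMatcher


def lcs(a: str, b: str) -> str:
    """Longest common substring of a and b (empty string if there is none)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def kernel_term(results):
    """results: list of (title, summary) pairs (assumed input format).

    Returns the most frequent kernel term, or None if no kernel is found.
    """
    kernels = []
    for k, (title, summary) in enumerate(results):
        r = lcs(title, summary)
        if len(r) > 1:
            kernels.append(r)
            continue
        # Fallback of Eq. (1): compare this title with every other title.
        for i, (other_title, _) in enumerate(results):
            if i != k:
                r = lcs(title, other_title)
                if len(r) > 1:  # assumption: same length filter as above
                    kernels.append(r)
    if not kernels:
        return None
    return Counter(kernels).most_common(1)[0][0]


results = [
    ("头孢拉定胶囊说明书", "头孢拉定胶囊用于感染"),
    ("头孢拉定_百科", "头孢拉定是一种抗生素"),
    ("头孢拉定的作用", "本品头孢拉定适用于"),
]
print(kernel_term(results))
```

With these three toy results, the kernels are “头孢拉定胶囊”, “头孢拉定”, and “头孢拉定”, so the most frequent term “头孢拉定” is selected.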

In addition, considering that the search engine cannot by itself filter the nonmedical entities in the candidate set, we establish a medical keyword set. This includes “医” (medicine), “药” (drug), “病” (disease), and “症” (symptom). If a search result contains one or more keywords in the above set, we identify the candidate as a medical entity. If not, it is removed as a nonmedical entity.
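The keyword filter amounts to a substring test over the returned search results. A minimal sketch follows, in which the name `looks_medical` and the representation of search results as plain strings are assumptions made for illustration.

```python
# The four keywords listed in the text.
MEDICAL_KEYWORDS = ("医", "药", "病", "症")


def looks_medical(search_results):
    """True if any search result text contains at least one medical keyword."""
    return any(keyword in text
               for text in search_results
               for keyword in MEDICAL_KEYWORDS)
```

Note that such a filter is deliberately permissive: a result mentioning a hospital (“医院”) also passes, which is consistent with the later use of a filtering threshold in the classification stage.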

(4) Medical Entity Detection. Once the medical terms are acquired from the offline and online LRs, the medical entities can be detected from the candidate set. Because the LRs have different characteristics (the offline LRs have high accuracy but limited coverage, and the online LR has high coverage but lower accuracy), we divide the detection process into offline and online processes. Given the candidate set, unMERL first performs the offline detection with the self-built dictionary. For the candidates output as nonmedical, unMERL performs the online detection with the Baidu search engine. Here, we exploit string matching and a text distance constraint to implement the detection process.

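$$\mathrm{Sim}_r(t_{cm}, t_m) = \frac{\mathrm{Len}(\mathrm{LCS}(t_{cm}, t_m))}{\min(\mathrm{Len}(t_{cm}), \mathrm{Len}(t_m))} \tag{2}$$

$$\mathrm{Dis}(t_{cm}, t_m) = \left| \mathrm{Loc}(d_{t_{cm}}, t_{cm}) - \mathrm{Loc}(d_{t_{cm}}, t_m) \right| \tag{3}$$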

Table 1: Constraints on POS tags, description of dependency labels, and candidate entities of the sentence in Figure 2.

Notation | Description
POS tags | f (preposition of locality), m (measure word), b (distinguishing word), rr (personal pronoun), v (verbal word), gb* (word related to biology), or n* (noun)
Dependency labels | HED (head), SBV (subject-verb), VOB (verb-object), ATT (attribution), COO (coordination), RAD (right adjunct)
Candidate list | 骨髓纤维化 (myelofibrosis), 髓纤 (MF), 骨髓增生性疾病 (myeloproliferative disease), 武汉协和医院 (Wuhan Concorde Hospital), 治疗效果 (therapeutic effect)

Figure 2: An example sentence with dependency parsing and POS tagging.

Table 2: Statistics of the user-defined dictionary.

Category | Number of terms
Body | 1802
Disease | 48120
Symptom | 3698
Medicine | 42047
Treatment | 7403
Check | 768


Given a candidate $t_{cm}$ and a medical term $t_m$ in the LRs, we use the ratio of the length of their longest common substring to the length of the shorter term as their name similarity, as in (2). The similarity computation can capture nested entities. For example, for a candidate “胸肌内膜炎” (endomysitis), if $t_m$ is “胸肌” (chest muscle), we can regard “胸肌” (chest muscle) as a nested entity. In addition, considering that some terms and their fragments occur together in a text, we use the text distance constraint to improve the detector's accuracy. For example, suppose a text contains both “头孢” (cephalosporin) and “头孢拉定” (cefradine), and the compared medical term is “头孢拉定” (cefradine). For the candidate “头孢” (cephalosporin), if the text distance constraint is not used, the output medical entity is “头孢拉定” (cefradine). Obviously, this is an incorrect surface form for the candidate “头孢” (cephalosporin). In (3), $d_{t_{cm}}$ represents the text containing the candidate entity. The function “Loc” computes the location of the second parameter in the first parameter. Using the above two equations, the detection process is given in Algorithm 1.
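Equations (2) and (3) can be illustrated with the following minimal sketch. It rests on assumptions that the text does not fix: “Loc” is taken to be the character offset of the first occurrence of a string in the text, the distance is taken as an absolute difference, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def lcs_len(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def name_similarity(candidate: str, term: str) -> float:
    """Eq. (2): LCS length divided by the length of the shorter string."""
    if not candidate or not term:
        return 0.0
    return lcs_len(candidate, term) / min(len(candidate), len(term))


def text_distance(text: str, candidate: str, term: str):
    """Eq. (3): gap between the first occurrences of candidate and term.

    Returns None when either string does not occur in the text.
    """
    loc_candidate, loc_term = text.find(candidate), text.find(term)
    if loc_candidate < 0 or loc_term < 0:
        return None
    return abs(loc_candidate - loc_term)


text = "我吃了头孢，后来医生开了头孢拉定"
print(name_similarity("胸肌内膜炎", "胸肌"))   # nested entity: similarity 1.0
print(text_distance(text, "头孢", "头孢拉定"))  # 9 characters apart
```

In the second call the two names are nine characters apart, so with a threshold such as δ = 3 the candidate “头孢” would not be replaced by “头孢拉定”, which is the behavior the text distance constraint is meant to ensure.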

In Algorithm 1, the input includes the candidate entity set, the medical term set from the offline and online LRs, and the input text. The output is a set of medical entities. Given a candidate, we first compute its name similarity with each medical term in MT. If they are the same, the candidate is regarded as a medical entity. If not, we select the medical terms exceeding the predetermined similarity threshold θ for the text distance calculation. For each medical term, ranked by name similarity, if the text distance between the medical term and the candidate is under the threshold δ, the medical term is output as the correct expression of this candidate. In addition, considering the existence of misspelled ME names, we add the “Diff” function to recognize them. This involves counting the number of different characters between a candidate and the medical term with the highest name similarity. If the number is less than the threshold ϵ (in our experiments, it is set to half the number of characters in the medical term), we output the medical term as the correct expression of this candidate. For example, for a candidate “头孢拉丁” (a misspelling of cefradine), the compared medical term “头孢拉定” (cefradine) meets the above condition. Therefore, we output this medical term instead of the candidate.
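The detection procedure described above can be approximated by the following runnable sketch. It is an illustration under several assumptions rather than the authors' code: `sim`, `dis`, `diff`, and `detect` are hypothetical names, “Loc” is the character offset of the first occurrence, “Diff” counts position-wise mismatches plus the length gap, ϵ is half the length of the medical term, and the misspelling test is applied to each ranked term in turn instead of only the top-ranked one. All strings are assumed to be nonempty.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Eq. (2): longest common substring length over the shorter length."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return m.size / min(len(a), len(b))


def dis(text, a, b):
    """Eq. (3) with character offsets; None if a or b is absent from text."""
    i, j = text.find(a), text.find(b)
    return abs(i - j) if i >= 0 and j >= 0 else None


def diff(a, b):
    """Assumed Diff: position-wise mismatches plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def detect(candidates, medical_terms, text, theta=0.5, delta=3):
    entities = []
    for c in candidates:
        if c in medical_terms:  # exact match with a known medical term
            entities.append(c)
            continue
        # Terms above the similarity threshold, most similar first.
        ranked = sorted((t for t in medical_terms if sim(c, t) > theta),
                        key=lambda t: sim(c, t), reverse=True)
        for t in ranked:
            d = dis(text, c, t)
            near = d is not None and d < delta
            misspelled = diff(c, t) < len(t) / 2  # epsilon = half the term
            if near or misspelled:
                entities.append(t)
                break
    return entities


text = "我吃了头孢，后来医生开了头孢拉丁"
print(detect(["头孢拉丁", "头孢", "治疗效果"], ["头孢拉定", "头孢"], text))
```

Here the misspelled candidate “头孢拉丁” is mapped to “头孢拉定”, the candidate “头孢” is kept through the exact match, and “治疗效果” is dropped because no medical term is similar enough.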

3.1.2. Entity Classification. Entity categories are additional information for characterizing the entities mentioned. They are essential ingredients in many medical applications, such as medical dictionaries, medical KBs, and medical service systems. Our classification approach is partly inspired by the use of seed knowledge and context signature similarity in [8]. Our approach differs from the classification approach in [8] in the following four ways: (1) In the collection of seed terms, we use the framework information in the terminology instead of the category tags, reducing classification error. Meanwhile, we classify some ME mentions based on text feature computation, thereby avoiding the constraints of dissimilar context and lack of context. (2) Signature vector computation is refined through word embedding, which measures semantic similarity better than the TF-IDF method. (3) The filtering threshold is automatically generated by averaging the signature similarity of seed terms, thereby reducing labor costs and increasing filtering accuracy. (4) The seed set is scaled up continually to improve coverage. The classification is implemented by applying the following three steps: seed term collection, signature generation, and category decision.

(1) Seed Term Collection. This step involves collecting seed terms for the entity categories, based on which the signature vectors of the categories will be generated in the subsequent step. Here, we utilize Baidu Baike to automatically gather the seed terms. In an in-depth analysis, we find that medical entities of the same class have similar framework information in Baidu Baike, which is more accurate than the category tag alone in identifying the entity category. Therefore, we design a seed collection approach based on text feature computation. Here, we define T = {s, a, d, c} as the set of text features, with the subtitle “s”, the attribute names “a” of the infobox, the directory names “d” of the content, and the category tags “c” in the entry page of Baidu Baike. The approach is implemented as follows: (1) From the self-built dictionary, we randomly select 50 terms from each category and extract and fuse their text features as the category signature. In particular, we exploit exact string matching to retrieve unambiguous Baidu Baike entries. (2) For each candidate, we also crawl the feature information from Baidu Baike. Then, we calculate its string similarity with all category signatures by using (4) and classify the candidate into the category with the highest similarity. In particular, $W_c$ and $W_{cm}$ represent the word sets of a category signature and of the feature information for a candidate, respectively. Finally, the classified candidate entities are used as the seeds for the category signature computation in the next step.

Input: candidate set C, medical term set MT, text set T
Output: medical entity set ME

for c_i ∈ C, mt_j ∈ MT, t_z ∈ T do
    if c_i == mt_j then
        ME ← c_i
    end if
    set M = ∅
    if Sim(c_i, mt_j) > θ then
        M ← Rank(mt_j)
    end if
    for m_k ∈ M do
        if t_z contains m_k and Dis(c_i, m_k) < δ or Diff(c_i, m_k) < ϵ then
            ME ← m_k
            break
        end if
    end for
end for
return ME

Algorithm 1: Medical entity detection.


$$\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|} \tag{4}$$
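Equation (4), which is used in seed term collection, is a word-overlap ratio normalized by the size of the candidate's word set. The sketch below illustrates it together with the assignment of a candidate to the best-matching category. The function names and the toy English feature words are hypothetical and stand in for the features crawled from Baidu Baike.

```python
def overlap_similarity(candidate_words, category_words):
    """Eq. (4): share of the candidate's words found in the category signature."""
    candidate_words, category_words = set(candidate_words), set(category_words)
    if not candidate_words:
        return 0.0
    return len(candidate_words & category_words) / len(candidate_words)


def classify_by_features(candidate_words, category_signatures):
    """Return the (category, similarity) pair with the highest Eq. (4) score."""
    scores = {category: overlap_similarity(candidate_words, words)
              for category, words in category_signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy signatures (illustrative only, not taken from the paper).
signatures = {
    "disease": {"cause", "symptom", "treatment", "prevention", "diagnosis"},
    "medicine": {"ingredient", "dosage", "indication", "adverse reaction"},
}
print(classify_by_features({"cause", "symptom", "complication", "diagnosis"},
                           signatures))
```

Three of the candidate's four feature words occur in the toy “disease” signature, so the candidate is assigned to that category with similarity 0.75.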

(2) Signature Generation. This step involves transforming the medical terms (including candidates and seeds) and categories into signature vectors. Here, we use the phrase “term signature” to denote the vector of an ME mention or a seed term. Considering that the internal words of a term are descriptive, we use both the internal and the context words for signature generation. To capture the semantic similarity between words, we exploit a word embedding approach to calculate the vector of a word. Here, we use the Word2Vec model, a neural distributed representation model, to express the words in text as vectors [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase “category signature” to denote the vector of an entity category. This is computed by averaging the signature vectors of all seed terms belonging to the same class, following (5).

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbol description is shown in Table 3. The similarity between vectors is the cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. In addition, the filtering threshold is automatically computed by averaging the signature similarity of seed terms, following (7). In particular, $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C^2_{|c|}$ is the combination function, counting the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add the classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

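$$v_c = \frac{1}{|S|} \sum_{m \in S} v_m \tag{5}$$

$$\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^2} \times \sqrt{\sum_{i=1}^{I} v_{bi}^2}} \tag{6}$$

$$F(v_i, v_j) = \frac{1}{C^2_{|c|}} \sum_{i,j=1,\ i \neq j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j) \tag{7}$$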
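Equations (5) to (7) can be written out in plain Python as follows. This is a sketch with hypothetical function names. One reading is assumed for (7): since the text describes the threshold as an average over the combinations of any two seeds, the sum is taken over unordered pairs, so that each pair is counted once.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Eq. (5): element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Eq. (6): cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtering_threshold(seed_vectors):
    """Eq. (7): mean cosine similarity over all unordered seed pairs.

    Requires at least two seed vectors.
    """
    pairs = list(combinations(seed_vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

For instance, for the three toy seed vectors [1, 0], [1, 0], and [0, 1], only one of the three pairs is similar, so the threshold is 1/3.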

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as the basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top candidate (after ranking) as the linking entity. The mentions without linking entities are regarded as NIL.

3.2.1. Candidate Entity Generation. In this stage, our goal is to increase the probability that the candidate set contains the target entity while controlling its size. To accomplish the first goal, we use a fuzzy string matching algorithm to compute the name similarity between a mention and all entities in the KB, in accordance with (8). The function “MCC” acquires the most common characters between two strings in order. It handles abbreviations and acronyms well, in addition to the standard names. The entities exceeding the similarity threshold α are included in the candidate set. However, this algorithm may result in a large candidate set.

Table 3: Symbol description in Algorithm 2.

Symbol | Description
$M$ | A set containing each candidate entity $m_i$ and its signature vector $s_{m_i}$
$F$ | A threshold set filtering the nonmedical entities
$t_k$ | A seed signature set of the same class
$s_a$, $s_b$ | Seed signature
$c_j$, $c_t$ | Category name
$s_{c_j}$, $s_{c_t}$ | Category signature of $c_j$ or $c_t$
$f_{c_j}$ | Filtering threshold of $c_j$

Input: candidate set M, seed signature set T, category signature set C
Output: medical entity-category set E

for (m_i, s_mi) ∈ M do
    set F = ∅, D = ∅
    for t_k ∈ T, (s_a, s_b) ∈ t_k do
        F ← F(s_a, s_b)
    end for
    for s_cj ∈ C, f_cj ∈ F do
        if Sim_cos(s_mi, s_cj) > f_cj then
            D ← (Sim_cos(s_mi, s_cj), c_j)
        end if
    end for
    if D ≠ ∅ then
        c_t ← arg max(D)
        E ← (m_i, c_t)
        T ← m_i
        update s_ct ∈ C by (5)
    end if
end for
return E

Algorithm 2: Medical entity classification.
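A compact, runnable reading of Algorithm 2 is given below. It is a sketch under assumptions, not the authors' implementation: signatures are plain lists of floats, every category starts with at least two seed vectors, the threshold of (7) is averaged over unordered seed pairs, and the names `classify`, `mean`, and `threshold` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Eq. (5): category signature as the average of its seed signatures."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def threshold(seed_vectors):
    """Eq. (7): average pairwise similarity of the seeds of one category."""
    pairs = list(combinations(seed_vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def classify(candidates, seeds):
    """candidates: {name: vector}; seeds: {category: [vectors]}, mutated in place."""
    result = {}
    for name, vector in candidates.items():
        passed = []
        for category, seed_vectors in seeds.items():
            score = cosine(vector, mean(seed_vectors))
            if score > threshold(seed_vectors):
                passed.append((score, category))
        if passed:
            _, best = max(passed)
            result[name] = best
            # Grow the seed set; signature and threshold are recomputed
            # from it the next time the category is visited.
            seeds[best].append(vector)
    return result


seeds = {"disease": [[1.0, 0.0], [0.9, 0.1]],
         "medicine": [[0.0, 1.0], [0.1, 0.9]]}
print(classify({"x": [0.95, 0.05], "y": [0.5, 0.5]}, seeds))
```

In this toy run, “x” lies close to the disease seeds and is classified as a disease, while “y” falls below both thresholds and is filtered out as a nonmedical entity.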


To reduce the computational cost of the subsequent processing, we introduce the condition of category consistency to control the size. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute their similarity with the category signatures acquired in the seed term collection step, following (4). The candidates under a predefined threshold β are removed from the candidate set. This strategy also handles terms that have the same name but different meanings. For example, for an ME mention “传染病” (epidemic), the candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{\mathrm{MCC}(t_{me}, t_e)}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
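One way to realize (8) is to interpret “MCC” as the longest common subsequence, that is, the largest set of characters shared by the two strings in the same order, and to use its length as the numerator. That interpretation is an assumption, as are the function names in the sketch below.

```python
def mcc_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters shared in order)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def link_name_similarity(mention: str, entity: str) -> float:
    """Eq. (8) under the assumption that MCC is the common subsequence."""
    if not mention or not entity:
        return 0.0
    return mcc_len(mention, entity) / min(len(mention), len(entity))


print(link_name_similarity("髓纤", "骨髓纤维化"))   # abbreviation of the full name
print(link_name_similarity("mf", "myelofibrosis"))  # acronym, lowercased
```

Under this reading, both the abbreviation “髓纤” and the lowercased acronym “mf” reach a similarity of 1.0 with their full names, which matches the stated ability to handle abbreviations and acronyms.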

3.2.2. Candidate Entity Ranking. This stage aims to acquire the linking entity from the candidate set by ranking with a confidence score. We propose a collaborative inference method which jointly exploits the name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce the entity popularity for discriminating between the candidate entities. Here, we utilize the number of visits of the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Considering that the entity popularity is not the only decisive criterion, we establish a conversion to keep it effective without dominating the other measures. Given the number of visits n, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n| + 1}} \tag{9}$$

in which $|n|$ expresses the number of digits of n. For instance, the above integer is converted into 0.515348.
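The conversion in (9) is easy to check numerically. In the short sketch below, the function name `popularity` is hypothetical; for n = 15348 the digit count is 5, so the value is (5 × 10^5 + 15348) / 10^6.

```python
def popularity(visits: int) -> float:
    """Eq. (9): prefix the digit count to the visit count and scale it down."""
    digits = len(str(visits))
    return (digits * 10 ** digits + visits) / 10 ** (digits + 1)


print(popularity(15348))  # 0.515348
```

The digit count becomes the leading decimal, so a page with more digits in its visit count always scores higher than one with fewer, and the result stays below 1 for counts of up to nine digits.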

The existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To increase the descriptive ability of the context words of a mention, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, this extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute its string similarity with the description content of each candidate by using (4). In particular, $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate. Of note, before the similarity computation, we need to remove the stop words in the context and the description content.

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For such mentions, we add semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of entities cooccurring in a text are also correlated and have overlapping context information. The specific method is as follows: (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with the value 1 or 0) controlling whether the semantic correlation is computed. If the context similarity of every candidate is 0 or the same, λ = 1. If not, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set L of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}$$

The signs “P”, “I”, “D”, and “A” express the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
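The way the four terms of (10) combine, including the switch λ, can be sketched as follows. This is an illustration under assumptions: the inputs are precomputed dictionaries and word collections with hypothetical names, the anchors of the collaborators are pooled by set union, and the toy values in the usage example are made up rather than taken from Figure 3.

```python
def overlap(a, b):
    """Eq. (4): share of the words in a that also occur in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def confidence_scores(name_sim, popularity, context_words, descriptions,
                      anchors, collaborator_anchors):
    """Score every candidate of one mention.

    name_sim, popularity: {candidate: float}
    context_words: words related to the mention (I in Eq. (10))
    descriptions, anchors: {candidate: iterable of words} (D and A)
    collaborator_anchors: one word collection per linked collaborator
    """
    context = {c: overlap(context_words, descriptions[c]) for c in name_sim}
    # lambda = 1 when context similarity cannot separate the candidates.
    lam = 1 if len(set(context.values())) == 1 else 0
    pooled = set().union(*collaborator_anchors)
    return {c: name_sim[c] + popularity[c] + context[c]
               + lam * overlap(anchors[c], pooled)
            for c in name_sim}


scores = confidence_scores(
    name_sim={"a": 1.0, "b": 1.0},
    popularity={"a": 0.4, "b": 0.5},
    context_words=["x"],
    descriptions={"a": ["y"], "b": ["z"]},
    anchors={"a": ["sleep", "nerve"], "b": ["kidney", "urine"]},
    collaborator_anchors=[["sleep", "dream"], ["nerve", "sweat"]],
)
print(max(scores, key=scores.get))
```

In this toy case neither candidate shares context words with the mention, so λ = 1 and the anchors of the collaborators decide in favor of candidate “a”, even though “b” is more popular.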

In order to better understand the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠、多梦、盗汗⋯” (NS⋯ the dysfunction is reflected in insomnia, dreaminess, and night sweats⋯), the recognized ME mentions are “NS”, “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system)”, “NS (nephrotic syndrome)”, and “NS (normal saline)”. Their name similarity is 1, and their other measuring scores are as in Figure 3. It must be noted that we only present part of the entity popularity value to save space. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework: “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus and manually recognize all medical entities, classify them into the six categories in Table 2, and link them to the KB. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we have given it this name to distinguish it), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking. The seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [13] as the comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). In particular, $|S_{link}|$ and $|S_{NIL}|$ express the numbers of ME mentions that are correctly linked to entities in the KB and correctly left unlinked (NIL) by the method, respectively. $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}$$
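For concreteness, the measures above can be computed as in the following sketch. The function names are hypothetical, and predictions and gold annotations are assumed to be given as sets of comparable items.

```python
def precision_recall_f1(predicted, gold):
    """P, R, and F1 of a set of predicted objects against the gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def linking_accuracy(correct_links, correct_nils, total_mentions):
    """Eq. (11): share of mentions linked or marked NIL correctly, in percent."""
    return (correct_links + correct_nils) / total_mentions * 100


print(precision_recall_f1({"a", "b", "c"}, {"a", "b", "d", "e"}))
print(linking_accuracy(60, 20, 100))  # 80.0
```

In the first call two of three predictions are correct and two of four gold items are found, giving P = 2/3, R = 1/2, and F1 = 4/7.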

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for the name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3. This means that if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α is applied to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

Figure 3: Example of linking the ME mention “NS”. For the candidate entities NS (nervous system), NS (nephrotic syndrome), and NS (normal saline), the figure gives the context similarity (C), the entity popularity (P), and the semantic correlation (S) with the collaborator-linking entities (insomnia, dreaminess, and night sweats).

Table 4: Statistics of the corpus.

Corpus | MEs | MEs linking to KB | NIL
Family-doctor | 2531 | 1524 | 1007
Muzhi-doctor | 1876 | 1096 | 780
Qiuyi | 2189 | 1201 | 988

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. In order to evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and online detection for all datasets. Recall improves noticeably after the online detection process. This indicates that online detection is effective in overcoming the coverage limitation of the dictionary-based method. However, the precision is limited, mainly because some irrelevant terms in the candidate set are not filtered by the online detection process. Therefore, in the entity classification stage, we add the filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all medical entities have been extracted correctly from the text and that our task is to classify them into the predefined categories. Table 5 presents the classification results for each corpus. The overall performance reaches an 81.85% precision level and a 75.84% recall level. The recall is lower because, when filtering the nonmedical entities, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom,” “treatment,” and “check” is somewhat low. One possible reason is that these entities are mostly classified based on the context signature similarity; however, the lack of identifying information in the context reduces the similarity score, thus impacting the classification performance.
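As a quick sanity check on the reported figures, the sketch below recomputes F1 from a precision and recall pair. It assumes F1 is the usual harmonic mean of precision and recall; the function name `f1_score` is ours and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "All" row of Table 5 for the Family-doctor corpus, in percent.
family_doctor_f1 = f1_score(82.49, 79.13)
print(round(family_doctor_f1, 2))
```

Under that assumption, the result should agree with the F1 reported for the same row of Table 5 up to rounding.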

Figure 4: Experimental results after offline detection and online detection on the corpus (%). [Bar charts not reproduced. Panels (a)–(c) each report precision, recall, and F1 after offline detection and after online detection; the vertical axis runs from 50 to 100.]


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach on the symptom category only, because we only acquired the seeds of the symptom category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER’s precision is slightly lower. One possible reason is that the symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs. Therefore, they are mainly recognized by the online detection method. However, the combined mentions produce diverse search results, from which it is difficult to get a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its descriptions, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall. The F1 value of unMER increases by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows: (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider the nested entities, thereby reducing the recall. In addition, the chunker utilizes a common NLP tool, which has poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, and this results in semantic deviation for the BM-NER approach, reducing its classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that for the body category, we do not have the features of Dic-CRF and hence do not present its result. On the three corpora, the F1 value of unMER is higher than that of the Dic-CRF approach by 15.01%, 13.68%, and 12.68%, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation; this can easily lead to incorrect segmentation, especially for the combined entities. In addition, the defined features have low coverage in all entity types, which is also a reason for the low recall. Moreover, the informal description of the online medical text also reduces the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER has good recognition performance on the nested entities.

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the cylinder with bias represents unMER, and the other cylinder represents bubble-bootstrapping). [Bar chart not reproduced. It reports precision, recall, and F1 on the Family-doctor, Muzhi-doctor, and Qiuyi corpora.]

Table 5: Entity classification results on the corpus (%).

Entity category   Family-doctor           Muzhi-doctor            Qiuyi
                  P      R      F1        P      R      F1        P      R      F1
All               82.49  79.13  80.78     80.91  73.86  77.22     82.16  74.35  78.06
Body              85.17  81.20  83.14     83.62  80.13  81.84     83.53  80.21  81.84
Disease           80.45  82.16  81.30     81.96  82.67  82.31     79.26  81.68  80.45
Symptom           78.19  60.84  68.43     74.54  61.73  67.53     76.26  61.52  68.10
Medicine          82.31  79.62  80.94     80.63  75.84  78.16     84.57  78.39  81.36
Treatment         76.86  67.25  71.73     75.24  63.59  68.93     76.61  61.82  68.42
Check             75.14  65.53  70.01     74.50  67.26  70.70     73.48  63.72  68.25


approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy value increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. This is possibly due to the similar relationship in the KB between the mentions within the specific window. QCV essentially uses the context similarity for linking. Therefore, the noise and lack of information in the context reduce its linking performance. In contrast, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct the misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL utilizes fuzzy string matching to generate candidate entities, which omits some target entities that are fully different in surface form, reducing the linking recall.

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment conducted with our recognized entities. Compared to the above linking results, both the precision and recall show some decline. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents BM-NER). [Bar charts not reproduced. Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel reports precision, recall, and F1 for the categories check, body, disease, treatment, medicine, and symptom.]

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents Dic-CRF). [Bar charts not reproduced. Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel reports precision, recall, and F1 for the categories body, disease, check, medicine, symptom, and treatment.]


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described an unsupervised framework for recognizing and linking medical entities from Chinese online medical text, namely, unMERL. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text with both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings Amia Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

Corpus          P      R      F1     A
Family-doctor   82.64  73.26  77.67  83.23
Muzhi-doctor    83.37  74.41  78.64  83.48
Qiuyi           82.15  72.79  77.19  82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the cylinder with bias represents unMEL, and the other cylinder represents QCV). [Bar charts not reproduced. Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel reports precision, recall, accuracy, and F1 for QCV and unMEL; the vertical axis runs from 50 to 100.]


[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


graph-based approaches have a high computation cost, and the imperfection of the KB also impacts the performance of the linking.

Our linking approach jointly considers multiple kinds of entity knowledge, which makes it more accurate in distinguishing and determining the candidate entities while keeping computational costs lower. Moreover, our solution adds a step that extracts the relevant context to solve the noise problem. To optimize the local KB, we also introduce an incremental evidence mining process with the third KB. Entity linking in the Chinese medical domain has been studied less than entity linking in the English medical domain. To our knowledge, our linking approach is the first solution for Chinese online medical text.

3. UnMERL

The framework of unMERL is shown in Figure 1. unMERL consists of two modules. The ME recognition module consumes an input corpus and performs entity boundary detection and entity classification. Its output is a set of medical entities and their categories. For each recognized medical entity, the ME linking module generates the candidates from the KB and then acquires the target object by ranking them.

3.1. Medical Entity Recognition. The ME recognition module aims to detect and classify all ME mentions in the input corpus. Named entity recognition (NER) [24] involves two main steps: detecting entity boundaries and classifying the entities into predefined categories. Accordingly, our ME recognition module is implemented as a sequence of two separate processes: boundary detection and entity classification.

3.1.1. Boundary Detection. This step detects the boundaries of medical entities and collects candidates for entity classification. In our solution, unMERL exploits a knowledge-driven method, mapping the input text to concepts in the lexical resources. Compared to the existing dictionary-based approaches, our approach differs in the following ways: (1) The entity candidates are identified based on the dependency relationships between words. This strategy can capture the combined and nested entities well and reduces the computational cost of the subsequent process by downsizing the candidate set. (2) The search engine is included as a lexical resource, which removes the restriction to the limited terms in the dictionary and performs well in detecting variational and rare entity names. The detection process is roughly divided into two stages: candidate entity generation and medical entity detection.

(1) Candidate Entity Generation. Through corpus analysis, we find that a long medical entity is usually segmented into several fragments by a common natural language processing tool. The POS tag of each fragment is included in Table 1. Moreover, these fragments generally have an attributive dependency relationship. For example, the text “骨髓纤维化简称髓纤，是一种骨髓增生性疾病，武汉协和医院有很好的治疗效果” (myelofibrosis, or MF in brief, is a myeloproliferative disease for which Wuhan Concorde Hospital has a very good therapeutic effect), parsed by the HanLP dependency parser (http://hanlp.linrunsoft.com), is shown in Figure 2. The dependency labels are shown in Table 1. Based on the hypothesis that entities should be noun phrases (NPs), we extract native NPs from the automatically parsed dependency trees as candidate entities. A native NP is a single noun (without attributive modifiers) or a maximal noun phrase with the POS tags in Table 1 and the dependency “ATT.” The candidate entities extracted from the above text are shown in the third row of Table 1.
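The native-NP rule above can be sketched as a left-to-right scan over parsed tokens. This is only an illustration under stated assumptions: the token tuples, the toy parse, and the helper names are hypothetical and do not reflect HanLP's actual output format, and the scan assumes that attributive modifiers sit directly before their head noun.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, dependency label), hypothetical layout

# POS constraints following Table 1; "n*" and "gb*" are treated as prefix matches.
ALLOWED_EXACT = {"f", "m", "b", "rr", "v"}
ALLOWED_PREFIXES = ("n", "gb")


def pos_allowed(pos: str) -> bool:
    return pos in ALLOWED_EXACT or pos.startswith(ALLOWED_PREFIXES)


def is_noun(pos: str) -> bool:
    return pos.startswith("n")


def native_noun_phrases(tokens: List[Token]) -> List[str]:
    """Merge runs of ATT modifiers with the noun that closes them."""
    phrases: List[str] = []
    buffer: List[str] = []
    for word, pos, deprel in tokens:
        if deprel == "ATT" and pos_allowed(pos):
            buffer.append(word)                        # modifier: keep extending
        elif is_noun(pos):
            phrases.append("".join(buffer + [word]))   # head noun closes the phrase
            buffer = []
        else:
            buffer = []                                # anything else breaks the run
    return phrases


# Toy parse of the first clause of the example sentence (labels are assumed).
toy_tokens = [
    ("骨髓", "n", "ATT"),
    ("纤维化", "nz", "SBV"),
    ("简称", "v", "HED"),
    ("髓纤", "n", "VOB"),
]
print(native_noun_phrases(toy_tokens))
```

A real implementation would follow the head indices in the dependency tree rather than rely on adjacency; the scan is kept deliberately simple here.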

However, not all noun phrases are medical entities. To remove the nonmedical NPs, we employ a knowledge-driven method whose aim is to discover the concepts in the lexical resources that are referred to in the text. Here, we use Chinese SNOMED-CT [25], the medical KB of Baidu Baike (https://baike.baidu.com/science/medical), and the Sogou medical dictionaries (http://pinyin.sogou.com/dict/cate/index/132?rf=dictindex) as the offline lexical resources (LRs). To mitigate the limited coverage of the above resources, we also use Baidu Search (https://www.baidu.com) as an online lexical resource to help recognize the medical entities.

(2) LR Description. As mentioned before, Chinese SNOMED-CT, translated from SNOMED-CT (English), is a standard of clinical medicine and contains more than 140,000 clinical terms covering most aspects of clinical information. To correct incorrect terms in the translated version, we add the

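Figure 1: Architecture of unMERL. [Flow diagram not reproduced. The corpus feeds the ME recognition module, which consists of boundary detection (candidate entity generation, offline detection, and online detection) and entity classification (seed term collection, signature generation, and category decision). The ME linking module consists of candidate entity generation and candidate entity ranking by collaborative inference, and produces the results. Lexical resources, the knowledge base, and the search engine are the supporting resources; arrows distinguish control flow from data flow.]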


medical KB of Baidu Baike and the Sogou medical dictionaries. Baidu Baike contains more than 25,000 medical terms edited by authoritative organizations and experts. The Sogou medical dictionaries, as the lexicon resource of Sogou’s input method, collect data from multiple medical websites. Baidu Search is the largest Chinese search engine, with which we can obtain information that standardized LRs do not cover, such as emerging, rare, and variational medical entities.

(3) LR Preprocessing. Considering the heterogeneity and redundancy of the above offline LRs, we extract and fuse the medical terms from them to build a dictionary. In particular, we select specific categories of interest, which are also the target categories of our entity classification. Table 2 presents the statistics of the self-built dictionary. In addition, to improve retrieval efficiency, we build indices using the first phonetic letter of each term.

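\[
t_m = \mathrm{maxOccur}(r_1, r_2, \ldots, r_j), \qquad
r_j =
\begin{cases}
\mathrm{LCS}(t_k, s_k),\ k \in K, & \text{if } \mathrm{Len}(\mathrm{LCS}(t_k, s_k)) > 1, \\
\mathrm{LCS}(t_k, t_i),\ k, i \in K,\ k \neq i, & \text{otherwise}.
\end{cases}
\tag{1}
\]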

Given a candidate entity, the results returned by Baidu Search contain not only the objective medical term but also other noise that impacts the performance of entity recognition. Therefore, we need to process the search results to obtain a clean medical term. Based on the common knowledge that there are more correct results than incorrect results, the method is implemented based on corpus statistics and “LCS” (a function that returns the longest common substring), as shown in (1). Given the search result set $S = \{(t_k, s_k)\}_{k=1}^{K}$ ($t_k$ represents a title and $s_k$ represents a summary), we first get the kernel term from each result by using the “LCS” function. However, not all summaries contain the kernel terms in the titles. Therefore, we add the process $\mathrm{LCS}(t_k, t_i)$ for the search results without a common substring. Finally, in the kernel term set, we select the most frequent term $t_m$ as the correct medical term.
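The kernel-term selection in (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the toy search results are hypothetical, and the fallback compares a title against every other title, which is one reading of the $\mathrm{LCS}(t_k, t_i)$ case.

```python
from collections import Counter
from typing import List, Tuple


def lcs(a: str, b: str) -> str:
    """Longest common substring of a and b, by dynamic programming."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def kernel_term(results: List[Tuple[str, str]]) -> str:
    """Pick the most frequent kernel term over (title, summary) pairs."""
    kernels: List[str] = []
    for k, (title, summary) in enumerate(results):
        r = lcs(title, summary)
        if len(r) > 1:
            kernels.append(r)
            continue
        # Fallback (assumed reading): compare this title with the other titles.
        for i, (other_title, _) in enumerate(results):
            if i != k:
                r = lcs(title, other_title)
                if len(r) > 1:
                    kernels.append(r)
    if not kernels:
        return ""
    return Counter(kernels).most_common(1)[0][0]


# Hypothetical search results for a candidate; the last one is noise.
toy_results = [
    ("头孢拉定胶囊说明书", "头孢拉定是一种抗生素"),
    ("头孢拉定的用法", "口服头孢拉定需遵医嘱"),
    ("感冒吃什么药", "建议多喝水休息"),
]
print(kernel_term(toy_results))
```

Because the majority of results share the same kernel, the noisy third result does not affect the selected term in this toy case.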

In addition, because the search engine cannot by itself filter the nonmedical entities in the candidate set, we establish a medical keyword set. This includes “医” (medicine), “药” (drug), “病” (disease), and “症” (symptom). If a search result contains one or more keywords in the above set, we identify the candidate as a medical entity. If not, it is removed as a nonmedical entity.
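A minimal sketch of this keyword filter is given below. The keyword set comes from the text; the function name and the example snippets are illustrative assumptions.

```python
from typing import Iterable

# Keyword set listed in the text: medicine, drug, disease, symptom.
MEDICAL_KEYWORDS = ("医", "药", "病", "症")


def is_medical(search_results: Iterable[str]) -> bool:
    """True if any search result contains at least one medical keyword."""
    return any(
        keyword in result
        for result in search_results
        for keyword in MEDICAL_KEYWORDS
    )


print(is_medical(["头孢拉定是一种抗生素药物"]))
print(is_medical(["今天天气很好"]))
```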

(4) Medical Entity Detection. Once the medical terms are acquired from the offline and online LRs, the medical entities can be detected from the candidate set. Based on the different characteristics of the LRs (the offline LRs have high accuracy but limited coverage, and the online LR has high coverage but lower accuracy), we divide the detection process into offline and online processes. Given the candidate set, unMERL first performs the offline detection with the self-built dictionary. For the candidates output as nonmedical, unMERL then performs the online detection with the Baidu search engine. Here, we exploit string matching and the text distance constraint to implement the detection process.

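\[
\mathrm{Sim}_r(t_{cm}, t_m) = \frac{\mathrm{Len}(\mathrm{LCS}(t_{cm}, t_m))}{\min(\mathrm{Len}(t_{cm}), \mathrm{Len}(t_m))}
\tag{2}
\]

\[
\mathrm{Dis}(t_{cm}, t_m) = \left| \mathrm{Loc}(d_{t_{cm}}, t_{cm}) - \mathrm{Loc}(d_{t_{cm}}, t_m) \right|
\tag{3}
\]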

Table 1: Constraints on POS tags, description of dependency labels, and candidate entities of the sentence in Figure 2.

Notation: POS tags
Description: f (preposition of locality), m (measure word), b (distinguishing word), rr (personal pronoun), v (verbal word), gb* (word related to biology), or n* (noun)

Notation: Dependency labels
Description: HED (head), SBV (subject-verb), VOB (verb-object), ATT (attribution), COO (coordination), RAD (right adjunct)

Notation: Candidate list
Description: 骨髓纤维化 (myelofibrosis), 髓纤 (MF), 骨髓增生性疾病 (myeloproliferative disease), 武汉协和医院 (Wuhan Concorde Hospital), 治疗效果 (therapeutic effect)

Figure 2: An example sentence with dependency parsing and POS tagging. [Parse diagram not reproduced. Dependency arcs are labeled HED, ATT, SBV, VOB, COO, and RAD; the POS tag sequence from the root is n nz v n n wp v m q n v nz wp ni v a u v n.]

Table 2: Statistics of the user-defined dictionary.

Category    Term number
Body        1802
Disease     48120
Symptom     3698
Medicine    42047
Treatment   7403
Check       768


Given a candidate $t_{cm}$ and a medical term $t_m$ in the LRs, we use the ratio of the length of their longest common substring to the length of the shorter term as their name similarity, as in (2). This similarity computation can capture the nested entities. For example, for a candidate “胸肌内膜炎” (endomysitis), if $t_m$ is “胸肌” (chest muscle), we can regard “胸肌” (chest muscle) as a nested entity. In addition, considering that some terms and their fractional terms exist together in a text, we use the text distance constraint to improve the detector’s accuracy. For example, suppose a text contains both “头孢” (cephalosporin) and “头孢拉定” (cefradine), and the compared medical term is “头孢拉定” (cefradine). For the candidate “头孢” (cephalosporin), if the text distance constraint is not used, the output medical entity is “头孢拉定” (cefradine). Obviously, this is the incorrect surface form for the candidate “头孢” (cephalosporin). In (3), the symbol $d_{t_{cm}}$ represents the text containing the candidate entity. The function “Loc” computes the location of its second parameter in its first parameter. Using the above two equations, the specific detection process is given in Algorithm 1.

In Algorithm 1, the input includes the candidate entity set, the medical term set from the offline and online LRs, and the input text. The output is a set of medical entities. Given a candidate, we first compute its name similarity with each medical term in MT. If they are the same, the candidate is regarded as a medical entity. If not, we select the medical terms exceeding the predetermined similarity threshold θ for the text distance calculation. For each medical term, ranked by name similarity, if the text distance between the medical term and the candidate is under the threshold δ, the medical term is output as the correct expression of this candidate. In addition, considering the existence of misspelled ME names, we add the “Diff” function to recognize them. This involves counting the number of different characters between a candidate and the medical term with the highest name similarity. If the number is less than the threshold ϵ (in our experiments, it is set to half the number of characters in the medical term), we output the medical term as the correct expression of this candidate. For example, for a candidate “头孢拉丁” (cefradine), the compared medical term “头孢拉定” (cefradine) meets the above condition. Therefore, we output this medical term instead of the candidate.
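The detection logic described above can be sketched for a single candidate as follows. This is a simplified illustration that follows the prose rather than the exact control flow of Algorithm 1. Several details are our assumptions: `Loc` is taken to be the index of the first occurrence, the distance is taken as an absolute difference, and `diff` counts position-wise mismatches plus the length difference. All function names are hypothetical.

```python
from typing import List, Optional


def lcs_len(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def name_similarity(candidate: str, term: str) -> float:
    """Eq. (2): LCS length divided by the length of the shorter string."""
    shorter = min(len(candidate), len(term))
    return lcs_len(candidate, term) / shorter if shorter else 0.0


def text_distance(text: str, candidate: str, term: str) -> Optional[int]:
    """Eq. (3), assuming Loc is the first-occurrence index; None if either is absent."""
    loc_c, loc_t = text.find(candidate), text.find(term)
    if loc_c < 0 or loc_t < 0:
        return None
    return abs(loc_c - loc_t)


def diff(candidate: str, term: str) -> int:
    """Assumed Diff: position-wise mismatches plus the length difference."""
    mismatches = sum(1 for x, y in zip(candidate, term) if x != y)
    return mismatches + abs(len(candidate) - len(term))


def detect(candidate: str, terms: List[str], text: str,
           theta: float = 0.5, delta: int = 3) -> Optional[str]:
    """Return the medical entity for one candidate, or None if nonmedical."""
    if candidate in terms:
        return candidate                      # exact dictionary match
    ranked = sorted(
        (t for t in terms if name_similarity(candidate, t) > theta),
        key=lambda t: name_similarity(candidate, t),
        reverse=True,
    )
    for term in ranked:
        d = text_distance(text, candidate, term)
        if d is not None and d < delta:
            return term                       # close enough in the text
    if ranked and diff(candidate, ranked[0]) < len(ranked[0]) / 2:
        return ranked[0]                      # likely misspelling of the best term
    return None


print(detect("头孢拉丁", ["头孢拉定", "阿莫西林"], "医生给我开了头孢拉丁"))
```

The default thresholds mirror the experimental settings reported in the paper (θ = 0.5, δ = 3), and the misspelling threshold is half the length of the compared term.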

3.1.2. Entity Classification. Entity categories are additional information for characterizing the entities mentioned. They are essential ingredients in many medical applications, such as medical dictionaries, medical KBs, and medical service systems. Our classification approach is partly inspired by the use of seed knowledge and context signature similarity in [8]. Our approach differs from the classification approach in [8] in the following four ways: (1) In the collection of seed terms, we use the framework information in the terminology instead of the category tags, reducing classification error. Meanwhile, we classify some ME mentions based on text feature computation, thereby avoiding the constraint of dissimilar context and the lack of context. (2) Signature vector computation is refined through word embedding, which can measure the semantic similarity better than the TF-IDF method. (3) The filtering threshold is automatically generated by averaging the signature similarity of seed terms, thereby reducing labor costs and increasing filtering accuracy. (4) The seed set is scaled up continually to improve coverage. The classification is implemented by applying the following three steps: seed term collection, signature generation, and category decision.

(1) Seed Term Collection. This step involves collecting seed terms for the entity categories, based on which the signature vectors of the categories will be generated in the subsequent step. Here, we utilize Baidu Baike to automatically gather the seed terms. Through an in-depth analysis, we find that the medical entities of the same class have similar framework information in Baidu Baike, which is more accurate than only using the category tag in identifying the entity category. Therefore, we design a text feature computation-based seed collection approach. Here, we define $T = \{s, a, d, c\}$ as the set of text features, with a subtitle “s,” the attribute names “a” of the infobox, the directory names “d” of the content, and the category tags “c” in the entry page of Baidu Baike. The approach is implemented as follows: (1) From the self-built dictionary, we randomly select 50 terms from each category to extract and fuse their text features as the category signature. In particular, we exploit the perfect string matching algorithm to produce unambiguous Baidu Baike entries. (2) For each candidate, we also crawl the feature information from Baidu Baike. Then, we calculate its string similarity with all category signatures by using (4) and classify this candidate into the category with the highest similarity. In particular, the symbols $W_c$ and $W_{cm}$ represent the word sets of a category signature and of the feature information for a candidate, respectively. Finally, the classified candidate entities are used as the seeds for the category signature computation in the next step.
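The feature-overlap classification in step (2) can be sketched as below, assuming the similarity in (4) is the share of the candidate's feature words that also appear in the category signature. The category names and word sets are invented for illustration and are not taken from Baidu Baike.

```python
from typing import Dict, Set


def feature_similarity(candidate_words: Set[str], category_words: Set[str]) -> float:
    """Share of the candidate's feature words found in the category signature."""
    if not candidate_words:
        return 0.0
    return len(candidate_words & category_words) / len(candidate_words)


def classify(candidate_words: Set[str], signatures: Dict[str, Set[str]]) -> str:
    """Assign the candidate to the category with the highest similarity."""
    return max(
        signatures,
        key=lambda name: feature_similarity(candidate_words, signatures[name]),
    )


# Hypothetical category signatures built from fused text features.
toy_signatures = {
    "disease": {"病因", "症状", "治疗", "预防"},
    "medicine": {"成分", "用法用量", "不良反应", "禁忌"},
}
toy_candidate = {"病因", "症状", "就诊科室"}
print(classify(toy_candidate, toy_signatures))
```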

Input: candidate set C, medical term set MT, text set T
Output: medical entity set ME

    for c_i ∈ C, mt_j ∈ MT, t_z ∈ T do
        if c_i == mt_j then
            ME ← c_i
        end if
        set M = ∅
        if Sim(c_i, mt_j) > θ then
            M ← Rank(mt_j)
        end if
        for m_k ∈ M do
            if t_z contains m_k and Dis(c_i, m_k) < δ or Diff(c_i, m_k) < ϵ then
                ME ← m_k
                break
            end if
        end for
    end for
    return ME

Algorithm 1: Medical entity detection.


$$\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|} \tag{4}$$
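The text feature similarity in (4) and the highest-similarity assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy feature words are hypothetical, and the crawling of Baidu Baike features is assumed to have been done elsewhere.

```python
def text_feature_similarity(candidate_words, category_words):
    """Eq. (4): share of the candidate's feature words that also occur
    in the category signature, |W_cm ∩ W_c| / |W_cm|."""
    w_cm = set(candidate_words)
    w_c = set(category_words)
    if not w_cm:
        return 0.0  # no features crawled for this candidate
    return len(w_cm & w_c) / len(w_cm)


def classify_by_text_features(candidate_words, category_signatures):
    """Return (category, score) for the category signature that is most
    similar to the candidate's text features."""
    scores = {
        category: text_feature_similarity(candidate_words, words)
        for category, words in category_signatures.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical feature words, for illustration only.
signatures = {
    "disease": {"cause", "symptom", "treatment", "prevention"},
    "medicine": {"ingredient", "dosage", "indication", "manufacturer"},
}
print(classify_by_text_features(["cause", "symptom", "diet"], signatures))
```

Because (4) normalizes by the size of the candidate's own word set, a candidate with few crawled features can reach a high score from a small overlap, which is one reason a threshold is applied later.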

(2) Signature Generation. This step transforms the medical terms (including candidates and seeds) and the categories into signature vectors. We use the phrase “term signature” to denote the vector of an ME mention or a seed term. Because the internal words of a term are themselves descriptive, we use both the internal words and the context words for signature generation. To capture the semantic similarity between words, we use a word embedding approach to calculate the vector of a word. Specifically, we use the Word2Vec model, a distributed representation model that expresses the words in text as vectors based on deep learning technology [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase “category signature” to denote the vector of an entity category. It is computed by averaging the signature vectors of all seed terms belonging to the same class, following (5).

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbols are described in Table 3. The similarity between vectors is the cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. The filtering threshold is computed automatically by averaging the signature similarity of seed terms, following (7). In (7), $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C^2_{|c|}$ is the combination function, counting the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add the classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

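$$v_c = \frac{1}{|S|} \sum_{m \in S} v_m \tag{5}$$

$$\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^2} \times \sqrt{\sum_{i=1}^{I} v_{bi}^2}} \tag{6}$$

$$F(v_i, v_j) = \frac{1}{C^2_{|c|}} \sum_{i,j=1,\, i \neq j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j) \tag{7}$$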
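Equations (5) to (7) can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: vectors are ordinary lists of floats rather than trained Word2Vec embeddings, the function names are hypothetical, and the threshold is taken as the mean over unordered seed pairs, which is how we read "the number of combinations of any two seeds".

```python
import math
from itertools import combinations


def average_vectors(vectors):
    """Eq. (5): component-wise mean of a non-empty list of vectors."""
    vectors = list(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Eq. (6): cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def filtering_threshold(seed_signatures):
    """Eq. (7), read as the mean cosine similarity over all unordered
    pairs of seed signatures of one category (assumption)."""
    pairs = list(combinations(seed_signatures, 2))
    if not pairs:
        raise ValueError("at least two seed signatures are required")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy two-dimensional "embeddings", for illustration only.
seeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(average_vectors(seeds), filtering_threshold(seeds))
```

A term signature would be obtained by calling `average_vectors` on the word vectors of a term's internal and context words, and a category signature by calling it on the signatures of that category's seeds.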

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as the basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top candidate (after ranking) as the linking entity. The mentions without linking entities are regarded as NIL.

3.2.1. Candidate Entity Generation. In this stage, our goal is to increase the probability that the candidate set contains the target entity while controlling its size. To accomplish the first goal, we use a fuzzy string matching algorithm to compute the name similarity between a mention and all entities in the KB, in accordance with (8). The function “MCC” acquires the most common characters between two strings, in order. It handles abbreviations and acronyms well, in addition to standard names. The entities exceeding the similarity threshold α are included in the candidate set. However, this algorithm may result in a large candidate set.

Table 3: Symbol description in Algorithm 2.

Symbol | Description
M | A set containing each candidate entity m_i and its signature vector s_mi
F | A threshold set filtering the nonmedical entities
t_k | A seed signature set of the same class
s_a, s_b | Seed signature
c_j, c_t | Category name
s_cj, s_ct | Category signature of c_j or c_t
f_cj | Filtering threshold of c_j

Input: candidate set M, seed signature set T, category signature set C
Output: medical entity-category set E

    for (m_i, s_mi) ∈ M do
        set F = ∅, D = ∅
        for t_k ∈ T, (s_a, s_b) ∈ t_k do
            F ← F(s_a, s_b)
        end for
        for s_cj ∈ C, f_cj ∈ F do
            if Sim_cos(s_mi, s_cj) > f_cj then
                D ← (Sim_cos(s_mi, s_cj), c_j)
            end if
        end for
        if D ≠ ∅ then
            c_t ← arg max(D)
            E ← (m_i, c_t)
            T ← m_i
            update s_ct ∈ C by (5)
        end if
    end for
    return E

Algorithm 2: Medical entity classification.
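A runnable reading of Algorithm 2 is sketched below. It is an interpretation rather than the authors' code: the helper names are hypothetical, signatures are plain lists of floats, the category signature and filtering threshold are recomputed from the current seed set on every pass (which realizes the "update" step), and the threshold averages over unordered seed pairs.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity, as in Eq. (6); zero vectors score 0."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean, as in Eq. (5)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def threshold(seeds):
    """Mean pairwise seed similarity, as in Eq. (7); needs >= 2 seeds."""
    pairs = list(combinations(seeds, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def classify_candidates(candidates, seed_sets):
    """candidates: list of (mention, signature) pairs.
    seed_sets: dict mapping category -> list of seed signatures; it is
    extended in place whenever a candidate is accepted."""
    result = {}
    for mention, signature in candidates:
        passed = []  # (similarity, category) pairs above the threshold
        for category, seeds in seed_sets.items():
            score = cosine(signature, mean_vector(seeds))
            if score > threshold(seeds):
                passed.append((score, category))
        if passed:  # otherwise the candidate is filtered out
            _, best = max(passed)
            result[mention] = best
            seed_sets[best].append(signature)  # grow the seed set
    return result


# Toy signatures, for illustration only.
seed_sets = {
    "disease": [[1.0, 0.0], [0.9, 0.1]],
    "symptom": [[0.0, 1.0], [0.1, 0.9]],
}
candidates = [("x", [1.0, 0.05]), ("noise", [1.0, 1.0])]
print(classify_candidates(candidates, seed_sets))
```

In this toy run the second candidate lies between the two categories and falls below both thresholds, which mirrors how the filtering threshold is meant to discard nonmedical terms.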


To reduce the computational cost of the subsequent processing, we introduce a category consistency condition to control the size of the candidate set. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute its similarity with the category signatures acquired in the seed term collection step, following (4). The candidates under a predefined threshold β are removed from the candidate set. This strategy still handles terms that have the same name but different meanings. For example, for the ME mention “传染病” (epidemic), the candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{|\mathrm{MCC}(t_{me}, t_e)|}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
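The name similarity in (8) can be sketched as follows. The paper describes MCC only as acquiring "the most common characters between two strings in order"; the sketch assumes this is the longest common subsequence, which is one reasonable reading but is not confirmed by the text. The function names are hypothetical, and the default α = 0.5 is the value reported in the experiments.

```python
def mcc_length(a, b):
    """Length of the longest common subsequence of a and b, used here
    as an assumed stand-in for |MCC(a, b)|."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def name_similarity(mention, entity):
    """Eq. (8): shared in-order characters over the shorter length."""
    shortest = min(len(mention), len(entity))
    if shortest == 0:
        return 0.0
    return mcc_length(mention, entity) / shortest


def generate_candidates(mention, entity_names, alpha=0.5):
    """Keep the KB entity names whose similarity exceeds alpha."""
    return [e for e in entity_names if name_similarity(mention, e) > alpha]


print(generate_candidates("髓纤", ["骨髓纤维化", "传染病"]))
```

Normalizing by the shorter string is what lets an abbreviation such as “髓纤” score 1 against its full name, and it is also why the candidate set can become large before the category consistency filter is applied.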

3.2.2. Candidate Entity Ranking. This stage aims to acquire the linking entity from the candidate set by ranking the candidates with a confidence score. We propose a collaborative inference method, which jointly exploits the name similarity, the entity popularity, the context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce the entity popularity to discriminate between the candidate entities. Here, we use the number of visits to the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Because the entity popularity is not the only decisive criterion, we apply a conversion that keeps it effective without letting it dominate the other measures. Given the visiting number n, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n|+1}} \tag{9}$$

in which $|n|$ expresses the number of digits of n. For instance, the above integer is translated into 0.515348.
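As a quick check of (9), the conversion can be written in a few lines of Python. The function name is hypothetical, and the formula is the form of (9) that reproduces the worked example in the text (15348 becomes 0.515348).

```python
def entity_popularity(visits: int) -> float:
    """Eq. (9): the digit count becomes the first decimal place and the
    visit count itself fills the following places."""
    if visits <= 0:
        raise ValueError("the visit count must be a positive integer")
    digits = len(str(visits))  # |n|, the number of digits of n
    return (digits * 10**digits + visits) / 10 ** (digits + 1)


print(entity_popularity(15348))
```

The score grows with the visit count and stays below 1 for counts of up to nine digits, so popularity can separate candidates without outweighing a name similarity of 1.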

The existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To make the context words of a mention more descriptive, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, it extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute its string similarity with the description content of each candidate by using (4). Here, $W_{cm}$ and $W_c$ represent the context word set of a mention and the description word set of a candidate, respectively. Note that, before the similarity computation, we remove the stop words from the context and the description content.

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For such mentions, we add semantic correlation knowledge to the ranking, based on the hypothesis that the linking entities of the entities cooccurring in a text are also correlated and have overlapping context information. The specific method is as follows: (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) that controls whether the semantic correlation is computed: if the context similarity of every candidate is 0 or the same, λ = 1; otherwise, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set L of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}$$

The signs “P”, “I”, “D”, and “A” express the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
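The way the four terms of (10) combine can be sketched as follows. This is an illustration under assumptions: the name similarity and popularity are passed in as precomputed numbers, the word sets are toy data, the collaborators' anchor sets are pooled by set union (our reading of the garbled operator in the extracted equation), and all function names are hypothetical.

```python
def set_similarity(first, second):
    """Eq. (4) applied to two word collections: |first ∩ second| / |first|."""
    first, second = set(first), set(second)
    return len(first & second) / len(first) if first else 0.0


def semantic_switch(context_similarities):
    """λ: 1 when every candidate has the same context similarity
    (including all zeros), else 0."""
    return 1 if len(set(context_similarities)) <= 1 else 0


def confidence_score(name_sim, popularity, context_words, description_words,
                     candidate_anchors, collaborator_anchor_sets, lam):
    """Eq. (10): name similarity + popularity + context similarity,
    plus the semantic correlation term when lam == 1."""
    score = name_sim + popularity
    score += set_similarity(context_words, description_words)
    if lam:
        pooled = set().union(*collaborator_anchor_sets)
        score += set_similarity(candidate_anchors, pooled)
    return score


# Toy inputs, for illustration only.
lam = semantic_switch([0.01, 0.01, 0.01])
print(confidence_score(1.0, 0.5, {"a", "b"}, {"a"},
                       {"x", "y", "z", "w"}, [{"x"}, {"q"}], lam))
```

Because λ is decided once per mention from the context similarities of all its candidates, the semantic term is either added for every candidate or for none, so it acts as a tie-breaker rather than a fourth always-on feature.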

To illustrate the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠, 多梦, 盗汗…” (NS… the dysfunction is reflected in insomnia, dreaminess, and night sweats…), the recognized ME mentions are “NS”, “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system)”, “NS (nephrotic syndrome)”, and “NS (normal saline)”. Their name similarity is 1, and their other measuring scores are shown in Figure 3. Note that we present only part of the entity popularity value to save space. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. To evaluate our proposed framework, we crawl 5000 medical Q&A text records from three Chinese medical websites: “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus and manually recognize all medical entities, classify them into the six categories in Table 2, and link them to the KB. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we give it this name to distinguish it), as the comparative methods. For the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking, and the seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [23] as the comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). Here, $|S_{link}|$ and $|S_{NIL}|$ denote the numbers of ME mentions that the method correctly links to entities in the KB and correctly leaves unlinked (NIL), respectively, and $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}$$
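The measures above are straightforward to compute. The sketch below assumes that the counts of correct, returned, and gold-standard objects are already available, and the function names are hypothetical.

```python
def precision_recall_f1(num_correct, num_returned, num_gold):
    """P, R, and F1 from raw counts, as defined in Section 4.2.2."""
    p = num_correct / num_returned if num_returned else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def f1_from(p, r):
    """F1 from precision and recall given on the same scale."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def linking_accuracy(correct_links, correct_nils, total_mentions):
    """Eq. (11): correctly linked plus correctly NIL mentions, in percent."""
    return 100 * (correct_links + correct_nils) / total_mentions


print(precision_recall_f1(8, 10, 16))
print(round(f1_from(82.49, 79.13), 2))
print(linking_accuracy(60, 20, 100))
```

As a sanity check on the reported tables, feeding P = 82.49 and R = 79.13 (the "All" row for Family-doctor in Table 5) into `f1_from` gives about 80.78, which agrees with the F1 listed there.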

Figure 3: Example of linking the ME mention “NS”. (The figure relates the mention “NS” and its candidate entities “NS (nervous system)”, “NS (nephrotic syndrome)”, and “NS (normal saline)” to the collaborator linking entities and their description content. Legend: C, context similarity; P, entity popularity; S, semantic correlation.)

Table 4: Statistics of the corpus.

Corpus | MEs | MEs linking to KB | NIL
Family-doctor | 2531 | 1524 | 1007
Muzhi-doctor | 1876 | 1096 | 780
Qiuyi | 2189 | 1201 | 988

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for the name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3; that is, if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. The threshold α applies to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and after online detection for all datasets. Recall improves noticeably after the online detection process, which indicates that online detection is effective in overcoming the coverage limitation of the dictionary-based method. However, the precision is limited, mainly because some irrelevant terms in the candidate set are not filtered out by the online detection process. Therefore, in the entity classification stage, we add the filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all medical entities have been extracted correctly from the text and that our task is to classify them into the predefined categories. Table 5 presents the classification results for each corpus. The overall performance reaches 81.85% precision and 75.84% recall. The recall is lower because, when the nonmedical entities are filtered, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom”, “treatment”, and “check” is somewhat low. One possible reason is that these entities are mostly classified based on the context signature similarity, and the lack of identifying information in the context reduces the similarity score, thus impacting the classification performance.

Figure 4: Experimental results after offline detection and online detection on the corpus (%). (Three panels, (a) to (c); each plots precision, recall, and F1 after offline detection and after online detection.)

(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach. The comparison is restricted to the symptom category because we only acquired the seeds of the symptom category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER’s precision is slightly lower. One possible reason is that symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs; they are therefore mainly recognized by the online detection method. However, combined mentions produce diverse search results, from which it is difficult to obtain a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its description, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall. The F1 value of unMER increases by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows: (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities, thereby reducing the recall. In addition, the chunker relies on a general-purpose NLP tool, which has poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We use a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, which results in semantic deviation for the BM-NER approach and reduces its classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that, for the body category, we do not have the features of Dic-CRF and hence do not present its result. On the three corpora, the F1 value of unMER is higher than that of the Dic-CRF approach by 15.01%, 13.68%, and 12.68%, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation, which can easily lead to incorrect segmentation, especially for combined entities. In addition, the defined features have low coverage across the entity types, which is another reason for its low recall. Moreover, the informal description of the online medical text also reduces the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER has good recognition performance on nested entities.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the cylinder with bias represents unMER and the other cylinder represents bubble-bootstrapping). (Precision, recall, and F1 are plotted for the Family-doctor, Muzhi-doctor, and Qiuyi corpora.)

Table 5: Entity classification results on the corpus (%).

Entity category | Family-doctor P | R | F1 | Muzhi-doctor P | R | F1 | Qiuyi P | R | F1
All | 82.49 | 79.13 | 80.78 | 80.91 | 73.86 | 77.22 | 82.16 | 74.35 | 78.06
Body | 85.17 | 81.20 | 83.14 | 83.62 | 80.13 | 81.84 | 83.53 | 80.21 | 81.84
Disease | 80.45 | 82.16 | 81.30 | 81.96 | 82.67 | 82.31 | 79.26 | 81.68 | 80.45
Symptom | 78.19 | 60.84 | 68.43 | 74.54 | 61.73 | 67.53 | 76.26 | 61.52 | 68.10
Medicine | 82.31 | 79.62 | 80.94 | 80.63 | 75.84 | 78.16 | 84.57 | 78.39 | 81.36
Treatment | 76.86 | 67.25 | 71.73 | 75.24 | 63.59 | 68.93 | 76.61 | 61.82 | 68.42
Check | 75.14 | 65.53 | 70.01 | 74.50 | 67.26 | 70.70 | 73.48 | 63.72 | 68.25

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. A possible reason is that QCV relies on similar relationships in the KB between the mentions within a specific window, so it in effect uses context similarity for linking. Therefore, noise and missing information in the context reduce its linking performance. In contrast, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct the misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits target entities whose surface forms are entirely different, reducing the linking recall.

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment conducted with our recognized entities. Compared to the above linking results, both the precision and the recall decline somewhat. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the cylinder with bias represents unMER and the other cylinder represents BM-NER). (Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, and F1 for the categories check, body, disease, treatment, medicine, and symptom.)

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the cylinder with bias represents unMER and the other cylinder represents Dic-CRF). (Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, and F1 for the categories body, disease, check, medicine, symptom, and treatment.)

5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described unMERL, an unsupervised framework for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings. Amia Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

Corpus | P | R | F1 | A
Family-doctor | 82.64 | 73.26 | 77.67 | 83.23
Muzhi-doctor | 83.37 | 74.41 | 78.64 | 83.48
Qiuyi | 82.15 | 72.79 | 77.19 | 82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the cylinder with bias represents unMEL and the other cylinder represents QCV). (Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, accuracy, and F1 for QCV and unMEL.)

[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP ’12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI ’99/IAAI ’99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD’04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE ’09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA ’04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


medical KB of Baidu Baike and Sogou medical dictionaries. Baidu Baike contains more than 25,000 medical terms edited by authoritative organizations and experts. Sogou medical dictionaries, as the lexicon resource of Sogou's input method, collect data from multiple medical websites. Baidu Search is the largest Chinese search engine, with which we can obtain information that standardized LRs do not cover, such as emerging, rare, and variational medical entities.

(3) LR Preprocessing. Considering the heterogeneity and redundancy of the above offline LRs, we extract and fuse the medical terms from them to build a dictionary. In particular, we select specific categories of interest, which are also the targets of our entity classification. Table 2 presents the statistics of the self-built dictionary. In addition, to improve retrieval efficiency, we build indices using the first phonetic letter of each term.

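$$
t_m = \mathrm{maxOccur}(r_1, r_2, \ldots, r_j), \qquad
r_j =
\begin{cases}
\mathrm{LCS}(t_k, s_k),\; k \in K, & \text{if } \mathrm{Len}(\mathrm{LCS}(t_k, s_k)) > 1, \\
\mathrm{LCS}(t_k, t_i),\; k, i \in K,\; k \ne i, & \text{otherwise}.
\end{cases}
\tag{1}
$$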

Given a candidate entity, the results returned by Baidu Search contain not only the target medical term but also noise that degrades entity recognition. Therefore, we need to process the search results to obtain a clean medical term. Based on the common observation that correct results outnumber incorrect ones, the method relies on corpus statistics and "LCS" (a function that returns the longest common substring), as shown in (1). Given the search result set $S = \{(t_k, s_k)\}_{k=1}^{K}$ ($t_k$ represents a title and $s_k$ represents a summary), we first get the kernel term from each result by using the "LCS" function. However, not all summaries contain the kernel terms in the titles. Therefore, we add the process $\mathrm{LCS}(t_k, t_i)$ for the search results without a common substring. Finally, from the kernel term set, we select the most frequent term $t_m$ as the correct medical term.
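The kernel-term selection in (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `lcs` and `kernel_term` are hypothetical, the search results are made-up examples, and for results whose summary shares no usable substring with the title we assume the longest overlap with any other title is taken.

```python
from collections import Counter


def lcs(a: str, b: str) -> str:
    """Return the longest common substring of a and b (first one found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def kernel_term(results):
    """results: list of (title, summary) pairs; returns the most frequent kernel term or None."""
    kernels = []
    for k, (title, summary) in enumerate(results):
        r = lcs(title, summary)
        if len(r) <= 1:
            # No usable overlap with the summary: compare the title with the other titles.
            others = [lcs(title, t) for i, (t, _) in enumerate(results) if i != k]
            r = max(others, key=len, default="")
        if len(r) > 1:
            kernels.append(r)
    return Counter(kernels).most_common(1)[0][0] if kernels else None


# Made-up search results for a candidate such as "头孢拉定" (cefradine).
results = [
    ("头孢拉定_百科", "头孢拉定是一种抗生素"),
    ("头孢拉定的作用", "本品头孢拉定适用于感染"),
    ("头孢拉定胶囊", "头孢拉定胶囊说明书"),
    ("头孢拉定价格", "点击查看详情"),
]
print(kernel_term(results))
```

Counting kernel terms rather than trusting a single result reflects the assumption stated in the text that correct results outnumber incorrect ones.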

In addition, because the search engine cannot filter the nonmedical entities in the candidate set, we establish a medical keyword set. This includes "医" (medicine), "药" (drug), "病" (disease), and "症" (symptom). If a search result contains one or more keywords in the above set, we identify the candidate as a medical entity. If not, it is removed as a nonmedical entity.
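The keyword filter described above amounts to a substring test over the search results. The sketch below assumes the results are available as plain strings; `is_medical` is a hypothetical helper name and the sample strings are invented.

```python
# The four keywords listed in the text.
MEDICAL_KEYWORDS = ("医", "药", "病", "症")


def is_medical(search_results):
    """Return True if any search result contains at least one medical keyword."""
    return any(kw in text for text in search_results for kw in MEDICAL_KEYWORDS)


print(is_medical(["头孢拉定是一种抗生素药物"]))  # contains "药"
print(is_medical(["武汉天气预报"]))              # no keyword present
```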

(4) Medical Entity Detection. Once the medical terms are acquired from the offline and online LRs, the medical entities can be detected in the candidate set. Because the LRs differ in character (the offline LRs have high accuracy but limited coverage, whereas the online LR has high coverage but lower accuracy), we divide the detection process into offline and online processes. Given the candidate set, unMERL first performs offline detection with the self-built dictionary. For the candidates output as nonmedical, unMERL performs online detection with the Baidu search engine. Here, we exploit string matching and a text distance constraint to implement the detection process.

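$$
\mathrm{Sim}_r(t_{cm}, t_m) = \frac{\mathrm{Len}(\mathrm{LCS}(t_{cm}, t_m))}{\min(\mathrm{Len}(t_{cm}), \mathrm{Len}(t_m))} \tag{2}
$$

$$
\mathrm{Dis}(t_{cm}, t_m) = \left| \mathrm{Loc}(d_{t_{cm}}, t_{cm}) - \mathrm{Loc}(d_{t_{cm}}, t_m) \right| \tag{3}
$$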

Table 1: Constraints on POS tags, description of dependency labels, and candidate entities of the sentence in Figure 2.

| Notation | Description |
|---|---|
| POS tags | f (preposition of locality), m (measure word), b (distinguishing word), rr (personal pronoun), v (verbal word), gb∗ (word related to biology), or n∗ (noun) |
| Dependency labels | HED (head), SBV (subject-verb), VOB (verb-object), ATT (attribution), COO (coordination), RAD (right adjunct) |
| Candidate list | 骨髓纤维化 (myelofibrosis), 髓纤 (MF), 骨髓增生性疾病 (myeloproliferative disease), 武汉协和医院 (Wuhan Concorde Hospital), 治疗效果 (therapeutic effect) |

Figure 2: An example sentence with dependency parsing and POS tagging.

Table 2: Statistics of the user-defined dictionary.

| Category | Term number |
|---|---|
| Body | 1802 |
| Disease | 48120 |
| Symptom | 3698 |
| Medicine | 42047 |
| Treatment | 7403 |
| Check | 768 |


Given a candidate $t_{cm}$ and a medical term $t_m$ in the LRs, we use the ratio of the length of their longest common substring to the length of the shorter term as their name similarity, as in (2). This similarity computation can capture nested entities. For example, for a candidate "胸肌内膜炎" (endomysitis), if $t_m$ is "胸肌" (chest muscle), we can regard "胸肌" (chest muscle) as a nested entity. In addition, considering that some terms and their partial forms co-occur in a text, we use the text distance constraint to improve the detector's accuracy. For example, suppose a text contains both "头孢" (cephalosporin) and "头孢拉定" (cefradine), and the compared medical term is "头孢拉定" (cefradine). For the candidate "头孢" (cephalosporin), if the text distance constraint is not used, the output medical entity is "头孢拉定" (cefradine). Obviously, this is the incorrect surface form for the candidate "头孢" (cephalosporin). In (3), $d_{t_{cm}}$ represents the text containing the candidate entity. The function "Loc" computes the location of the second parameter in the first parameter. Using the above two equations, the specific detection process is given in Algorithm 1.

In Algorithm 1, the input includes the candidate entity set, the medical term set from the offline and online LRs, and the input text. The output is a set of medical entities. Given a candidate, we first compute its name similarity with each medical term in MT. If they are the same, the candidate is regarded as a medical entity. If not, we select the medical terms exceeding the predetermined similarity threshold θ for the text distance calculation. For each medical term, ranked by name similarity, if the text distance between the medical term and the candidate is under the threshold δ, the medical term is output as the correct expression of this candidate. In addition, considering the existence of misspelled ME names, we add the "Diff" function to recognize them. This involves counting the number of differing characters between a candidate and the medical term (with the highest name similarity). If the number is less than the threshold ϵ (in our experiments, it is set to half the number of characters in the medical term), we output the medical term as the correct expression of this candidate. For example, for a candidate "头孢拉丁" (cefradine), the compared medical term "头孢拉定" (cefradine) meets the above condition. Therefore, we output this medical term instead of the candidate.
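The detection logic of Algorithm 1 can be sketched in a few functions. This is an illustration under stated assumptions rather than the authors' code: "Loc" is taken to be a character offset (`str.find`), "Diff" is taken to be a position-wise mismatch count plus the length difference, and all function names are hypothetical. The defaults θ = 0.5 and δ = 3 are the values reported in the experiments.

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def name_similarity(candidate: str, term: str) -> float:
    """Equation (2): common-substring length over the shorter length."""
    return lcs_len(candidate, term) / min(len(candidate), len(term))


def text_distance(text: str, candidate: str, term: str) -> int:
    """Equation (3), assuming Loc is the character offset of the first occurrence."""
    return abs(text.find(candidate) - text.find(term))


def diff(a: str, b: str) -> int:
    """Assumed Diff: position-wise mismatches plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def detect(candidate, terms, text, theta=0.5, delta=3):
    """Return the medical term for a candidate, or None if nothing qualifies."""
    if candidate in terms:
        return candidate
    scored = [(name_similarity(candidate, t), t) for t in terms]
    ranked = sorted((st for st in scored if st[0] > theta), key=lambda st: st[0], reverse=True)
    for _, term in ranked:
        epsilon = len(term) / 2  # half the characters of the medical term
        near = term in text and text_distance(text, candidate, term) < delta
        if near or diff(candidate, term) < epsilon:
            return term
    return None


# Misspelled mention is mapped to the dictionary term.
print(detect("头孢拉丁", ["头孢拉定"], "患者服用头孢拉丁后好转"))
# Distance constraint keeps "头孢" from being replaced by "头孢拉定".
print(detect("头孢", ["头孢拉定"], "头孢和头孢拉定都是抗生素"))
```

In the second call the two strings start three characters apart, so the distance is not under δ = 3 and the candidate is left for online detection instead of being rewritten.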

3.1.2. Entity Classification. Entity categories are additional information for characterizing the entities mentioned. They are essential ingredients in many medical applications, such as medical dictionaries, medical KBs, and medical service systems. Our classification approach is partly inspired by the use of seed knowledge and context signature similarity in [8]. It differs from the classification approach in [8] in four ways. (1) In the collection of seed terms, we use the framework information in the terminology instead of the category tags, reducing classification error. Meanwhile, we classify some ME mentions based on text feature computation, thereby avoiding the constraints of dissimilar context and missing context. (2) Signature vector computation is refined through word embedding, which measures semantic similarity better than the TF-IDF method. (3) The filtering threshold is automatically generated by averaging the signature similarity of seed terms, thereby reducing labor costs and increasing filtering accuracy. (4) The seed set is scaled up continually to improve coverage. The classification is implemented in three steps: seed term collection, signature generation, and category decision.

(1) Seed Term Collection. This step involves collecting seed terms for entity categories, based on which the signature vectors of the categories will be generated in the subsequent step. Here, we utilize Baidu Baike to automatically gather the seed terms. In an in-depth analysis, we find that medical entities of the same class have similar framework information in Baidu Baike, which identifies the entity category more accurately than the category tag alone. Therefore, we design a seed collection approach based on text feature computation. Here, we define $T = \{s, a, d, c\}$ as the set of text features, with the subtitle "s", the attribute names "a" of the infobox, the directory names "d" of the content, and the category tags "c" in the entry page of Baidu Baike. The approach is implemented as follows. (1) From the self-built dictionary, we randomly select 50 terms from each category and extract and fuse their text features as the category signature. In particular, we use exact string matching to obtain unambiguous Baidu Baike entries. (2) For each candidate, we also crawl the feature information from Baidu Baike. Then, we calculate its string similarity with all category signatures by using (4) and classify the candidate into the category with the highest similarity. In particular, $W_c$ and $W_{cm}$ represent the word sets of a category signature and of the feature information for a candidate, respectively. Finally, the classified candidate entities are used as the seeds for the category signature computation in the next step.

Input: candidate set C, medical term set MT, text set T
Output: medical entity set ME

for c_i ∈ C, mt_j ∈ MT, t_z ∈ T do
    if c_i == mt_j then
        ME ← c_i
    end if
    set M = ∅
    if Sim(c_i, mt_j) > θ then
        M ← Rank(mt_j)
    end if
    for m_k ∈ M do
        if t_z contains m_k and Dis(c_i, m_k) < δ or Diff(c_i, m_k) < ϵ then
            ME ← m_k
            break
        end if
    end for
end for
return ME

Algorithm 1: Medical entity detection.


$$
\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|} \tag{4}
$$
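The overlap similarity in (4) and the feature-based category assignment from the seed term collection step can be sketched as below. The function names and the English feature words are hypothetical stand-ins for the Baidu Baike features; an empty candidate word set is assumed to score 0.

```python
def sim_c(candidate_words, category_words) -> float:
    """Equation (4): share of the candidate's words that also occur in the category signature."""
    w_cm, w_c = set(candidate_words), set(category_words)
    if not w_cm:
        return 0.0
    return len(w_cm & w_c) / len(w_cm)


def classify_by_features(candidate_words, category_signatures):
    """Pick the category whose feature signature overlaps most with the candidate."""
    return max(category_signatures, key=lambda c: sim_c(candidate_words, category_signatures[c]))


# Invented feature words standing in for subtitles, infobox attributes, and so on.
signatures = {
    "Disease": ["cause", "symptom", "treatment", "prevention"],
    "Medicine": ["ingredient", "dosage", "indication", "adverse reaction"],
}
candidate = ["cause", "symptom", "treatment", "department"]
print(sim_c(candidate, signatures["Disease"]))
print(classify_by_features(candidate, signatures))
```

Note that the measure is asymmetric: it is normalized by the candidate's word set only, so a short candidate profile fully contained in a category signature scores 1.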

(2) Signature Generation. This step involves transforming the medical terms (including candidates and seeds) and categories into signature vectors. Here, we use the phrase "term signature" to denote the vector of an ME mention or a seed term. Considering that the internal words of a term are descriptive, we use both the internal and context words for signature generation. To capture the semantic similarity between words, we exploit a word embedding approach to calculate the vector of a word. Here, we use the Word2Vec model, a distributed representation model that expresses the words in text as vectors based on deep learning technology [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase "category signature" to denote the vector of an entity category. This is computed by averaging the signature vectors of all seed terms belonging to the same class, following (5).
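The averaging step can be illustrated with a toy embedding table. The two-dimensional vectors below are invented for readability and do not come from a trained Word2Vec model; skipping words that have no embedding is an assumption of this sketch.

```python
def average_vectors(vectors):
    """Component-wise mean of equally sized vectors, as in equation (5)."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def term_signature(internal_words, context_words, embeddings):
    """Average the embeddings of a term's internal and context words."""
    words = [w for w in list(internal_words) + list(context_words) if w in embeddings]
    return average_vectors(embeddings[w] for w in words)


# Toy embeddings (hypothetical values).
embeddings = {"骨髓": [1.0, 0.0], "纤维化": [0.0, 1.0], "脾大": [1.0, 1.0]}
signature = term_signature(["骨髓", "纤维化"], ["脾大"], embeddings)
print(signature)

# A category signature is the same average taken over seed-term signatures.
category_signature = average_vectors([signature, [0.5, 0.5]])
print(category_signature)
```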

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbols are described in Table 3. The similarity between vectors is computed with cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. In addition, the filtering threshold is automatically computed by averaging the signature similarity of seed terms, following (7). In particular, $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C_{|c|}^{2}$ is the combination function counting the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add the classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

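$$
v_c = \frac{1}{|S|} \sum_{m \in S} v_m \tag{5}
$$

$$
\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^{2}} \times \sqrt{\sum_{i=1}^{I} v_{bi}^{2}}} \tag{6}
$$

$$
F(v_i, v_j) = \frac{1}{C_{|c|}^{2}} \sum_{i,j=1,\, i \ne j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j) \tag{7}
$$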
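Equations (6) and (7) can be sketched in plain Python. The filtering threshold is read here as the mean cosine similarity over all unordered pairs of seed signatures, which matches the normalization by the number of two-seed combinations; this reading, the function names, and the zero-vector convention are assumptions of the sketch.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    """Equation (6): cosine similarity; zero vectors are assumed to score 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtering_threshold(seed_signatures) -> float:
    """Equation (7): average similarity over all unordered seed pairs."""
    pairs = list(combinations(seed_signatures, 2))
    if not pairs:
        raise ValueError("at least two seed signatures are required")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


seeds = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(cosine([1.0, 0.0], [0.0, 1.0]))
print(filtering_threshold(seeds))
```

A tight cluster of seeds yields a threshold close to 1, so only candidates very similar to the category signature pass; a loose cluster lowers the bar automatically.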

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as a basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top candidate (after ranking) as the linking entity. The mentions without linking entities are regarded as NIL.

3.2.1. Candidate Entity Generation. In this stage, our goal is to increase the probability that the candidate set contains the target entity while controlling its size. To accomplish the first goal, we use a fuzzy string matching algorithm to compute the name similarity between a mention and all entities in the KB, in accordance with (8). The function "MCC" acquires the most common characters between two strings, in order. It handles abbreviations and acronyms as well as standard names. The entities exceeding the similarity threshold α are included in the candidate set. However, this algorithm may result in a large candidate set.

Table 3: Symbol description in Algorithm 2.

| Symbol | Description |
|---|---|
| M | A set containing each candidate entity m_i and its signature vector s_mi |
| F | A threshold set filtering the nonmedical entities |
| t_k | A seed signature set of the same class |
| s_a, s_b | Seed signature |
| c_j, c_t | Category name |
| s_cj, s_ct | Category signature of c_j or c_t |
| f_cj | Filtering threshold of c_j |

Input: candidate set M, seed signature set T, category signature set C
Output: medical entity-category set E

for (m_i, s_mi) ∈ M do
    set F = ∅, D = ∅
    for t_k ∈ T, s_a, s_b ∈ t_k do
        F ← F(s_a, s_b)
    end for
    for s_cj ∈ C, f_cj ∈ F do
        if Sim_cos(s_mi, s_cj) > f_cj then
            D ← (Sim_cos(s_mi, s_cj), c_j)
        end if
    end for
    if D ≠ ∅ then
        c_t ← arg max(D)
        E ← (m_i, c_t)
        T ← m_i
        update s_ct ∈ C by (5)
    end if
end for
return E

Algorithm 2: Medical entity classification.
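A compact Python reading of Algorithm 2 is sketched below. It is not the authors' implementation: the data layout (a dictionary from category name to seed signatures) and all names are hypothetical, the toy vectors are invented, and the threshold is averaged over unordered seed pairs. Category signatures and thresholds are recomputed from the seed set on every pass, which has the same effect as the update step in the algorithm.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def threshold(seed_sigs):
    pairs = list(combinations(seed_sigs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def classify(candidates, seeds):
    """candidates: list of (mention, signature); seeds: category -> list of seed signatures
    (each category needs at least two seeds). Returns (mention, category) pairs."""
    results = []
    for mention, sig in candidates:
        scores = {}
        for category, seed_sigs in seeds.items():
            score = cosine(sig, mean_vector(seed_sigs))
            if score > threshold(seed_sigs):  # filter out likely nonmedical terms
                scores[category] = score
        if scores:
            best = max(scores, key=scores.get)
            results.append((mention, best))
            seeds[best].append(sig)  # grow the seed set for later candidates
    return results


seeds = {
    "disease": [[1.0, 0.0], [0.9, 0.1]],
    "symptom": [[0.0, 1.0], [0.1, 0.9]],
}
candidates = [("m1", [0.95, 0.05]), ("m2", [0.5, 0.5])]
print(classify(candidates, seeds))
```

In this toy run the first candidate lies inside the disease cluster and is accepted, while the second is equally far from both clusters and falls under both thresholds, so it is discarded.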


To reduce the computational cost in the subsequent processing, we introduce the condition of category consistency to control the size. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute their similarity with the category signatures acquired in the seed term collection step, following (4). The candidates under a predefined threshold β are removed from the candidate set. This strategy also handles terms that have the same name but different meanings. For example, for an ME mention "传染病" (epidemic), its candidate set includes "传染病 (疾病)" (epidemic [disease]), "传染病 (游戏)" (contagion [game]), and "传染病 (电影)" (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$
\mathrm{Sim}_l(t_{me}, t_e) = \frac{\mathrm{MCC}(t_{me}, t_e)}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}
$$
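One way to realize (8) is sketched below, under the assumption that "MCC" counts the characters of the longest common subsequence (characters shared in order, not necessarily adjacent). The paper does not spell out MCC, so this interpretation and the function names are hypothetical; α = 0.5 is the value reported in the experiments.

```python
def mcc(a: str, b: str) -> int:
    """Assumed MCC: length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sim_l(mention: str, entity: str) -> float:
    """Equation (8): ordered common characters over the shorter name length."""
    shorter = min(len(mention), len(entity))
    return mcc(mention, entity) / shorter if shorter else 0.0


def generate_candidates(mention, kb_names, alpha=0.5):
    """Keep the KB entities whose name similarity exceeds alpha."""
    return [e for e in kb_names if sim_l(mention, e) > alpha]


# The abbreviation "髓纤" matches "骨髓纤维化" although its characters are not adjacent there.
print(sim_l("髓纤", "骨髓纤维化"))
print(generate_candidates("髓纤", ["骨髓纤维化", "传染病"]))
```

Under this reading, an abbreviation whose characters all appear in order in the full name reaches similarity 1, which is consistent with the claim that the measure handles abbreviations and acronyms.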

3.2.2. Candidate Entity Ranking. This stage aims to acquire the linking entity from the candidate set by ranking with a confidence score. We propose a collaborative inference method, which jointly exploits the name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common observation that the most important entity is the most frequently mentioned, we introduce the entity popularity to discriminate between the candidate entities. Here, we utilize the number of visits to the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Considering that the entity popularity is not the only decisive criterion, we establish a conversion to ensure its effectiveness without overwhelming the other measures. Given the visiting number n, the entity popularity is computed as

$$
p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n| + 1}} \tag{9}
$$

in which $|n|$ denotes the number of digits of n. For instance, the above integer is converted to 0.515348.
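The conversion in (9) is easy to check numerically. The sketch below uses a hypothetical function name and reproduces the worked example for 15348 visits.

```python
def popularity(visits: int) -> float:
    """Equation (9): put the digit count in the first decimal place, followed by the digits."""
    if visits <= 0:
        raise ValueError("the visiting number must be a positive integer")
    digits = len(str(visits))
    return (digits * 10 ** digits + visits) / 10 ** (digits + 1)


print(popularity(15348))   # five digits, so the value starts with 0.5
print(popularity(99999))
print(popularity(100000))  # six digits rank above any five-digit count
```

Because the digit count leads, the mapping preserves the ordering of visit counts while compressing them into a range comparable with the similarity scores; the value stays below 1 for counts of up to nine digits.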

Existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To increase the descriptive power of the context words of a mention, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, this extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of "骨髓纤维化" (myelofibrosis) is "髓纤" (MF) and "骨髓增生性疾病" (myeloproliferative disease). Then, we compute its string similarity with the description content of each candidate by using (4). In particular, $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate. Of note, before the similarity computation, we need to remove the stop words in the context and the description content.

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For such mentions, we add semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of entities cooccurring in text are also correlated and have overlapping context information. The specific method is as follows. (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) controlling whether the semantic correlation is computed. If the context similarities of all candidates are 0 or identical, λ = 1. If not, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set $L$ of the collaborators, the confidence score is computed using

$$
\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}
$$

The signs "P", "I", "D", and "A" denote the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
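The way λ switches the semantic-correlation term in (10) on and off can be sketched as follows. The component scores are passed in as precomputed numbers; the function names are hypothetical and the values are illustrative only, not the scores of the paper's example.

```python
def lambda_factor(context_sims) -> int:
    """lambda = 1 when all candidates share the same context similarity (including all zero)."""
    return 1 if len(set(context_sims)) <= 1 else 0


def confidence(name_sim, popularity, context_sim, semantic_corr, lam) -> float:
    """Equation (10) with its four components given as numbers."""
    return name_sim + popularity + context_sim + lam * semantic_corr


def rank(candidates):
    """candidates: list of dicts; returns (score, name) pairs, best first."""
    lam = lambda_factor([c["context_sim"] for c in candidates])
    scored = [
        (confidence(c["name_sim"], c["popularity"], c["context_sim"], c["semantic_corr"], lam), c["name"])
        for c in candidates
    ]
    return sorted(scored, reverse=True)


# Illustrative values: identical context similarity, so semantic correlation decides.
candidates = [
    {"name": "NS (nervous system)", "name_sim": 1.0, "popularity": 0.62, "context_sim": 0.01, "semantic_corr": 0.13},
    {"name": "NS (nephrotic syndrome)", "name_sim": 1.0, "popularity": 0.62, "context_sim": 0.01, "semantic_corr": 0.02},
    {"name": "NS (normal saline)", "name_sim": 1.0, "popularity": 0.61, "context_sim": 0.01, "semantic_corr": 0.00},
]
print(rank(candidates)[0][1])
```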

To better illustrate the ranking process, we provide an example. Given the text "NS…功能紊乱体现在失眠、多梦、盗汗…" (NS… the dysfunction is reflected in insomnia, dreaminess, and night sweats…), the recognized ME mentions are "NS", "失眠" (insomnia), "多梦" (dreaminess), and "盗汗" (night sweats). Through the previous process, we find that "NS" has multiple candidate entities with the same name, such as "NS (nervous system)", "NS (nephrotic syndrome)", and "NS (normal saline)". Their name similarity is 1, and their other measuring scores are as in Figure 3. Note that we only present part of the entity popularity value to save space. According to the confidence scores computed using (10), the candidate "NS (nervous system)" is selected as the linking entity with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework, including "家庭医生在线" (Family-doctor), "拇指医生" (Muzhi-doctor), and "求医网" (Qiuyi). Next, we randomly select 500 records from each corpus to recognize all medical entities, classify them into the six categories in Table 2, and link them to the KB manually. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign "NIL" denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (the name is ours, given to distinguish the method), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking. The seed terms are taken from our built dictionary. For the Dic-CRF method, we split 500 records into two subsets: two-thirds for training and one-third for testing. In ME linking, we select QCV (a language-independent and unsupervised method) [13] as a comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that the method acquires correctly. F1 is defined as $2 \times P \times R / (P + R)$. In addition, we use "accuracy" to measure the overall linking accuracy, as shown in (11). In particular, $|S_{link}|$ and $|S_{NIL}|$ denote the numbers of ME mentions that the method correctly links to KB entities and correctly leaves unlinked, respectively, and $|T|$ represents the number of ME mentions in the corpus.

$$
\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}
$$
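The measures above can be computed as in the following sketch, which treats predictions and gold annotations as sets of (mention, label) pairs. The function names and the sample annotations are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """P, R, and F1 over sets of (mention, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def linking_accuracy(correct_links: int, correct_nils: int, total_mentions: int) -> float:
    """Equation (11), in percent."""
    return 100 * (correct_links + correct_nils) / total_mentions


predicted = {("头孢拉定", "Medicine"), ("失眠", "Symptom"), ("手脚", "Body")}
gold = {("头孢拉定", "Medicine"), ("失眠", "Symptom"), ("盗汗", "Symptom"), ("多梦", "Symptom")}
print(precision_recall_f1(predicted, gold))
print(linking_accuracy(60, 20, 100))
```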

Figure 3: Example of linking the ME mention "NS" (C: context similarity; P: entity popularity; S: semantic correlation). Scores shown in the figure: C 0.01 for each candidate; P 0.6179, 0.6189, and 0.6177; S 0.13, 0.02, and 0.00.

Table 4: Statistics of the corpus.

| Corpus | MEs | MEs linking to KB | NIL |
|---|---|---|---|
| Family-doctor | 2531 | 1524 | 1007 |
| Muzhi-doctor | 1876 | 1096 | 780 |
| Qiuyi | 2189 | 1201 | 988 |

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3. This means that if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α is applied to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and online detection for all datasets. Recall improves noticeably after the online detection process. This indicates that online detection effectively mitigates the limited coverage of the dictionary-based method. However, the precision remains limited, mainly because some irrelevant terms in the candidate set are not filtered out by the online detection process. Therefore, in the entity classification stage, we add the filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all medical entities have been extracted correctly from text and that our task is to classify them into the predefined categories. Table 5 presents the classification results on each corpus. The overall performance reaches 81.85% precision and 75.84% recall. The lower recall arises because, when filtering the nonmedical entities, some medical entities are also removed by the filtering threshold. The performance on the target categories "symptom", "treatment", and "check" is somewhat low. One possible reason is that these entities are mostly classified based on the context signature similarity. However, the lack of identifying information in the context reduces the similarity score, thus degrading the classification performance.

Figure 4: Experimental results after offline detection and online detection on the corpus (%). Panels (a) to (c) plot precision, recall, and F1 after each detection stage.


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named "unMER") with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach on the symptom category only, because we only acquired the seeds of the symptom category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER's precision is slightly lower. One possible reason is that symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs. Therefore, they are mainly recognized by the online detection method. However, combined mentions produce diverse search results, from which it is difficult to get a complete term. For example, for the mention "手脚无力" (powerless hands and feet), the returned results contain "手脚无力" (powerless hands and feet), "四肢无力" (powerless limbs), and "手脚发软" (limp hands and feet). After online detection, the acquired entity is "手脚" (hands and feet) or "无力" (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because online Q&A text lacks normalization in its descriptions, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall. The F1 value of unMER increases by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows. (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities, thereby reducing the recall. In addition, the chunker relies on a common NLP tool, which has poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, and this resulted in semantic deviation for the BM-NER approach, reducing its classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that, for the body category, we do not have the features of Dic-CRF and hence do not present its result. On the three corpora, the F1 value of unMER is 15.01%, 13.68%, and 12.68% higher than that of the Dic-CRF approach, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation, which can easily lead to incorrect segmentation, especially for combined entities. In addition, the defined features have low coverage across entity types, which is also a reason for the low recall. Moreover, the informal descriptions in online medical text also reduce the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER has good recognition performance on nested entities.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only, showing precision, recall, and F1 on the Family-doctor, Muzhi-doctor, and Qiuyi corpora (note: the hatched cylinder represents unMER and the other cylinder represents bubble-bootstrapping).

Table 5: Entity classification results on the corpus (%).

| Entity category | Family-doctor P | Family-doctor R | Family-doctor F1 | Muzhi-doctor P | Muzhi-doctor R | Muzhi-doctor F1 | Qiuyi P | Qiuyi R | Qiuyi F1 |
|---|---|---|---|---|---|---|---|---|---|
| All | 82.49 | 79.13 | 80.78 | 80.91 | 73.86 | 77.22 | 82.16 | 74.35 | 78.06 |
| Body | 85.17 | 81.20 | 83.14 | 83.62 | 80.13 | 81.84 | 83.53 | 80.21 | 81.84 |
| Disease | 80.45 | 82.16 | 81.30 | 81.96 | 82.67 | 82.31 | 79.26 | 81.68 | 80.45 |
| Symptom | 78.19 | 60.84 | 68.43 | 74.54 | 61.73 | 67.53 | 76.26 | 61.52 | 68.10 |
| Medicine | 82.31 | 79.62 | 80.94 | 80.63 | 75.84 | 78.16 | 84.57 | 78.39 | 81.36 |
| Treatment | 76.86 | 67.25 | 71.73 | 75.24 | 63.59 | 68.93 | 76.61 | 61.82 | 68.42 |
| Check | 75.14 | 65.53 | 70.01 | 74.50 | 67.26 | 70.70 | 73.48 | 63.72 | 68.25 |

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named "unMEL") compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all entities have been extracted correctly from text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54%, on each corpus, respectively. A possible explanation is that, because it relies on similar relationships in the KB between the mentions within a specific window, QCV essentially uses context similarity for linking. Therefore, noise and missing information in the context reduce its linking performance. In contrast, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct the misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits some target entities whose surface forms are entirely different, reducing the linking recall.

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment with our recognized entities. Compared to the above linking results, both the precision and recall show some decline. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus; panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi show precision, recall, and F1 for the check, body, disease, treatment, medicine, and symptom categories (note: the cylinder with bias represents unMER, and the other cylinder represents BM-NER).

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus; panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi show precision, recall, and F1 for the body, disease, check, medicine, symptom, and treatment categories (note: the cylinder with bias represents unMER, and the other cylinder represents Dic-CRF).


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described an unsupervised framework for recognizing and linking medical entities from Chinese online medical text, namely, unMERL. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text with both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings Amia Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

Corpus          P      R      F1     A
Family-doctor   82.64  73.26  77.67  83.23
Muzhi-doctor    83.37  74.41  78.64  83.48
Qiuyi           82.15  72.79  77.19  82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus; panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi show precision, recall, F1, and accuracy (note: the cylinder with bias represents unMEL, and the other cylinder represents QCV).


[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD'04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


Given a candidate $t_{cm}$ and a medical term $t_m$ in the LRs, we use the ratio of the length of their longest common substring to the length of the shorter term as their name similarity, as in (2). This similarity computation can capture nested entities. For example, for the candidate “胸肌内膜炎” (endomysitis), if $t_m$ is “胸肌” (chest muscle), we can regard “胸肌” (chest muscle) as a nested entity. In addition, considering that some terms and their fractional terms exist together in a text, we use a text distance constraint to improve the detector's accuracy. For example, suppose a text contains both “头孢” (cephalosporin) and “头孢拉定” (cefradine), and the compared medical term is “头孢拉定” (cefradine). For the candidate “头孢” (cephalosporin), if the text distance constraint is not used, the output medical entity is “头孢拉定” (cefradine). Obviously, this is the incorrect surface form for the candidate “头孢” (cephalosporin). In (3), the sign $d_{t_{cm}}$ represents the text containing the candidate entity. The function “Loc” computes the location of its second parameter in its first parameter. Using the above two equations, the specific detection process is given in Algorithm 1.
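As a minimal sketch of the name similarity described above, assuming (2) is the length of the longest common substring divided by the length of the shorter term, the computation could look as follows in Python. The function names are illustrative and are not part of unMERL.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def name_similarity(candidate: str, term: str) -> float:
    """Longest-common-substring length over the length of the shorter term."""
    if not candidate or not term:
        return 0.0
    common = longest_common_substring(candidate, term)
    return len(common) / min(len(candidate), len(term))
```

Under this reading, the nested pair “胸肌内膜炎” and “胸肌” scores 1.0, because the shorter term is fully contained in the longer one.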

In Algorithm 1, the input includes the candidate entity set, the medical term set from the offline and online LRs, and the input text. The output is a set of medical entities. Given a candidate, we first compute its name similarity with each medical term in MT. If they are the same, the candidate is regarded as a medical entity. If not, we select the medical terms exceeding the predetermined similarity threshold θ for performing the text distance calculation. For each medical term, ranked by name similarity, if the text distance between the medical term and the candidate is under the threshold δ, the medical term is output as the correct expression of this candidate. In addition, considering the existence of misspelled ME names, we add the “Diff” function to recognize them. This involves counting the number of different characters between a candidate and the medical term (with the highest name similarity). If the number is less than the threshold ϵ (in our experiments, it is set to half the number of characters in the medical term), we output the medical term as the correct expression of this candidate. For example, for the candidate “头孢拉丁” (cefradine), the compared medical term is “头孢拉定” (cefradine), meeting the above condition. Therefore, we output this medical term instead of the candidate.
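The detection loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the text distance is taken to be the offset gap between the candidate and the nearest occurrence of the term, “Diff” is taken to be position-wise mismatches plus the length difference, and candidates are assumed to carry their offset in the text. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from math import inf


def name_similarity(a: str, b: str) -> float:
    """Longest-common-substring length over the shorter length (our reading of Eq. 2)."""
    if not a or not b:
        return 0.0
    match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return match.size / min(len(a), len(b))


def text_distance(text: str, cand_pos: int, term: str) -> float:
    """ASSUMPTION: offset gap between the candidate and the nearest occurrence
    of the term in the text; infinite if the term does not occur."""
    best = inf
    start = text.find(term)
    while start != -1:
        best = min(best, abs(start - cand_pos))
        start = text.find(term, start + 1)
    return best


def char_diff(a: str, b: str) -> int:
    """ASSUMPTION: position-wise mismatches plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def detect_entities(candidates, medical_terms, text, theta=0.5, delta=3):
    """candidates: list of (surface form, offset in text). Returns entity names."""
    entities = []
    for surface, pos in candidates:
        if surface in medical_terms:  # exact match: the candidate is an entity
            entities.append(surface)
            continue
        scored = [(name_similarity(surface, t), t) for t in medical_terms]
        ranked = sorted((st for st in scored if st[0] > theta), reverse=True)
        for _, term in ranked:
            near = text_distance(text, pos, term) < delta
            misspelled = char_diff(surface, term) < len(term) / 2  # epsilon
            if near or misspelled:
                entities.append(term)  # output the term instead of the candidate
                break
    return entities
```

With these assumptions, a standalone “头孢” far from any occurrence of “头孢拉定” is not rewritten, while the misspelling “头孢拉丁” is mapped to “头孢拉定”.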

3.1.2. Entity Classification. Entity categories are additional information for characterizing the entities mentioned. They are essential ingredients in many medical applications, such as medical dictionaries, medical KBs, and medical service systems. Our classification approach is partly inspired by the use of seed knowledge and context signature similarity in [8]. Our approach differs from the classification approach in [8] in the following four ways. (1) In the collection of seed terms, we use the framework information in the terminology instead of the category tags, reducing classification error. Meanwhile, we classify some ME mentions based on text feature computation, thereby avoiding the constraint of dissimilar context and the lack of context. (2) Signature vector computation is refined through word embedding, which can measure semantic similarity better than the TF-IDF method. (3) The filtering threshold is automatically generated by averaging the signature similarity of seed terms, thereby reducing labor costs and increasing filtering accuracy. (4) The seed set is scaled up continually to improve coverage. The classification is implemented by applying the following three steps: seed term collection, signature generation, and category decision.

(1) Seed Term Collection. This step involves collecting seed terms for entity categories, based on which the signature vectors of the categories will be generated in the subsequent step. Here, we utilize Baidu Baike to automatically gather the seed terms. In an in-depth analysis, we find that the medical entities of the same class have similar framework information in Baidu Baike, which is more accurate than only using the category tag in identifying the entity category. Therefore, we design a text feature computation-based seed collection approach. Here, we define T = {s, a, d, c} as the set of text features, with the subtitle “s”, the attribute names “a” of the infobox, the directory names “d” of the content, and the category tags “c” in the entry page of Baidu Baike. The approach is implemented as follows. (1) From the self-built dictionary, we randomly select 50 terms from each category to extract and fuse their text features as the category signature. In particular, we exploit an exact string matching algorithm to produce unambiguous Baidu Baike entries. (2) For each candidate, we also crawl the feature information from Baidu Baike. Then, we calculate its string similarity with all category signatures by using (4) and classify this candidate to the category with the highest similarity. In particular, the signs $W_c$ and $W_{cm}$ represent the word sets of a category signature and the feature information for a candidate, respectively. Finally, the classified candidate entities are used as the seeds for the category signature computation in the next step.

Input: candidate set C, medical term set MT, text set T
Output: medical entity set ME

for c_i ∈ C, mt_j ∈ MT, t_z ∈ T do
    if c_i == mt_j then
        ME ← c_i
    end if
    set M = ∅
    if Sim(c_i, mt_j) > θ then
        M ← Rank(mt_j)
    end if
    for m_k ∈ M do
        if t_z contains m_k and Dis(c_i, m_k) < δ or Diff(c_i, m_k) < ϵ then
            ME ← m_k
            break
        end if
    end for
end for
return ME

Algorithm 1: Medical entity detection.


$$\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|} \tag{4}$$
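To make (4) concrete, the following sketch computes the word-set overlap and assigns a candidate to the category with the highest score. The feature words in the usage note are made up for illustration; in the paper they would come from Baidu Baike subtitles, infobox attributes, directory names, and category tags.

```python
def overlap_similarity(candidate_words, category_words) -> float:
    """Eq. (4): share of the candidate's words that also occur in the category signature."""
    w_cm, w_c = set(candidate_words), set(category_words)
    if not w_cm:
        return 0.0
    return len(w_cm & w_c) / len(w_cm)


def assign_seed_category(candidate_words, category_signatures):
    """Return (best category, score) over a dict mapping category -> word set."""
    scores = {
        category: overlap_similarity(candidate_words, words)
        for category, words in category_signatures.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For example, a candidate whose page exposes the hypothetical feature words {cause, symptom, diagnosis, complication} shares three of its four words with a disease signature {cause, symptom, treatment, prevention, diagnosis}, giving a similarity of 0.75. Note that the measure is asymmetric: it is normalized by the candidate's word set only.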

(2) Signature Generation. This step involves transforming the medical terms (including candidates and seeds) and categories into signature vectors. Here, we use the phrase “term signature” to denote the vector of an ME mention or a seed term. Considering that the internal words have descriptive ability for a term, we use both the internal and context words for signature generation. To capture the semantic similarity between words, we exploit a word embedding approach to calculate the vector value of a word. Here, we use the Word2Vec model, a distributed representation model, to express the words in text as vectors based on deep learning technology [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase “category signature” to denote the vector of an entity category. This is computed by averaging the signature vectors of all seed terms belonging to the same class, following (5).

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbol description is shown in Table 3. The similarity between vectors is calculated with cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. In addition, the filtering threshold is automatically computed by averaging the signature similarity of seed terms, following (7). In particular, $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C^2_{|c|}$ is the combination function, counting the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add the classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

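$$v_c = \frac{1}{|S|} \sum_{m \in S} v_m \tag{5}$$

$$\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^2} \times \sqrt{\sum_{i=1}^{I} v_{bi}^2}} \tag{6}$$

$$F(v_i, v_j) = \frac{1}{C^2_{|c|}} \sum_{i,j=1,\, i \neq j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j) \tag{7}$$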

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as a basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top candidate (after ranking) as the linking entity. The mentions without linking entities are regarded as NIL.

3.2.1. Candidate Entity Generation. In this stage, our goal is to increase the probability of the candidate set containing the target entity and to control its size. To accomplish the first goal, we use a fuzzy string matching algorithm to compute the name similarity between a mention and all entities in the KB, in accordance with (8). The function “MCC” acquires the most common characters between two strings in order. Besides standard names, it also handles abbreviations and acronyms well. The entities exceeding the similarity threshold α are included in the candidate set. However, this algorithm may result in a large candidate set.

Table 3: Symbol description in Algorithm 2.

Symbol                  Description
$M$                     A set containing each candidate entity $m_i$ and its signature vector $s_{m_i}$
$F$                     A threshold set filtering the nonmedical entities
$t_k$                   A seed signature set of the same class
$s_a$, $s_b$            Seed signature
$c_j$, $c_t$            Category name
$s_{c_j}$, $s_{c_t}$    Category signature of $c_j$ or $c_t$
$f_{c_j}$               Filtering threshold of $c_j$

Input: candidate set M, seed signature set T, category signature set C
Output: medical entity-category set E

for (m_i, s_mi) ∈ M do
    set F = ∅, D = ∅
    for t_k ∈ T, (s_a, s_b) ∈ t_k do
        F ← F(s_a, s_b)
    end for
    for s_cj ∈ C, f_cj ∈ F do
        if Sim_cos(s_mi, s_cj) > f_cj then
            D ← (Sim_cos(s_mi, s_cj), c_j)
        end if
    end for
    if D ≠ ∅ then
        c_t ← arg max(D)
        E ← (m_i, c_t)
        T ← m_i
        update s_ct ∈ C by (5)
    end if
end for
return E

Algorithm 2: Medical entity classification.
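The classification step can be illustrated with a small pure-Python sketch of (5) to (7) and the loop of Algorithm 2. It assumes that term signatures are already available as plain lists of floats (in the paper they are averaged Word2Vec vectors) and reads the filtering threshold in (7) as the mean cosine similarity over unordered pairs of seeds. The function names and the toy vectors in the usage note are illustrative only.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Eq. (5): component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b) -> float:
    """Eq. (6): cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtering_threshold(seed_signatures) -> float:
    """Eq. (7), read as the mean cosine similarity over unordered seed pairs."""
    pairs = list(combinations(seed_signatures, 2))
    if not pairs:
        return 0.0  # ASSUMPTION: no filtering when a class has fewer than two seeds
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def classify(candidates, seeds):
    """candidates: name -> signature; seeds: category -> list of seed signatures.

    Returns name -> category for candidates that pass a threshold. The seed
    lists are extended in place, so later candidates see updated signatures.
    """
    assigned = {}
    for name, signature in candidates.items():
        passing = []
        for category, seed_signatures in seeds.items():
            similarity = cosine(signature, mean_vector(seed_signatures))
            if similarity > filtering_threshold(seed_signatures):
                passing.append((similarity, category))
        if passing:
            _, best = max(passing)
            assigned[name] = best
            seeds[best].append(signature)  # grow the seed set of the chosen class
    return assigned
```

With two toy classes whose seeds point along different axes, a candidate close to one class is assigned to it, while a candidate halfway between the classes falls below both thresholds and is filtered out as nonmedical.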


To reduce the computational cost in the subsequent processing, we introduce the condition of category consistency to control the size. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute their similarity with the category signatures acquired in the seed term collection step, following (4). The candidates under a predefined threshold β are removed from the candidate set. This strategy can still handle terms that have the same name but different meanings. For example, for the ME mention “传染病” (epidemic), its candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{|\mathrm{MCC}(t_{me}, t_e)|}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
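A possible reading of (8) is sketched below, assuming that “MCC” corresponds to the longest common subsequence (characters shared by both strings in the same order, not necessarily adjacent). That interpretation, the function names, and the toy KB in the usage note are assumptions made for illustration.

```python
def mcc_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def link_name_similarity(mention: str, entity: str) -> float:
    """Eq. (8): ordered common characters over the length of the shorter name."""
    if not mention or not entity:
        return 0.0
    return mcc_length(mention, entity) / min(len(mention), len(entity))


def generate_candidates(mention, kb_names, alpha=0.5):
    """Keep KB entity names whose similarity to the mention exceeds alpha."""
    return [name for name in kb_names if link_name_similarity(mention, name) > alpha]
```

Because the shared characters need not be adjacent, an abbreviation such as “髓纤” still reaches a similarity of 1.0 against “骨髓纤维化”, which is the behavior the text attributes to this fuzzy matching.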

3.2.2. Candidate Entity Ranking. This stage aims to acquire the linking entity in the candidate set by ranking with a confidence score. We propose a collaborative inference method, which jointly exploits the name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce the entity popularity for discriminating between the candidate entities. Here, we use the number of visits to the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Considering that the entity popularity is not the only decisive criterion, we establish a conversion to ensure its effectiveness and to avoid overwhelming the other measures. Given the number of visits n, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n|+1}} \tag{9}$$

in which $|n|$ expresses the number of digits of n. For instance, the above integer ($|n| = 5$) is translated into 0.515348.
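The conversion in (9) can be checked with a few lines of Python. The sketch assumes that the first factor in the numerator is the digit count of n, which is the reading that reproduces the worked example (15348 becomes 0.515348); the function name is illustrative.

```python
def entity_popularity(visits: int) -> float:
    """Eq. (9): prefix the visit count with its digit count and scale below 1."""
    if visits <= 0:
        raise ValueError("the number of visits must be a positive integer")
    digits = len(str(visits))
    return (digits * 10 ** digits + visits) / 10 ** (digits + 1)
```

Under this reading, the leading decimal digit encodes the order of magnitude of the visit count, so larger counts always map to larger scores, and the score stays below 1 as long as the count has fewer than ten digits.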

The existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise information in the context. To increase the descriptive ability of the context words of a mention, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, this extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute its string similarity with the description content of each candidate by using (4). In particular, the signs $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate. Of note, before the similarity computation, we need to remove the stop words in the context and the description content.

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity between different candidates. Moreover, some mentions may have no context information. For these mentions, we add the semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of the cooccurring entities in text are also correlated and that they have overlapping context information. The specific method is as follows. (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) controlling whether the semantic correlation is computed. If the context similarity of each candidate is 0 or the same, λ = 1. If not, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set L of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}$$

The signs “P”, “I”, “D”, and “A” express the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
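To show how the terms of (10) combine, the sketch below ranks candidates from precomputed component scores and switches on the semantic correlation term only when the context similarities cannot separate the candidates. The `Candidate` container, the function name, and the scores in the usage note are made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    name_sim: float      # Sim_l(t_me, t_ce), Eq. (8)
    popularity: float    # P(t_ce), Eq. (9)
    context_sim: float   # Sim_c(I(t_me), D(t_ce)), Eq. (4)
    semantic_sim: float  # Sim_c over anchors and noun phrases of the collaborators


def rank_candidates(candidates):
    """Return (confidence score, name) pairs sorted from best to worst."""
    # lambda = 1 when every candidate has the same context similarity (all zero
    # is a special case), so the semantic correlation must break the tie.
    lam = 1 if len({c.context_sim for c in candidates}) <= 1 else 0
    scored = [
        (c.name_sim + c.popularity + c.context_sim + lam * c.semantic_sim, c.name)
        for c in candidates
    ]
    return sorted(scored, reverse=True)
```

For instance, three candidates with identical name and context similarity are separated by popularity plus semantic correlation, whereas a single differing context similarity sets λ to 0 and the semantic term is ignored. An empty result corresponds to a NIL mention.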

In order to better understand the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠、多梦、盗汗…” (NS… the dysfunction is reflected in insomnia, dreaminess, and night sweats…), the recognized ME mentions are “NS”, “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system)”, “NS (nephrotic syndrome)”, and “NS (normal saline)”. Their name similarity is 1, and their other measuring scores are shown in Figure 3. It must be noted that we only present the partial value of the entity popularity to save space. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework, including “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus to recognize all medical entities, classify them into the six categories in Table 3, and link them to the KB manually. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we have given it this name to distinguish it), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking. The seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [23] as the comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). In particular, $|S_{link}|$ and $|S_{NIL}|$ express the numbers of ME mentions that the method correctly links to entities in the KB or correctly leaves unlinked (NIL), respectively, and $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}$$
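The measures above can be written down directly. The following sketch uses hypothetical function names and treats empty denominators as a score of zero, which is a convention chosen here rather than something stated in the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(num_correct: int, num_predicted: int, num_gold: int):
    """P, R, and F1 from counts of correct, predicted, and gold-standard objects."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall, f1_score(precision, recall)


def linking_accuracy(correct_links: int, correct_nils: int, total_mentions: int) -> float:
    """Eq. (11): correctly linked plus correctly NIL mentions, in percent."""
    return 100 * (correct_links + correct_nils) / total_mentions
```

As a sanity check on the definition, the Family-doctor row of Table 6 reports P = 82.64 and R = 73.26, and `f1_score(82.64, 73.26)` rounds to the reported F1 of 77.67.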

Figure 3: Example of linking the ME mention “NS”. The figure connects the ME mention “NS” to its candidate entities “NS (nervous system)”, “NS (nephrotic syndrome)”, and “NS (normal saline)”, together with the collaborator-linking entities and description content, and annotates each candidate with its context similarity (C), entity popularity (P), and semantic correlation (S).

Table 4: Statistics of the corpus.

Corpus          MEs    MEs linking to KB   NIL
Family-doctor   2531   1524                1007
Muzhi-doctor    1876   1096                780
Qiuyi           2189   1201                988

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage for real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3. This means that if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α is applied to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and online detection for all datasets. Recall improves noticeably after the online detection process. This indicates that online detection is effective in overcoming the coverage limitation of the dictionary-based method. However, the precision remains limited, mainly because some irrelevant terms in the candidate set are not filtered by the online detection process. Therefore, in the entity classification stage, we add the filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. We assume that all medical entities have been extracted correctly from the text and that our task is to classify them into the predefined categories. Table 5 presents the classification results for each corpus. The overall performance reaches a precision of 81.85% and a recall of 75.84%. The lower recall arises because, when filtering the nonmedical entities, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom”, “treatment”, and “check” is somewhat low. One possible reason is that these entities are mostly classified based on the context signature similarity. However, the lack of identifying information in the context reduces the similarity score, thus impacting the classification performance.

Offline detection Online detection

Muzhi-doctor

PrecisionRecallF1

50

60

70

80

90

100

(a)

Muzhi-doctor

PrecisionRecallF1

Offline detection Online detection50

60

70

80

90

100

(b)

Muzhi-doctor

PrecisionRecallF1

Offline detection Online detection50

60

70

80

90

100

(c)

Figure 4 Experimental results after offline detection and online detection on the corpus ()


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach. The comparison covers only the symptom category because we acquired seeds only for that category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER's precision is slightly low. One possible reason is that symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs, so they are mainly recognized by the online detection method. Combined mentions, however, produce diverse search results, from which it is difficult to obtain a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet); after online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its descriptions, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall, and its F1 value is higher by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows. (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities and thereby reduces recall. In addition, the chunker relies on a common NLP tool, which had poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, and this resulted in semantic deviation for the BM-NER approach, reducing its classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that, for the body category, we do not have the features of Dic-CRF and hence do not present its results. On the three corpora, the F1 value of unMER is higher than that of the Dic-CRF approach by 15.01%, 13.68%, and 12.68%, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation, which can easily lead to incorrect segmentation, especially for combined entities. In addition, the defined features have low coverage across all entity types, which is another reason for its low recall. Moreover, the informal descriptions in online medical text also reduce the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER performs well on nested entities.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the cylinder with bias represents unMER and the other cylinder represents bubble-bootstrapping). [Stacked bar chart; groups: Family-doctor, Muzhi-doctor, Qiuyi; series: precision, recall, F1; vertical axis 0–260.]

Table 5: Entity classification results on the corpus (%).

Entity category    Family-doctor            Muzhi-doctor             Qiuyi
                   P      R      F1         P      R      F1         P      R      F1
All                82.49  79.13  80.78      80.91  73.86  77.22      82.16  74.35  78.06
Body               85.17  81.20  83.14      83.62  80.13  81.84      83.53  80.21  81.84
Disease            80.45  82.16  81.30      81.96  82.67  82.31      79.26  81.68  80.45
Symptom            78.19  60.84  68.43      74.54  61.73  67.53      76.26  61.52  68.10
Medicine           82.31  79.62  80.94      80.63  75.84  78.16      84.57  78.39  81.36
Treatment          76.86  67.25  71.73      75.24  63.59  68.93      76.61  61.82  68.42
Check              75.14  65.53  70.01      74.50  67.26  70.70      73.48  63.72  68.25

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. This is possibly due to the similar relationship in the KB between the mentions within the specific window. QCV essentially relies on context similarity for linking, so noise and a lack of information in the context reduce its linking performance. unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module we correct misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits target entities whose surface forms are entirely different from the mention and thereby reduces the linking recall.

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment conducted with our recognized entities. Compared to the above linking results, both precision and recall decline somewhat. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the cylinder with bias represents unMER and the other cylinder represents BM-NER). [Stacked bar charts; panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; categories: check, body, disease, treatment, medicine, symptom; series: precision, recall, F1; vertical axis 0–260.]

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the cylinder with bias represents unMER and the other cylinder represents Dic-CRF). [Stacked bar charts; panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; categories: body, disease, check, medicine, symptom, treatment; series: precision, recall, F1; vertical axis 0–260.]


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described unMERL, an unsupervised framework for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings. AMIA Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

Corpus           P      R      F1     A
Family-doctor    82.64  73.26  77.67  83.23
Muzhi-doctor     83.37  74.41  78.64  83.48
Qiuyi            82.15  72.79  77.19  82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the cylinder with bias represents unMEL and the other cylinder represents QCV). [Bar charts; panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; series: QCV, unMEL; metrics: precision, recall, accuracy, F1; vertical axis 50–100.]


[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD '04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.



$$\mathrm{Sim}_c(W_{cm}, W_c) = \frac{|W_{cm} \cap W_c|}{|W_{cm}|} \tag{4}$$

(2) Signature Generation. This step transforms the medical terms (including candidates and seeds) and the categories into signature vectors. Here, we use the phrase “term signature” to denote the vector of an ME mention or a seed term. Considering that the internal words of a term have descriptive ability, we use both the internal words and the context words for signature generation. To capture the semantic similarity between words, we use a word embedding approach to compute the vector of each word. Specifically, we use the Word2Vec model, a distributed representation model, to express the words in the text as vectors based on deep learning technology [26]. The training corpus consists of the input corpus, the description content of all medical terms in Baidu Baike, and the search results of Baidu Search. The final term signature vector is computed by averaging all word vectors in accordance with (5). In addition, we use the phrase “category signature” to denote the vector of an entity category, which is computed by averaging the signature vectors of all seed terms belonging to the same class, again following (5).
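The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding table `toy_embeddings` holds hypothetical 2-dimensional values rather than trained Word2Vec vectors, the function names are invented for this example, and skipping words that have no embedding is an assumption.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors, as in (5)."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def term_signature(words: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Average the embeddings of a term's internal and context words.

    Words without an embedding are skipped (an assumption of this sketch).
    """
    known = [embeddings[w] for w in words if w in embeddings]
    return average_vectors(known)


def category_signature(seed_signatures: Sequence[Vector]) -> Vector:
    """Average the term signatures of all seed terms of one category."""
    return average_vectors(seed_signatures)


# Hypothetical toy embeddings, for illustration only.
toy_embeddings = {"头": [1.0, 0.0], "痛": [0.0, 1.0]}
signature = term_signature(["头", "痛", "未登录词"], toy_embeddings)
category = category_signature([signature, [1.0, 0.0]])
```

In practice the embedding table would come from a trained word embedding model; only the averaging logic is shown here.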

(3) Category Decision. Once all term signatures and category signatures are generated, the category of each candidate is identified by using Algorithm 2. The symbol description is shown in Table 3. The similarity between vectors is computed with the cosine similarity, following (6). Through Algorithm 2, each candidate exceeding the filtering threshold is assigned to the category with the highest similarity. The filtering threshold is computed automatically by averaging the signature similarity of the seed terms, following (7). In particular, $|c|$ is the number of seed terms belonging to a class, corresponding to $|t_k|$ in Algorithm 2, and $C_{|c|}^{2}$ is the combination function, counting the number of combinations of any two seeds. Finally, to increase the coverage of the seed set, we add each classified candidate to the relevant seed signature set and then update the filtering threshold and the category signature.

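$$v_c = \frac{1}{|S|} \sum_{m_i \in S} v_{m_i} \tag{5}$$

$$\mathrm{Sim}_{\cos}(v_a, v_b) = \frac{\sum_{i=1}^{I} v_{ai} \times v_{bi}}{\sqrt{\sum_{i=1}^{I} v_{ai}^{2}} \times \sqrt{\sum_{i=1}^{I} v_{bi}^{2}}} \tag{6}$$

$$F(v_i, v_j) = \frac{1}{C_{|c|}^{2}} \sum_{i,j=1,\, i \neq j}^{|c|} \mathrm{Sim}_{\cos}(v_i, v_j) \tag{7}$$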
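As a rough sketch of (6) and (7), the cosine similarity and the per-category filtering threshold could be computed as below. The function names are illustrative, and averaging over the unordered seed pairs (the $C_{|c|}^{2}$ combinations) is an assumed reading of (7).

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors, as in (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def filtering_threshold(seed_signatures: Sequence[Vector]) -> float:
    """Mean pairwise cosine similarity of one category's seed signatures.

    Averages over all unordered pairs of seeds (assumed reading of (7)).
    """
    pairs = list(combinations(seed_signatures, 2))
    if not pairs:
        raise ValueError("need at least two seed signatures")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Three toy seed signatures: two identical, one orthogonal.
threshold = filtering_threshold([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

With these toy seeds, one pair has similarity 1 and two pairs have similarity 0, so the threshold is one third.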

3.2. Medical Entity Linking. We use the medical KB of Baidu Baike as the basic KB. To increase the accuracy of the similarity calculation, we use the medical KB of Hudong Baike (http://www.baike.com/sitecategory-10.html) to expand the description information of the entities in this basic KB. The method is as follows: for each entity in the KB, we acquire its page from Hudong Baike and then extract the description content and category information.

In accordance with the procedure of entity linking [27], the ME linking module has two stages: candidate entity generation and ranking. For each ME mention, the module first obtains its candidate entities from the KB and then selects the top candidate (after ranking) as the linking entity. The mentions without linking entities are regarded as NIL.

3.2.1. Candidate Entity Generation. In this stage, our goal is to increase the probability that the candidate set contains the target entity while controlling its size. To accomplish the first goal, we use a fuzzy string matching algorithm to compute the name similarity between a mention and all entities in the KB, in accordance with (8). The function “MCC” acquires the most common characters between two strings, in order. Besides standard names, it handles abbreviations and acronyms well. The entities exceeding the similarity threshold α are included in the candidate set. However, this algorithm may result in a large candidate set.

Table 3: Symbol description in Algorithm 2.

Symbol              Description
M                   A set containing each candidate entity m_i and its signature vector s_mi
F                   A threshold set filtering the nonmedical entities
t_k                 A seed signature set of the same class
s_a, s_b            Seed signature
c_j, c_t            Category name
s_cj, s_ct          Category signature of c_j or c_t
f_cj                Filtering threshold of c_j

Input: candidate set M, seed signature set T, category signature set C
Output: medical entity-category set E
for (m_i, s_mi) ∈ M do
    set F = ∅, D = ∅
    for t_k ∈ T, (s_a, s_b) ∈ t_k do
        F ← F(s_a, s_b)
    end for
    for s_cj ∈ C, f_cj ∈ F do
        if Sim_cos(s_mi, s_cj) > f_cj then
            D ← (Sim_cos(s_mi, s_cj), c_j)
        end if
    end for
    if D ≠ ∅ then
        c_t ← arg max(D)
        E ← (m_i, c_t)
        T ← m_i
        update s_ct ∈ C by (5)
    end if
end for
return E

Algorithm 2: Medical entity classification.
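The classification loop of Algorithm 2 can be sketched in Python as below. This is an illustrative reading rather than the authors' code: the data layout (dictionaries of plain lists), the function names, and the toy signatures are all assumptions, and the threshold and category signature are simply recomputed from the current seed set on every pass, which has the same effect as the update step.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity, as in (6)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean, as in (5)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(candidates, seeds):
    """Assign each candidate to its most similar category, or filter it out.

    candidates: {term: signature}; seeds: {category: [seed signatures]}.
    """
    seeds = {cat: list(sigs) for cat, sigs in seeds.items()}  # local copy
    entity_category = {}
    for term, signature in candidates.items():
        passed = []
        for cat, sigs in seeds.items():
            pairs = list(combinations(sigs, 2))
            if not pairs:
                continue  # threshold undefined with fewer than two seeds
            threshold = sum(cosine(a, b) for a, b in pairs) / len(pairs)
            score = cosine(signature, mean_vector(sigs))
            if score > threshold:
                passed.append((score, cat))
        if passed:
            best = max(passed)[1]
            entity_category[term] = best
            seeds[best].append(signature)  # grow the seed set
    return entity_category


# Hypothetical toy signatures, for illustration only.
toy_seeds = {
    "disease": [[1.0, 0.0], [0.9, 0.1]],
    "symptom": [[0.0, 1.0], [0.1, 0.9]],
}
toy_candidates = {"x": [1.0, 0.05], "y": [0.7, 0.7]}
result = classify(toy_candidates, toy_seeds)
```

In this toy run, candidate `x` clears the disease threshold and is classified, while `y` falls below both thresholds and is filtered out as a nonmedical term.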


To reduce the computational cost of the subsequent processing, we introduce the condition of category consistency to control the size of the candidate set. The specific method is as follows: for each candidate, we acquire its text features in Baidu Baike and then compute their similarity with the category signatures acquired in the section on seed term collection, following (4). Candidates below a predefined threshold β are removed from the candidate set. This strategy still handles terms that have the same name but different meanings. For example, for the ME mention “传染病” (epidemic), the candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{|\mathrm{MCC}(t_{me}, t_e)|}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
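A minimal sketch of the name similarity in (8) follows. It assumes that “MCC” (the most common characters between two strings, in order) can be modeled as the longest common subsequence; the text does not spell out the exact algorithm, so this is an interpretation. The function names and the tiny KB name list are hypothetical, and the default α = 0.5 is the value reported in the experiments.

```python
from typing import List


def mcc_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b
    (assumed stand-in for the paper's MCC function)."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def name_similarity(mention: str, entity: str) -> float:
    """Common in-order characters divided by the shorter length, as in (8)."""
    shortest = min(len(mention), len(entity))
    if shortest == 0:
        return 0.0
    return mcc_length(mention, entity) / shortest


def generate_candidates(mention: str, kb_names: List[str], alpha: float = 0.5) -> List[str]:
    """Keep the KB entity names whose name similarity exceeds alpha."""
    return [name for name in kb_names if name_similarity(mention, name) > alpha]


candidates = generate_candidates("髓纤", ["骨髓纤维化", "感冒"])
```

Dividing by the shorter length is what lets an abbreviation such as “髓纤” score 1.0 against the full name “骨髓纤维化”.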

3.2.2. Candidate Entity Ranking. This stage aims to select the linking entity from the candidate set by ranking the candidates with a confidence score. We propose a collaborative inference method, which jointly exploits name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce entity popularity to discriminate between the candidate entities. Here, we use the number of visits to the Baidu Baike page to indicate the entity popularity, which is a positive integer (e.g., 15348). Considering that entity popularity is not the only decisive criterion, we apply a conversion that keeps it effective without letting it dominate the other measures. Given the visiting number n, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n|+1}} \tag{9}$$

in which $|n|$ denotes the number of digits of n. For instance, the above integer (15348) is translated into 0.515348.
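The conversion in (9) can be checked with a few lines of Python. The function name is illustrative; the formula is the one reconstructed from the worked example in the text (15348 maps to 0.515348).

```python
def entity_popularity(visits: int) -> float:
    """Map a positive visit count to a value in (0, 1), as in (9).

    The leading decimal digit encodes the number of digits of the count,
    so a count with more digits always ranks above one with fewer.
    """
    if visits <= 0:
        raise ValueError("visit count must be a positive integer")
    digits = len(str(visits))
    return (digits * 10 ** digits + visits) / 10 ** (digits + 1)


popularity = entity_popularity(15348)
```

Because the digit count is placed in front of the count itself, the ordering of visit counts is preserved while the value stays below 1 for counts of up to nine digits, which keeps popularity on a scale comparable to the other similarity terms.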

Existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To increase the descriptive ability of the context words of a mention, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, it extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute its similarity with the description content of each candidate by using (4), in which $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate, respectively. Of note, before the similarity computation, we remove the stop words from the context and the description content.
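The overlap measure in (4), applied after stop-word removal, might look like the following sketch. It assumes the dependency-based extraction and word segmentation have already produced the word lists; the function name, the example word lists, and the stop-word set are hypothetical.

```python
from typing import AbstractSet, Iterable


def context_similarity(
    mention_context: Iterable[str],
    candidate_description: Iterable[str],
    stop_words: AbstractSet[str] = frozenset(),
) -> float:
    """Share of the mention's context words that also occur in the
    candidate's description, as in (4), after removing stop words."""
    w_cm = set(mention_context) - set(stop_words)
    w_c = set(candidate_description) - set(stop_words)
    if not w_cm:
        return 0.0  # a mention without context words contributes nothing
    return len(w_cm & w_c) / len(w_cm)


score = context_similarity(
    ["髓纤", "骨髓增生性疾病", "的"],
    ["骨髓增生性疾病", "脾大", "的"],
    stop_words={"的"},
)
```

Normalizing by the size of the mention's context set rather than by the union keeps long candidate descriptions from being penalized.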

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For such mentions, we add semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of entities cooccurring in a text are also correlated and have overlapping context information. The specific method is as follows. (1) In the context of a mention, we select some ME mentions (with linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) controlling whether the semantic correlation is computed: if the context similarity of every candidate is 0 or the same, λ = 1; otherwise, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set L of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c(I(t_{me}), D(t_{ce})) + \lambda\, \mathrm{Sim}_c\Big(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Big) \tag{10}$$

The signs “P,” “I,” “D,” and “A” denote the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
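The way the four terms of (10) combine, including the λ switch, can be sketched as follows. The component scores are taken as precomputed inputs, and the dictionary keys, function name, and numeric values are hypothetical placeholders rather than results from the paper.

```python
from typing import Dict


def confidence_scores(candidates: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine precomputed component scores into confidence scores, as in (10).

    Each candidate maps to its 'name', 'popularity', 'context', and
    'semantic' scores. The semantic term is switched on (lambda = 1) only
    when the context similarity is the same for every candidate, which
    includes the case where it is 0 for all of them.
    """
    context_values = {scores["context"] for scores in candidates.values()}
    lam = 1 if len(context_values) == 1 else 0
    return {
        name: scores["name"]
        + scores["popularity"]
        + scores["context"]
        + lam * scores["semantic"]
        for name, scores in candidates.items()
    }


# Hypothetical component scores, for illustration only.
tied_context = {
    "A": {"name": 1.0, "popularity": 0.60, "context": 0.01, "semantic": 0.13},
    "B": {"name": 1.0, "popularity": 0.62, "context": 0.01, "semantic": 0.02},
}
scores = confidence_scores(tied_context)
best = max(scores, key=scores.get)
```

When the context similarity cannot separate the candidates, the semantic correlation with the collaborators decides the ranking; otherwise it is ignored.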

To illustrate the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠、多梦、盗汗…” (NS… the dysfunction is reflected in insomnia, dreaminess, and night sweats…), the recognized ME mentions are “NS,” “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system),” “NS (nephrotic syndrome),” and “NS (normal saline).” Their name similarity is 1, and their other measuring scores are shown in Figure 3. Note that we present only part of the entity popularity value to save space. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework, including “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus and manually recognize all medical entities, classify them into the six categories in Table 3, and link them to the KB. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we have given it this name to distinguish it), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking, and the seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [23] as the comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). In particular, $|S_{link}|$ and $|S_{NIL}|$ denote the numbers of ME mentions that the method correctly links to entities in the KB and correctly leaves unlinked (NIL), respectively, and $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}$$
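The measures above are straightforward to compute; the sketch below uses illustrative function names and checks the F1 definition against the Family-doctor row of Table 6 (P = 82.64, R = 73.26, F1 = 77.67).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(correct: int, acquired: int, valid: int):
    """P = correct / acquired objects, R = correct / valid objects."""
    precision = correct / acquired if acquired else 0.0
    recall = correct / valid if valid else 0.0
    return precision, recall


def linking_accuracy(correct_links: int, correct_nil: int, total_mentions: int) -> float:
    """Percentage of mentions handled correctly, linked or NIL, as in (11)."""
    return 100 * (correct_links + correct_nil) / total_mentions


table6_f1 = round(f1_score(82.64, 73.26), 2)
```

Accuracy counts correctly identified NIL mentions as successes, which is why it can exceed precision on corpora with many unlinkable mentions.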

Figure 3: Example of linking the ME mention “NS.” [Diagram; ME mention: NS; candidate entities: NS (nervous system), NS (nephrotic syndrome), NS (normal saline); collaborator-linking entities: insomnia, dreaminess, night sweats; description content nodes: nervous system, central nervous system, kidney, tissue fluid; legend: C = context similarity, P = entity popularity, S = semantic correlation; values shown: C 0.01 for each candidate; P 0.6179, 0.6189, 0.6177; S 0.13, 0.02, 0.00.]

Table 4: Statistics of the corpus.

Corpus           MEs    MEs linking to KB    NIL
Family-doctor    2531   1524                 1007
Muzhi-doctor     1876   1096                 780
Qiuyi            2189   1201                 988

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3; that is, if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α applies to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.

431 Medical Entity Recognition As mentioned above ourME recognition module is divided into two stages boundarydetection and entity classification In order to evaluate theeffectiveness of our proposed methods fully we show theexperimental results of each stage in detail

(1) Boundary Detection To validate the effectiveness ofonline detection Figure 4 presents the experimental resultsafter offline detection and online detection for all datasetsRecall has a noticeable improvement after the online

detection process It is therefore proven that online detectionis efficient in solving the limitation problem of thedictionary-based method However the precision has somelimitations The main limitation is that some irrelevant termsin the candidate set are not filtered by the online detectionprocess Therefore in the entity classification stage we addthe filtering threshold to remove these terms

(2) Entity Classification To evaluate the entity classificationmethod on its own we conduct an experiment with thestandard entity boundaries for all medical entities in the cor-pus Assuming that all medical entities have been extractedcorrectly from text and that our task is to classify them intothe predefined categories Table 5 presents the classificationresults of each corpus The overall performance is significantat an 8185 precision level and a 7584 recall level Thelower recall is because when filtering the nonmedical entitiessome medical entities are removed by the filtering thresholdthereby reducing the recall The performance of the targetcategories ldquosymptomrdquo ldquotreatmentrdquo and ldquocheckrdquo is somewhatlow One possible reason for this is that these entities aremostly classified based on the context signature similarityHowever the lack of the identifying information in thecontext reduces the similarity score thus impacting theclassification performance

Figure 4: Experimental results after offline detection and online detection on the corpus (%). [Bar charts, panels (a)–(c); each panel compares offline detection with online detection on precision, recall, and F1; vertical axis 50–100.]


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach. The comparison covers only the symptom category because we acquired seeds only for that category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER’s precision is slightly lower. One possible reason is that symptom mentions are diverse, resulting in low coverage in the offline LRs; they are therefore mainly recognized by the online detection method. However, combined mentions produce diverse search results, from which it is difficult to obtain a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text is not normalized in its descriptions, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall; its F1 value is higher by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows. (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities, thereby reducing the recall. In addition, the chunker relies on a general-purpose NLP tool, which performs poorly at recognizing medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire the word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, and this resulted in semantic deviation for the BM-NER approach, reducing the classification performance.

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that, for the body category, we do not have the features of Dic-CRF and hence do not present its results. On the three corpora, the F1 value of unMER is higher than that of Dic-CRF by 15.01%, 13.68%, and 12.68%, respectively. Analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation, which can easily lead to incorrect segmentation, especially for combined entities. In addition, the defined features have low coverage across all entity types, which is another reason for the low recall. Moreover, the informal descriptions in online medical text also reduce the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER recognizes nested entities well.

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. One possible explanation is that QCV relies on the relationships in the KB between mentions that fall within a specific window, which in effect amounts to linking by context similarity; noise and missing information in the context therefore reduce its linking performance. In contrast, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which misses target entities whose surface forms are entirely different from the mention, reducing the linking recall.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the cylinder with bias represents unMER, and the other cylinder represents bubble-bootstrapping). [Chart over the Family-doctor, Muzhi-doctor, and Qiuyi corpora; metrics: precision, recall, and F1.]

Table 5: Entity classification results on the corpus (%).

                  Family-doctor          Muzhi-doctor           Qiuyi
Entity category   P      R      F1       P      R      F1       P      R      F1
All               82.49  79.13  80.78    80.91  73.86  77.22    82.16  74.35  78.06
Body              85.17  81.20  83.14    83.62  80.13  81.84    83.53  80.21  81.84
Disease           80.45  82.16  81.30    81.96  82.67  82.31    79.26  81.68  80.45
Symptom           78.19  60.84  68.43    74.54  61.73  67.53    76.26  61.52  68.10
Medicine          82.31  79.62  80.94    80.63  75.84  78.16    84.57  78.39  81.36
Treatment         76.86  67.25  71.73    75.24  63.59  68.93    76.61  61.82  68.42
Check             75.14  65.53  70.01    74.50  67.26  70.70    73.48  63.72  68.25

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results obtained when linking is run on the entities recognized by our own recognition module. Compared to the linking results above, both precision and recall decline somewhat. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering out nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents BM-NER). [Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi; each shows precision, recall, and F1 for the categories check, body, disease, treatment, medicine, and symptom.]

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents Dic-CRF). [Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi; each shows precision, recall, and F1 for the categories body, disease, check, medicine, symptom, and treatment.]


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described unMERL, an unsupervised framework for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. First, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Second, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings Amia Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney, et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%).

                 unMERL
Corpus           P      R      F1     A
Family-doctor    82.64  73.26  77.67  83.23
Muzhi-doctor     83.37  74.41  78.64  83.48
Qiuyi            82.15  72.79  77.19  82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the cylinder with bias represents unMEL, and the other cylinder represents QCV). [Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi; each compares QCV with unMEL on precision, recall, F1, and accuracy; vertical axis 50–100.]

[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP ’12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI ’99/IAAI ’99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen, et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD ’04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE ’09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA ’04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He, et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang, et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


To reduce the computational cost of subsequent processing, we introduce a category consistency condition to control the size of the candidate set. Specifically, for each candidate, we acquire its text features from Baidu Baike and then compute their similarity with the category signatures acquired in the seed term collection step, following (4). Candidates whose similarity falls below a predefined threshold β are removed from the candidate set. This strategy still handles terms that have the same name but different meanings. For example, for the ME mention “传染病” (epidemic), the candidate set includes “传染病 (疾病)” (epidemic [disease]), “传染病 (游戏)” (contagion [game]), and “传染病 (电影)” (contagion [film]). Through the category constraint, the latter two candidates are removed.

$$\mathrm{Sim}_l(t_{me}, t_e) = \frac{\mathrm{MCC}(t_{me}, t_e)}{\min(\mathrm{Len}(t_{me}), \mathrm{Len}(t_e))} \tag{8}$$
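The name similarity in (8) can be sketched in a few lines of Python. MCC is not defined in this part of the text, so the sketch assumes it is the length of the longest run of consecutive characters shared by the two strings; the function names `longest_common_run` and `name_similarity` are hypothetical and only illustrate the ratio.

```python
def longest_common_run(a: str, b: str) -> int:
    """Length of the longest contiguous character run shared by a and b.

    Assumption: this stands in for MCC in (8), which is defined elsewhere.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def name_similarity(mention: str, entity: str) -> float:
    """Shared-run length divided by the length of the shorter string."""
    shorter = min(len(mention), len(entity))
    if shorter == 0:
        return 0.0
    return longest_common_run(mention, entity) / shorter


print(name_similarity("手脚无力", "四肢无力"))  # shares "无力": 2 / 4
print(name_similarity("NS", "NS"))
```

Dividing by the shorter length means that an abbreviation or partial name fully contained in the entity name scores 1 under this assumption.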

3.2.2. Candidate Entity Ranking. This stage aims to identify the linking entity in the candidate set by ranking the candidates with a confidence score. We propose a collaborative inference method, which jointly exploits the name similarity, entity popularity, context similarity, and the semantic correlation between entities.

Specifically, the name similarity of the mention and its candidates is computed using (8). In addition, based on the common knowledge that the most important entity is the most frequently mentioned, we introduce the entity popularity to discriminate between the candidate entities. Here, we use the number of visits to the entity’s Baidu Baike page, a positive integer (e.g., 15348), to indicate the entity popularity. Because entity popularity is not the only decisive criterion, we apply a conversion that keeps it effective without letting it dominate the other measures. Given the number of visits n, the entity popularity is computed as

$$p(n) = \frac{|n| \times 10^{|n|} + n}{10^{|n|+1}} \tag{9}$$

in which |n| denotes the number of digits of n. For instance, the above integer is converted into 0.515348.
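The conversion in (9) can be checked with a minimal Python sketch. The function name `entity_popularity` is hypothetical; the formula is read off (9) and agrees with the worked example, where 15348 visits map to 0.515348.

```python
def entity_popularity(visits: int) -> float:
    """Map a positive visit count n to (|n| * 10^|n| + n) / 10^(|n| + 1).

    |n| is the number of decimal digits of n, so the first decimal place of
    the result is the digit count and the remaining places are n itself.
    """
    if visits <= 0:
        raise ValueError("visits must be a positive integer")
    digits = len(str(visits))
    return (digits * 10**digits + visits) / 10 ** (digits + 1)


print(entity_popularity(15348))  # 0.515348
```

The score grows with the visit count and stays below 1 for counts of up to nine digits, which keeps popularity on a scale comparable to the similarity terms.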

Existing context similarity-based approaches generally extract the words in a fixed window, which ignores the noise in the context. To make the context words of a mention more descriptive, we explore a relevant information extraction approach based on the dependency relationships between words. Specifically, it extracts all words that have a dependency relationship with a mention as the context information. For example, in Figure 2, the relevant information of “骨髓纤维化” (myelofibrosis) is “髓纤” (MF) and “骨髓增生性疾病” (myeloproliferative disease). Then, we compute its string similarity with the description content of each candidate by using (4), where $W_{cm}$ and $W_c$ represent the context word sets of a mention and a candidate, respectively. Note that, before the similarity computation, stop words must be removed from both the context and the description content.

However, the context information acquired by the above extraction approach is limited. It may result in the same context similarity for different candidates. Moreover, some mentions may have no context information. For these mentions, we add semantic correlation knowledge for ranking, based on the hypothesis that the linking entities of entities co-occurring in a text are also correlated and have overlapping context information. The specific method is as follows. (1) In the context of a mention, we select some ME mentions (with their linking entities) as the collaborators. (2) We extract the anchors and other noun phrases (which are more descriptive than other words) from the description content of these linking entities and of the candidates of the mention, respectively. (3) The context similarity between each candidate and all linking entities is computed, and the candidate with the highest similarity is regarded as the target entity.

In conclusion, the confidence score of the candidate entities can be computed by using (10). λ is a control factor (with value 1 or 0) that controls whether the semantic correlation is computed: if the context similarity of every candidate is 0 or the same, λ = 1; otherwise, λ = 0. Given a mention $t_{me}$, a candidate $t_{ce}$, and the linking entity set L of the collaborators, the confidence score is computed using

$$\mathrm{CS}(t_{me}, t_{ce}, L) = \mathrm{Sim}_l(t_{me}, t_{ce}) + P(t_{ce}) + \mathrm{Sim}_c\bigl(I(t_{me}), D(t_{ce})\bigr) + \lambda\, \mathrm{Sim}_c\Bigl(A(t_{ce}), \bigcup_{t_{le_k} \in L} A(t_{le_k})\Bigr) \tag{10}$$

The signs “P,” “I,” “D,” and “A” denote the entity popularity, the relevant context information, the description content, and the special content containing only anchors and noun phrases in the KB, respectively.
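The way the four terms of (10) combine, including the switch λ, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four component scores are taken as precomputed inputs, and the names `CandidateScores` and `rank_candidates` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CandidateScores:
    name: str
    name_sim: float       # Sim_l(t_me, t_ce)
    popularity: float     # P(t_ce)
    context_sim: float    # Sim_c(I(t_me), D(t_ce))
    semantic_corr: float  # Sim_c(A(t_ce), anchors of the collaborators)


def rank_candidates(cands: List[CandidateScores]) -> List[Tuple[str, float]]:
    """Return (name, confidence) pairs sorted by descending confidence."""
    # lambda = 1 only when context similarity cannot separate the
    # candidates, i.e. it is identical (possibly 0) for all of them.
    lam = 1 if len({c.context_sim for c in cands}) == 1 else 0
    scored = [
        (c.name,
         c.name_sim + c.popularity + c.context_sim + lam * c.semantic_corr)
        for c in cands
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Scores for the "NS" example; the pairing of popularity and semantic values
# with the two non-selected candidates is an assumption for illustration.
example = [
    CandidateScores("NS (nervous system)", 1.0, 0.6177, 0.01, 0.13),
    CandidateScores("NS (nephrotic syndrome)", 1.0, 0.6179, 0.01, 0.02),
    CandidateScores("NS (normal saline)", 1.0, 0.6189, 0.01, 0.00),
]
for name, score in rank_candidates(example):
    print(name, round(score, 4))
```

In the "NS" example all candidates share the same context similarity, so λ = 1 and the semantic correlation decides the ranking. The score of 0.7577 quoted for “NS (nervous system)” equals 0.6177 + 0.01 + 0.13, so it appears to leave out the name similarity of 1 that all three candidates share; the sketch includes that term, which shifts every score by the same amount and does not change the order.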

To better illustrate the ranking process, we provide an example. Given the text “NS…功能紊乱体现在失眠、多梦、盗汗…” (NS… the dysfunction is reflected in insomnia, dreaminess, and night sweats…), the recognized ME mentions are “NS,” “失眠” (insomnia), “多梦” (dreaminess), and “盗汗” (night sweats). Through the previous process, we find that “NS” has multiple candidate entities with the same name, such as “NS (nervous system),” “NS (nephrotic syndrome),” and “NS (normal saline).” Their name similarity is 1, and their other scores are shown in Figure 3. Note that, to save space, we present only part of the value of the entity popularity. According to the confidence scores computed using (10), the candidate “NS (nervous system)” is selected as the linking entity, with the highest score (0.7577).

4. Experiments

4.1. Experimental Data. We crawl 5000 medical Q&A text records from three Chinese medical websites to evaluate our proposed framework: “家庭医生在线” (Family-doctor), “拇指医生” (Muzhi-doctor), and “求医网” (Qiuyi). Next, we randomly select 500 records from each corpus and manually recognize all medical entities, classify them into the six categories in Table 3, and link them to the KB. In total, we recognize 6596 ME mentions and link 3821 mentions to the correct entries in the KB; the statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without a linking entity in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as Dic-CRF [18], a supervised method (we have given it this name to distinguish it), as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking, and the seed terms are taken from our built dictionary. For the Dic-CRF method, we split the 500 records into two subsets: two-thirds for training and one-third for testing. For ME linking, we select QCV (a language-independent and unsupervised method) [23] as the comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that are correctly acquired by the method. F1 is defined as 2 × P × R/(P + R). In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11), where $|S_{link}|$ and $|S_{NIL}|$ denote the numbers of ME mentions that the method correctly links to entities in the KB and correctly leaves unlinked (NIL), respectively, and $|T|$ represents the number of ME mentions in the corpus.

$$\mathrm{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \tag{11}$$
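The measures above are simple ratios of counts, as the following Python sketch shows. The function and parameter names are hypothetical, and the counts in the usage lines are made up purely to exercise the formulas; they are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(num_correct: int, num_returned: int,
                        num_gold: int) -> Tuple[float, float, float]:
    """P = correct / returned, R = correct / gold, F1 = 2PR / (P + R)."""
    p = num_correct / num_returned if num_returned else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def linking_accuracy(correct_links: int, correct_nil: int,
                     total_mentions: int) -> float:
    """Accuracy in percent, following (11): (|S_link| + |S_NIL|) / |T| * 100."""
    return 100.0 * (correct_links + correct_nil) / total_mentions


# Illustrative counts only.
p, r, f1 = precision_recall_f1(num_correct=8, num_returned=10, num_gold=16)
print(p, r, round(f1, 4))
print(linking_accuracy(correct_links=70, correct_nil=20, total_mentions=100))
```

Unlike P and R, accuracy also credits mentions that are correctly left unlinked, which is why it can exceed the linking precision when a corpus contains many NIL mentions.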

Figure 3: Example of linking the ME mention “NS.” [Diagram legend: C, context similarity; P, entity popularity; S, semantic correlation. Nodes: the ME mention “NS”; the candidate entities NS (nervous system), NS (nephrotic syndrome), and NS (normal saline), with NS (nervous system) selected as the target entity; the collaborator-linking entities insomnia, dreaminess, and night sweats; description content terms including nervous system, central nervous system, kidney, and tissue fluid. Scores shown: C 0.01 for each candidate; P 0.6177, 0.6179, and 0.6189; S 0.13, 0.02, and 0.00.]

Table 4: Statistics of the corpus.

Corpus          MEs    MEs linking to KB   NIL
Family-doctor   2531   1524                1007
Muzhi-doctor    1876   1096                780
Qiuyi           2189   1201                988

4.3. Experimental Results and Discussion. To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage of real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for the name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3; that is, if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively: α is the threshold on the name similarity between a mention and an entity in the KB, and β is the threshold for category consistency.

431 Medical Entity Recognition As mentioned above ourME recognition module is divided into two stages boundarydetection and entity classification In order to evaluate theeffectiveness of our proposed methods fully we show theexperimental results of each stage in detail

(1) Boundary Detection To validate the effectiveness ofonline detection Figure 4 presents the experimental resultsafter offline detection and online detection for all datasetsRecall has a noticeable improvement after the online

detection process It is therefore proven that online detectionis efficient in solving the limitation problem of thedictionary-based method However the precision has somelimitations The main limitation is that some irrelevant termsin the candidate set are not filtered by the online detectionprocess Therefore in the entity classification stage we addthe filtering threshold to remove these terms

(2) Entity Classification To evaluate the entity classificationmethod on its own we conduct an experiment with thestandard entity boundaries for all medical entities in the cor-pus Assuming that all medical entities have been extractedcorrectly from text and that our task is to classify them intothe predefined categories Table 5 presents the classificationresults of each corpus The overall performance is significantat an 8185 precision level and a 7584 recall level Thelower recall is because when filtering the nonmedical entitiessome medical entities are removed by the filtering thresholdthereby reducing the recall The performance of the targetcategories ldquosymptomrdquo ldquotreatmentrdquo and ldquocheckrdquo is somewhatlow One possible reason for this is that these entities aremostly classified based on the context signature similarityHowever the lack of the identifying information in thecontext reduces the similarity score thus impacting theclassification performance

Offline detection Online detection

Muzhi-doctor

PrecisionRecallF1

50

60

70

80

90

100

(a)

Muzhi-doctor

PrecisionRecallF1

Offline detection Online detection50

60

70

80

90

100

(b)

Muzhi-doctor

PrecisionRecallF1

Offline detection Online detection50

60

70

80

90

100

(c)

Figure 4 Experimental results after offline detection and online detection on the corpus ()

9Journal of Healthcare Engineering

(3) Overall Recognition Performance We compare the overallperformance of our recognition approach (named ldquounMERrdquo)with the unsupervised and supervised methods describedabove in Figures 5ndash7

Figure 5 shows the experimental results of unMERcompared with the bubble-bootstrapping approach This isbecause we only acquired the seeds of the symptom categoryfor bubble-bootstrapping The results show that unMERsignificantly outperforms bubble-bootstrapping in terms ofrecall However unMERrsquos precision is slightly low Onepossible reason for this is that the symptomatic entity men-tions are diverse resulting in low coverage in the offlineLRs Therefore they are mainly recognized by the onlinedetection method However the combined mentions pro-duce diverse search results from which it is difficult to get acomplete term For example for the mention ldquo手脚无力rdquo(powerless hands and feet) the returned results containldquo手脚无力rdquo (powerless hands and feet) ldquo四肢无力rdquo (power-less limbs) and ldquo手脚发软rdquo (limp hands and feet) Afteronline detection the acquired entity is ldquo手脚rdquo (hands andfeet) or ldquo无力rdquo (powerless) In addition the low recall ofthe bubble-bootstrapping approach is because the onlineQampA text lacks normalization in its description reducingthe performance of pattern matching

Figure 6 shows the experimental results of unMERcompared with the BM-NER approach Obviously unMERoutperforms BM-NER in both precision and recall The valueof F1 of unMER increases 2612 2752 and 2578 onthree corpora The reasons are as follows (1) The BM-NERapproach uses a noun phrase chunker to extract candidateentities which does not consider the nested entities therebyreducing the recall In addition the chunker utilizes a com-mon NLP tool which had poor recognition performancefor the medical entity boundary (2) The IDF filter removesmany common medical entities (3) We exploit a distributedword embedding approach to acquire the word vector whichwell considers the semantic similarity between words thanthe TF-IDF algorithm of BM-NER (4) Our built dictionarycontains many incorrect seed categories and this resultedin semantic deviation for the BM-NER approach reducingthe classification performance

Figure 7 shows the experimental results of unMER com-pared with Dic-CRF on each corpus Note that for the bodycategory we do not have the features of Dic-CRF and hencedo not present its measuring result On three corpora the F1

value of unMER increases 1501 1368 and 1268 thanDic-CRF approach respectively By analyzing the experi-ments we find that the high recall of unMERmainly dependson the online detection process which demonstrates thevalidity of using a search engine for recognizing medical enti-ties However Dic-CRF uses a medical dictionary for wordsegmentation this can easily lead to incorrect segmentationespecially for the combined entities In addition the definedfeatures have low coverage in all entity types which is alsoa reason for the low recall Moreover the informal descrip-tion of the online medical text also reduces the recognitionperformance of the CRF model In terms of precisionunMER yields comparative results and even exceeds Dic-CRF in some categories This is due to our combination ofmultiple offline LRs thereby increasing the coverage ofmedical entities Moreover unMER has good recognitionperformance in the nested entities

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (precision, recall, and F1 on the Family-doctor, Muzhi-doctor, and Qiuyi corpora). Note: the cylinder with bias represents unMER, and the other cylinder represents bubble-bootstrapping.

Table 5: Entity classification results on the corpus (%).

| Entity category | Family-doctor P | Family-doctor R | Family-doctor F1 | Muzhi-doctor P | Muzhi-doctor R | Muzhi-doctor F1 | Qiuyi P | Qiuyi R | Qiuyi F1 |
|---|---|---|---|---|---|---|---|---|---|
| All | 82.49 | 79.13 | 80.78 | 80.91 | 73.86 | 77.22 | 82.16 | 74.35 | 78.06 |
| Body | 85.17 | 81.20 | 83.14 | 83.62 | 80.13 | 81.84 | 83.53 | 80.21 | 81.84 |
| Disease | 80.45 | 82.16 | 81.30 | 81.96 | 82.67 | 82.31 | 79.26 | 81.68 | 80.45 |
| Symptom | 78.19 | 60.84 | 68.43 | 74.54 | 61.73 | 67.53 | 76.26 | 61.52 | 68.10 |
| Medicine | 82.31 | 79.62 | 80.94 | 80.63 | 75.84 | 78.16 | 84.57 | 78.39 | 81.36 |
| Treatment | 76.86 | 67.25 | 71.73 | 75.24 | 63.59 | 68.93 | 76.61 | 61.82 | 68.42 |
| Check | 75.14 | 65.53 | 70.01 | 74.50 | 67.26 | 70.70 | 73.48 | 63.72 | 68.25 |

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy value increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. This is possibly due to the similar relationship in the KB between the mentions within the specific window. QCV essentially uses context similarity for linking; therefore, the noise and lack of information in the context reduce its linking performance. unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct the misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits some target entities that are fully different in surface form, reducing the linking recall.
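The recall limitation of fuzzy string matching can be sketched as follows. This is only an illustration: it assumes a character-overlap similarity (Python's `difflib.SequenceMatcher`) as a stand-in for the name similarity, and a cutoff of 0.5 borrowed from the threshold α; the function name and the tiny KB are hypothetical.

```python
from difflib import SequenceMatcher

ALPHA = 0.5  # assumed cutoff on name similarity


def generate_candidates(mention, kb_names, alpha=ALPHA):
    """Return (name, score) pairs whose surface similarity to `mention` reaches alpha."""
    scored = []
    for name in kb_names:
        score = SequenceMatcher(None, mention, name).ratio()
        if score >= alpha:
            scored.append((name, score))
    # Most similar candidates first.
    return sorted(scored, key=lambda pair: -pair[1])


# Made-up KB entries: myelofibrosis, osteomyelitis, splenomegaly.
kb = ["骨髓纤维化", "骨髓炎", "脾肿大"]
close = generate_candidates("骨髓纤维", kb)
missed = generate_candidates("MF", kb)
print([name for name, _ in close], missed)
```

The mention "骨髓纤维" shares characters with two KB names and therefore receives candidates, whereas the abbreviation "MF" shares no characters with its target "骨髓纤维化" and receives none. A purely surface-based generator cannot recover such targets, which is the omission described above.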

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results of an experiment conducted with our recognized entities. Compared to the above linking results, both the precision and recall show some decline. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, and F1 for the check, body, disease, treatment, medicine, and symptom categories. Note: the cylinder with bias represents unMER, and the other cylinder represents BM-NER.

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, and F1 for the body, disease, check, medicine, symptom, and treatment categories. Note: the cylinder with bias represents unMER, and the other cylinder represents Dic-CRF.


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described an unsupervised framework for recognizing and linking medical entities from Chinese online medical text, namely, unMERL. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text with both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings Amia Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

Table 6: Experimental results of unMERL on the corpus (%). P: precision; R: recall; A: accuracy.

| Corpus | P | R | F1 | A |
|---|---|---|---|---|
| Family-doctor | 82.64 | 73.26 | 77.67 | 83.23 |
| Muzhi-doctor | 83.37 | 74.41 | 78.64 | 83.48 |
| Qiuyi | 82.15 | 72.79 | 77.19 | 82.05 |

Figure 8: Experimental results of unMEL versus QCV on the corpus: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi. Each panel plots precision, recall, F1, and accuracy for QCV and unMEL. Note: the cylinder with bias represents unMEL, and the other cylinder represents QCV.


[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


mentions to the correct entries in the KB, whose statistics are shown in Table 4. The sign “NIL” denotes the ME mentions without linking entities in the KB.

4.2. Experimental Evaluation

4.2.1. Comparative Methods. To thoroughly validate the effectiveness of unMERL, we compare our proposed methods with representative state-of-the-art methods in the recognition and linking modules, respectively. For ME recognition, we select BM-NER [8] and bubble-bootstrapping [10], which are unsupervised methods, as well as the supervised method of [18], which we name Dic-CRF to distinguish it, as the comparative methods. In particular, for the BM-NER method, we use the Stanford parser (https://nlp.stanford.edu/software/lex-parser.html) for chunking. The seed terms are taken from our built dictionary. For the Dic-CRF method, we split 500 records into two subsets: two-thirds for training and one-third for testing. In ME linking, we select QCV (a language-independent and unsupervised method) [23] as a comparative method. In addition, we use the same seeds as in [10] for the bubble-bootstrapping method, the same features as in [18] together with our built dictionary for the Dic-CRF method, and the anchors in the KB to build a KB graph for the QCV method.

4.2.2. Measuring Methods. We use P (precision), R (recall), and F1 to measure performance. P is the fraction of correct objects among all objects acquired by the method. R is the fraction of the valid objects in the corpus that the method acquires correctly. F1 is defined as $2 \times P \times R / (P + R)$. In addition, we use “accuracy” to measure the overall linking accuracy, as shown in (11). In particular, $|S_{link}|$ is the number of ME mentions that the method links to the correct entities in the KB, $|S_{NIL}|$ is the number of ME mentions that the method correctly leaves unlinked (NIL), and $|T|$ is the number of ME mentions in the corpus.

$$\text{Accuracy} = \frac{|S_{link}| + |S_{NIL}|}{|T|} \times 100\% \qquad (11)$$
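The measures above can be written down directly. The following is a minimal sketch; the function names and the example counts are hypothetical and are not figures from our experiments.

```python
def precision_recall_f1(n_correct, n_acquired, n_valid):
    """Return (P, R, F1) as fractions in [0, 1].

    n_correct:  correct objects acquired by the method
    n_acquired: all objects acquired by the method
    n_valid:    valid objects in the corpus
    """
    p = n_correct / n_acquired if n_acquired else 0.0
    r = n_correct / n_valid if n_valid else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def linking_accuracy(n_link_correct, n_nil_correct, n_mentions):
    """Equation (11): (|S_link| + |S_NIL|) / |T| x 100, as a percentage."""
    return (n_link_correct + n_nil_correct) / n_mentions * 100


# Hypothetical counts: 80 correct out of 100 acquired, 120 valid in the corpus.
p, r, f1 = precision_recall_f1(80, 100, 120)
acc = linking_accuracy(50, 30, 100)
print(p, r, f1, acc)
```

Because accuracy also credits mentions that are correctly left as NIL, it can exceed the linking precision and recall on corpora where many mentions have no entry in the KB.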

Figure 3: Example of linking the ME mention “NS”. The figure relates the ME mention to its candidate entities NS (nervous system), NS (nephrotic syndrome), and NS (normal saline), marks the target entity, the collaborator-linking entities, and the description content, and annotates them with scores for C (context similarity), P (entity popularity), and S (semantic correlation).

Table 4: Statistics of the corpus.

| Corpus | MEs | MEs linking to KB | NIL |
|---|---|---|---|
| Family-doctor | 2531 | 1524 | 1007 |
| Muzhi-doctor | 1876 | 1096 | 780 |
| Qiuyi | 2189 | 1201 | 988 |

4.3. Experimental Results and Discussion

To simulate ME recognition and linking tasks in an open environment (note that the experimental data has low coverage for real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3; that is, if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. In particular, the threshold α is used for the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.
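How the two recognition thresholds interact can be sketched as follows. This section does not restate how name similarity and text distance are computed, so the sketch assumes that both are based on character-level edit distance; that choice, the function names, and the example terms are illustrative assumptions rather than the definitions used by unMERL.

```python
THETA = 0.5  # name-similarity threshold (value from the text)
DELTA = 3    # text-distance threshold (value from the text)


def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed text distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def name_similarity(a, b):
    """Assumed similarity: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def normalize_candidate(candidate, medical_terms):
    """Output the closest medical term if it passes both thresholds, else the candidate."""
    best = None
    for term in medical_terms:
        d = edit_distance(candidate, term)
        if name_similarity(candidate, term) >= THETA and d < DELTA:
            if best is None or d < best[0]:
                best = (d, term)
    return best[1] if best else candidate


# Made-up terms: amoxicillin capsule and cefixime; the first candidate is misspelled.
terms = ["阿莫西林胶囊", "头孢克肟"]
fixed = normalize_candidate("阿莫西林胶襄", terms)
unchanged = normalize_candidate("手脚无力", terms)
print(fixed, unchanged)
```

Under these assumptions, a candidate that differs from a known term by one character is replaced by that term, while a candidate with no close term is passed through unchanged for later classification.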

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we show the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and online detection for all datasets. Recall improves noticeably after the online detection process, which shows that online detection is effective in overcoming the coverage limitation of the dictionary-based method. However, the precision is limited, mainly because some irrelevant terms in the candidate set are not filtered out by the online detection process. Therefore, in the entity classification stage, we add a filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all medical entities have been extracted correctly from the text and that our task is to classify them into the predefined categories. Table 5 presents the classification results on each corpus. The overall performance reaches 81.85% precision and 75.84% recall. The recall is lower because, when filtering the nonmedical entities, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom,” “treatment,” and “check” is somewhat low. One possible reason is that these entities are mostly classified based on the context signature similarity, and the lack of identifying information in the context reduces the similarity score, thus impacting the classification performance.

Figure 4: Experimental results after offline detection and online detection on the corpus (%), in panels (a), (b), and (c). Each panel plots precision, recall, and F1 for offline detection and online detection.


(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach on the symptom category only. This is because we only acquired the seeds of the symptom category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER's precision is slightly low. One possible reason is that the symptomatic entity mentions are diverse, resulting in low coverage in the offline LRs; they are therefore mainly recognized by the online detection method. However, the combined mentions produce diverse search results from which it is difficult to get a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its description, which reduces the performance of pattern matching.

Figure 6 shows the experimental results of unMERcompared with the BM-NER approach Obviously unMERoutperforms BM-NER in both precision and recall The valueof F1 of unMER increases 2612 2752 and 2578 onthree corpora The reasons are as follows (1) The BM-NERapproach uses a noun phrase chunker to extract candidateentities which does not consider the nested entities therebyreducing the recall In addition the chunker utilizes a com-mon NLP tool which had poor recognition performancefor the medical entity boundary (2) The IDF filter removesmany common medical entities (3) We exploit a distributedword embedding approach to acquire the word vector whichwell considers the semantic similarity between words thanthe TF-IDF algorithm of BM-NER (4) Our built dictionarycontains many incorrect seed categories and this resultedin semantic deviation for the BM-NER approach reducingthe classification performance

Figure 7 shows the experimental results of unMER com-pared with Dic-CRF on each corpus Note that for the bodycategory we do not have the features of Dic-CRF and hencedo not present its measuring result On three corpora the F1

value of unMER increases 1501 1368 and 1268 thanDic-CRF approach respectively By analyzing the experi-ments we find that the high recall of unMERmainly dependson the online detection process which demonstrates thevalidity of using a search engine for recognizing medical enti-ties However Dic-CRF uses a medical dictionary for wordsegmentation this can easily lead to incorrect segmentationespecially for the combined entities In addition the definedfeatures have low coverage in all entity types which is alsoa reason for the low recall Moreover the informal descrip-tion of the online medical text also reduces the recognitionperformance of the CRF model In terms of precisionunMER yields comparative results and even exceeds Dic-CRF in some categories This is due to our combination ofmultiple offline LRs thereby increasing the coverage ofmedical entities Moreover unMER has good recognitionperformance in the nested entities

432 Medical Entity Linking Figure 8 shows the linkingresults of our approach (named ldquounMELrdquo) compared withthe QCV approach on each corpus To evaluate the linking

Muzhi-doctor

Precision

Recall

F1

Family-doctor Qiuyi0

20406080

100120140160180200220240260

Figure 5 Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note the cylinderwith bias represents unMER and the other cylinder representsbubble-bootstrapping)

Table 5 Entity classification results on the corpus ()

Entity categoryFamily-doctor Muzhi-doctor Qiuyi

P R F1 P R F1 P R F1All 8249 7913 8078 8091 7386 7722 8216 7435 7806

Body 8517 8120 8314 8362 8013 8184 8353 8021 8184

Disease 8045 8216 8130 8196 8267 8231 7926 8168 8045

Symptom 7819 6084 6843 7454 6173 6753 7626 6152 6810

Medicine 8231 7962 8094 8063 7584 7816 8457 7839 8136

Treatment 7686 6725 7173 7524 6359 6893 7661 6182 6842

Check 7514 6553 7001 7450 6726 7070 7348 6372 6825

10 Journal of Healthcare Engineering

approach on its own we conduct an experiment with thestandard entity boundaries for all medical entities in thecorpus Assume that all entities have been extracted correctlyfrom text and our task is only to link them to the correctentities in the KB Compared to the QCV approach the F1value of unMEL increases 639 667 and 581 and theaccuracy value increases 603 46 and 554 on eachcorpus respectively This is possibly due to the similarrelationship in the KB between the mentions within thespecific window QCV virtually uses the context similarityfor linking Therefore the noise and lack of information inthe context reduce the linking performance HoweverunMEL alleviates the restriction by extracting the relevantcontext information and using semantic correlation More-over in the recognition module we modify the misspelled

ME mentions which help link to the correct entitiesNevertheless unMEL utilizes the fuzzy string matchingto generate candidate entities which omits some targetentities that are fully different in the surface form reducingthe linking recall

433 Overall System Performance To evaluate the overallperformance of our framework (unMERL) Table 6 showsthe linking results by conducting an experiment with ourrecognized entities Compared to the above linking resultsboth the precision and recall show some decline The reasonis that unMERL obtains some inexact entities in the bound-ary detection step In addition unMERL removes somemedical entities when filtering the nonmedical terms in theclassification step

Family-doctor

PrecisionRecallF1

020406080

100120140160180200220240260

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(a)

PrecisionRecallF1

020406080

100120140160180200220240260

Muzhi-doctor

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(b)

PrecisionRecallF1

020406080

100120140160180200220240260

Qiuyi

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(c)

Figure 6 Experimental results of unMER versus BM-NER on the corpus (note the cylinder with bias represents unMER and the othercylinder represents BM-NER)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Family-doctor

(a)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Muzhi-doctor

(b)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Qiuyi

(c)

Figure 7 Experimental results of unMER versus Dic-CRF on the corpus (note the cylinder with bias represents unMER and the othercylinder represents Dic-CRF)

11Journal of Healthcare Engineering

5 Conclusions

Medical entity recognition and linking are challenging tasksin Chinese natural language processing In this paper wehave described an unsupervised framework for recognizingand linking medical entities from Chinese online medicaltext namely unMERL To the best of our knowledge thisis the first complete unsupervised solution for Chinese med-ical text with both medical entity recognition and linking Ithas considerable value in many applications such as medicalKB construction and expansion semantic comprehension ofmedical text and medical QampA systems Experimental evi-dences show that unMERL consistently outperforms currentapproaches In addition due to its unsupervised nature andlanguage independence unMERL has good generalizability

In the future we will improve unMERL in the followingways Firstly we will improve the online detection approachby adding in-depth textual analysis in extracting medicalterms from the search results Secondly we will improvethe linking approach by introducing semantic analysis

Disclosure

The authors alone are responsible for the content and writingof the paper

Conflicts of Interest

The authors report no conflicts of interest

Acknowledgments

This work is supported in part by the National BasicResearch and Development Program (2016YFB0800303)the National Key Fundamental Research and DevelopmentProgram of China (2016QY03D0601 2016QY03D0603) theNational Natural Science Foundation of China (61502517)

References

[1] C Friedman P O Alderson J H M Austin J J Cimino andS B Johnson ldquoA general natural-language text processor forclinical radiologyrdquo Journal of the American Medical Informat-ics Association vol 1 no 2 pp 161ndash174 1994

[2] T C Rindflesch L Tanabe J N Weinstein and L HunterldquoEdgar extraction of drugs genes and relations from thebiomedical literaturerdquo in Proceedings of the Pacific Symposiumpp 517ndash528 Honolulu Hawaii USA 2000

[3] A R Aronson ldquoEffective mapping of biomedical text to theUMLS Metathesaurus the MetaMap programrdquo ProceedingsAmia Symposium vol 2001 no 1 p 17 2001

[4] S Kraus C Blake and S L West ldquoInformation extractionfrom medical notesrdquo Opening Schools for All vol 13 pp 95ndash103 2007

[5] B L Humphreys D A B Lindberg H M Schoolman andG Octo Barnett ldquoThe unified medical language system aninformatics research collaborationrdquo Journal of the AmericanMedical Informatics Association vol 32 no 4 p 281 1993

[6] C J Mcdonald J M Overhage W M Tierney et al ldquoTheRegenstrief medical record system a quarter century experi-encerdquo International Journal of Medical Informatics vol 54no 3 pp 225ndash253 1999

[7] K Donnelly ldquoSNOMED-CT the advanced terminology andcoding system for eHealthrdquo Studies in Health Technology ampInformatics vol 121 no 121 p 279 2006

[8] S Zhang and N Elhadad ldquoUnsupervised biomedical namedentity recognition experiments with clinical and biologicaltextsrdquo Journal of Biomedical Informatics vol 46 no 6pp 1088ndash1098 2013

Table 6 Experimental results of unMERL on the corpus ()

CorpusunMERL

P R F1 A

Family-doctor 8264 7326 7767 8323

Muzhi-doctor 8337 7441 7864 8348

Qiuyi 8215 7279 7719 8205

Family-doctor

PrecisionRecall

AccuracyF1

QCV unMEL50

60

70

80

90

100

(a)

Muzhi-doctor

PrecisionRecall

AccuracyF1

QCV unMEL50

60

70

80

90

100

(b)

Qiuyi

PrecisionRecall

AccuracyF1

QCV unMEL50

60

70

80

90

100

(c)

Figure 8 Experimental results of unMEL versus QCV on the corpus (note the cylinder with bias represents unMEL and the other cylinderrepresents QCV)

12 Journal of Healthcare Engineering

[9] D Movshovitz-Attias and W W Cohen ldquoBootstrappingbiomedical ontologies for scientific text using NELLrdquo inBioNLP 12 Proceedings of the 2012 Workshop on Biomedi-cal Natural Language Processing pp 11ndash19 MontrealCanada 2012

[10] L Z Feng Automatic Approaches to Develop Large-scale TCMElectronic Medical Record Corpus for Named Entity Recogni-tion Tasks Beijing Jiaotong University 2015

[11] E Riloff and R Jones ldquoLearning dictionaries for informationextraction by multi-level bootstrappingrdquo in AAAI 99IAAI99 Proceedings of the sixteenth national conference on Artificialintelligence and the eleventh Innovative applications of artificialintelligence conference innovative applications of artificial intel-ligence pp 474ndash479 Menlo Park CA USA 1999

[12] G D Zhou J Zhang J Su D Shen and C L Tan ldquoRecogniz-ing names in biomedical texts a machine learning approachrdquoBioinformatics vol 20 no 7 pp 1178ndash1190 2004

[13] Y Wang Z Yu L Chen et al ldquoSupervised methods forsymptom name recognition in free-text clinical records oftraditional Chinese medicine an empirical studyrdquo Journal ofBiomedical Informatics vol 47 no 2 pp 91ndash104 2014

[14] Y F Lin T H Tsai W C Chou K-P Wu T-Y Sung andW-L Hsu ldquoA maximum entropy approach to biomedicalnamed entity recognitionrdquo in BIOKDD04 Proceedings of the4th International Conference on Data Mining in Bioinformat-ics pp 56ndash61 London UK 2004

[15] Y Wang and J Patrick ldquoCascading classifiers for named entityrecognition in clinical notesrdquo in WBIE 09 Proceedings of theWorkshop on Biomedical Information Extraction BorovetsBulgaria 2009

[16] B Settles ldquoBiomedical named intity recognition using condi-tional random fields and rich feature setsrdquo in JNLPBA 04Proceedings of the International Joint Workshop on NaturalLanguage Processing in Biomedicine and its Applicationspp 104ndash107 Geneva Switzerland 2004

[17] J Liang X Xian X He et al ldquoA novel approach towardsmedical entity recognition in Chinese clinical textrdquo Journalof Healthcare Engineering vol 2017 Article ID 489896316 pages 2017

[18] Y Su J Liu and Y Huang ldquoEntity recognition research inonline medical textsrdquo Acta Scientiarum Naturalium Universi-tatis Pekinensis vol 52 no 1 pp 1ndash9 2016

[19] J Lei B Tang X Lu K Gao M Jiang and H Xu ldquoA compre-hensive study of named entity recognition in Chinese clinicaltextrdquo Journal of the AmericanMedical Informatics Associationvol 21 no 5 pp 808ndash814 2014

[20] C Y Qu Research of Named Entity Recognition forChinese Electronic Medical Records Harbin Institute ofTechonology 2015

[21] G Glava ldquoTAKELAB medical information extraction andlinking with MINERALrdquo in Proceedings of the 9th Interna-tional Workshop on Semantic Evaluation (SemEval 2015)pp 389ndash393 Denver CO USA 2015

[22] J G Zheng D Howsmon B Zhang et al ldquoEntity linking forbiomedical literaturerdquo BMC Medical Informatics and DecisionMaking vol 15 no S1 2015

[23] H Wang J G Zheng X Ma P Fox and H Ji ldquoLanguage anddomain independent entity linking with quantified collectivevalidationrdquo in Proceedings of the 2015 Conference on EmpiricalMethods in Natural Language Processing pp 695ndash704 LisbonPortugal 2015

[24] D Nadeau and S Sekine ldquoA survey of named entity recogni-tion and classificationrdquo Lingvisticae Investigationes vol 30no 1 pp 3ndash26 2007

[25] B L Li E S Li and Y H Wei ldquoFunction design implemen-tation and applications of Chinese SNOMED 34rdquo in NationalConference on Medical Informatics 1999

[26] T Mikolov K Chen G Corrado and J Dean ldquoEfficientestimation of word representations in vector spacerdquo 2013httpsarxivorgabs13013781

[27] W Shen J Wang and J Han ldquoEntity linking with aknowledge base issues techniques and solutionsrdquo IEEETransactions on Knowledge and Data Engineering vol 27no 2 pp 443ndash460 2015


the experimental data has low coverage for real data), we randomly select 30 records for learning all the above-mentioned threshold values. In the ME recognition module, the threshold θ for name similarity between a candidate and a medical term in the LRs is experimentally set to 0.5. The threshold δ for the text distance constraint is experimentally set to 3: if the text distance between a medical term and a candidate is lower than 3, the medical term is output instead of the candidate. In the ME linking module, the thresholds α and β are experimentally set to 0.5 and 0.47, respectively. The threshold α is applied to the name similarity between a mention and an entity in the KB, and the threshold β is used for category consistency.
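The text distance rule can be sketched as follows. This is only an illustration under a stated assumption: the section does not define "text distance," so Levenshtein edit distance is used here as a stand-in, and the names `levenshtein`, `resolve`, and `DELTA`, as well as the example strings, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


DELTA = 3  # text distance threshold reported in the text


def resolve(candidate: str, medical_term: str, delta: int = DELTA) -> str:
    """Output the medical term when it is closer than delta to the candidate.

    Assumption: "text distance" is modelled as edit distance.
    """
    if levenshtein(candidate, medical_term) < delta:
        return medical_term
    return candidate


# Hypothetical misspelled candidate, one character away from the term.
print(resolve("手脚无里", "手脚无力"))
# Unrelated strings stay unchanged because the distance is not below delta.
print(resolve("头痛", "骨髓纤维化"))
```

Under this assumption, a candidate is replaced only when a known medical term is within two edits of it, which is one way a misspelled mention could be normalized.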

4.3.1. Medical Entity Recognition. As mentioned above, our ME recognition module is divided into two stages: boundary detection and entity classification. To evaluate the effectiveness of our proposed methods fully, we report the experimental results of each stage in detail.

(1) Boundary Detection. To validate the effectiveness of online detection, Figure 4 presents the experimental results after offline detection and after online detection for all datasets. Recall improves noticeably after the online detection process, which indicates that online detection is effective in overcoming the coverage limitation of the dictionary-based method. Precision, however, remains limited, mainly because some irrelevant terms in the candidate set are not filtered out by the online detection process. Therefore, in the entity classification stage, we add a filtering threshold to remove these terms.

(2) Entity Classification. To evaluate the entity classification method on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all medical entities have been extracted correctly from the text and that our task is only to classify them into the predefined categories. Table 5 presents the classification results on each corpus. The overall performance reaches 81.85% precision and 75.84% recall. Recall is lower because, when nonmedical entities are filtered, some medical entities are also removed by the filtering threshold. The performance on the target categories “symptom,” “treatment,” and “check” is somewhat low. One possible reason is that these entities are mostly classified based on context signature similarity, and a lack of identifying information in the context lowers the similarity score and thus the classification performance.
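The precision (P), recall (R), and F1 values reported in this section are related by the standard harmonic-mean definition of F1. As a small check, the sketch below recomputes F1 from two P/R pairs listed in Table 5 for the Family-doctor corpus; the function name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "All" row, Family-doctor corpus: P = 82.49, R = 79.13
print(round(f1_score(82.49, 79.13), 2))  # 80.78

# "Symptom" row, Family-doctor corpus: P = 78.19, R = 60.84
print(round(f1_score(78.19, 60.84), 2))  # 68.43
```

Both values agree with the F1 column of Table 5, and the second shows how the low recall of the symptom category pulls its F1 well below its precision.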

Figure 4: Experimental results after offline detection and online detection on the corpus (%). [Three bar-chart panels, (a) to (c), each plotting precision, recall, and F1 on a 50 to 100 scale for offline detection and online detection.]

(3) Overall Recognition Performance. We compare the overall performance of our recognition approach (named “unMER”) with the unsupervised and supervised methods described above in Figures 5–7.

Figure 5 shows the experimental results of unMER compared with the bubble-bootstrapping approach. The comparison is restricted to the symptom category because we only acquired seeds of that category for bubble-bootstrapping. The results show that unMER significantly outperforms bubble-bootstrapping in terms of recall. However, unMER's precision is slightly lower. One possible reason is that symptom mentions are diverse, resulting in low coverage in the offline LRs; they are therefore mainly recognized by the online detection method. Combined mentions, however, produce diverse search results, from which it is difficult to obtain a complete term. For example, for the mention “手脚无力” (powerless hands and feet), the returned results contain “手脚无力” (powerless hands and feet), “四肢无力” (powerless limbs), and “手脚发软” (limp hands and feet). After online detection, the acquired entity is “手脚” (hands and feet) or “无力” (powerless). In addition, the low recall of the bubble-bootstrapping approach arises because the online Q&A text lacks normalization in its descriptions, which reduces the performance of pattern matching.
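The fragmentation effect in this example can be illustrated with a toy count. The sketch below is not the online detection procedure of unMER, which is not specified in this section; it only assumes, for illustration, that a term is favored when it appears in more of the returned results. The function `substring_support` is hypothetical.

```python
from collections import Counter
from typing import List


def substring_support(mention: str, results: List[str], min_len: int = 2) -> Counter:
    """Count, for each substring of the mention, how many results contain it."""
    support = Counter()
    for start in range(len(mention)):
        for end in range(start + min_len, len(mention) + 1):
            piece = mention[start:end]
            support[piece] = sum(piece in result for result in results)
    return support


# Search results quoted in the example above.
results = ["手脚无力", "四肢无力", "手脚发软"]
support = substring_support("手脚无力", results)

for piece, count in support.most_common():
    print(piece, count)
```

With these three results, the fragments “手脚” and “无力” each appear in two results, while the complete mention “手脚无力” appears in only one, which is consistent with the observation that a fragment rather than the full term is acquired.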

Figure 6 shows the experimental results of unMER compared with the BM-NER approach. unMER outperforms BM-NER in both precision and recall, and its F1 value increases by 26.12%, 27.52%, and 25.78% on the three corpora. The reasons are as follows. (1) The BM-NER approach uses a noun phrase chunker to extract candidate entities, which does not consider nested entities and thereby reduces recall. In addition, the chunker relies on a common NLP tool, which had poor recognition performance for medical entity boundaries. (2) The IDF filter removes many common medical entities. (3) We exploit a distributed word embedding approach to acquire word vectors, which captures the semantic similarity between words better than the TF-IDF algorithm of BM-NER. (4) Our built dictionary contains many incorrect seed categories, which resulted in semantic deviation for the BM-NER approach and reduced its classification performance.
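Point (3) relies on comparing words through dense vectors. A minimal sketch of such a comparison is given below, assuming cosine similarity as the measure (a common choice for word embeddings; this section does not state which measure unMER uses). The three-dimensional vectors are made up for illustration and are not trained embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real embeddings would have hundreds of dimensions.
vectors = {
    "headache": [0.9, 0.1, 0.0],
    "migraine": [0.8, 0.2, 0.1],
    "film": [0.0, 0.1, 0.9],
}

related = cosine_similarity(vectors["headache"], vectors["migraine"])
unrelated = cosine_similarity(vectors["headache"], vectors["film"])
print(related > unrelated)  # True
```

Two words that never co-occur in the same document can still receive a high score in this setting if their vectors point in similar directions, which a purely term-frequency-based representation cannot capture.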

Figure 7 shows the experimental results of unMER compared with Dic-CRF on each corpus. Note that for the body category, we do not have the features of Dic-CRF and hence do not present its result. On the three corpora, the F1 value of unMER is higher than that of the Dic-CRF approach by 15.01%, 13.68%, and 12.68%, respectively. By analyzing the experiments, we find that the high recall of unMER mainly depends on the online detection process, which demonstrates the validity of using a search engine for recognizing medical entities. In contrast, Dic-CRF uses a medical dictionary for word segmentation, which can easily lead to incorrect segmentation, especially for combined entities. In addition, the defined features have low coverage in all entity types, which is also a reason for its low recall. Moreover, the informal descriptions in online medical text reduce the recognition performance of the CRF model. In terms of precision, unMER yields comparable results and even exceeds Dic-CRF in some categories. This is due to our combination of multiple offline LRs, which increases the coverage of medical entities. Moreover, unMER performs well on nested entities.

Figure 5: Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note: the cylinder with bias represents unMER, and the other cylinder represents bubble-bootstrapping). [Bar chart over the Family-doctor, Muzhi-doctor, and Qiuyi corpora; series: precision, recall, and F1.]

Table 5: Entity classification results on the corpus (%).

Entity category   Family-doctor            Muzhi-doctor             Qiuyi
                  P      R      F1         P      R      F1         P      R      F1
All               82.49  79.13  80.78      80.91  73.86  77.22      82.16  74.35  78.06
Body              85.17  81.20  83.14      83.62  80.13  81.84      83.53  80.21  81.84
Disease           80.45  82.16  81.30      81.96  82.67  82.31      79.26  81.68  80.45
Symptom           78.19  60.84  68.43      74.54  61.73  67.53      76.26  61.52  68.10
Medicine          82.31  79.62  80.94      80.63  75.84  78.16      84.57  78.39  81.36
Treatment         76.86  67.25  71.73      75.24  63.59  68.93      76.61  61.82  68.42
Check             75.14  65.53  70.01      74.50  67.26  70.70      73.48  63.72  68.25

4.3.2. Medical Entity Linking. Figure 8 shows the linking results of our approach (named “unMEL”) compared with the QCV approach on each corpus. To evaluate the linking approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus; that is, we assume that all entities have been extracted correctly from the text and that our task is only to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. This is possibly due to the similar relationships in the KB between the mentions within the specific window. QCV essentially uses context similarity for linking; therefore, noise and a lack of information in the context reduce its linking performance. In contrast, unMEL alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits some target entities whose surface forms are completely different from the mention, reducing the linking recall.
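The recall limitation of fuzzy string matching can be seen in a small sketch. It assumes, for illustration only, that name similarity is the ratio computed by Python's `difflib.SequenceMatcher` and that a candidate is kept when its score reaches the threshold α = 0.5; the actual similarity function of unMEL is not given in this section, and the three-entry KB below is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

ALPHA = 0.5  # name similarity threshold reported for the linking module


def candidate_entities(mention: str, kb_names: List[str], alpha: float = ALPHA) -> List[str]:
    """Return KB names whose string similarity to the mention is at least alpha.

    Assumption: similarity is SequenceMatcher.ratio(), a stand-in measure.
    """
    candidates = []
    for name in kb_names:
        score = SequenceMatcher(None, mention, name).ratio()
        if score >= alpha:
            candidates.append(name)
    return candidates


# Hypothetical KB: myelofibrosis, splenomegaly, influenza.
kb_names = ["骨髓纤维化", "脾肿大", "流行性感冒"]

print(candidate_entities("骨髓纤维", kb_names))  # ['骨髓纤维化']
print(candidate_entities("MF", kb_names))        # []
```

The abbreviation “MF” shares no characters with “骨髓纤维化”, so the target entity never enters the candidate set, and no later disambiguation step can recover it.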

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results obtained by running the linking experiment on our recognized entities. Compared to the above linking results, both precision and recall show some decline. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents BM-NER). [Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, and F1 for the categories check, body, disease, treatment, medicine, and symptom.]

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus (note: the cylinder with bias represents unMER, and the other cylinder represents Dic-CRF). [Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, and F1 for the categories body, disease, check, medicine, symptom, and treatment.]

5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described unMERL, an unsupervised framework for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms the approaches it was compared with. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. First, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Second, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Acknowledgments

This work is supported in part by the National Basic Research and Development Program (2016YFB0800303), the National Key Fundamental Research and Development Program of China (2016QY03D0601, 2016QY03D0603), and the National Natural Science Foundation of China (61502517).

Table 6: Experimental results of unMERL on the corpus (%).

Corpus          P      R      F1     A
Family-doctor   82.64  73.26  77.67  83.23
Muzhi-doctor    83.37  74.41  78.64  83.48
Qiuyi           82.15  72.79  77.19  82.05

Figure 8: Experimental results of unMEL versus QCV on the corpus (note: the cylinder with bias represents unMEL, and the other cylinder represents QCV). [Panels: (a) Family-doctor, (b) Muzhi-doctor, (c) Qiuyi; each plots precision, recall, accuracy, and F1 on a 50 to 100 scale for QCV and unMEL.]

References

[1] C. Friedman, P. O. Alderson, J. H. M. Austin, J. J. Cimino, and S. B. Johnson, “A general natural-language text processor for clinical radiology,” Journal of the American Medical Informatics Association, vol. 1, no. 2, pp. 161–174, 1994.

[2] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter, “Edgar: extraction of drugs, genes and relations from the biomedical literature,” in Proceedings of the Pacific Symposium, pp. 517–528, Honolulu, Hawaii, USA, 2000.

[3] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program,” Proceedings AMIA Symposium, vol. 2001, no. 1, p. 17, 2001.

[4] S. Kraus, C. Blake, and S. L. West, “Information extraction from medical notes,” Opening Schools for All, vol. 13, pp. 95–103, 2007.

[5] B. L. Humphreys, D. A. B. Lindberg, H. M. Schoolman, and G. Octo Barnett, “The unified medical language system: an informatics research collaboration,” Journal of the American Medical Informatics Association, vol. 32, no. 4, p. 281, 1993.

[6] C. J. Mcdonald, J. M. Overhage, W. M. Tierney et al., “The Regenstrief medical record system: a quarter century experience,” International Journal of Medical Informatics, vol. 54, no. 3, pp. 225–253, 1999.

[7] K. Donnelly, “SNOMED-CT: the advanced terminology and coding system for eHealth,” Studies in Health Technology & Informatics, vol. 121, no. 121, p. 279, 2006.

[8] S. Zhang and N. Elhadad, “Unsupervised biomedical named entity recognition: experiments with clinical and biological texts,” Journal of Biomedical Informatics, vol. 46, no. 6, pp. 1088–1098, 2013.

[9] D. Movshovitz-Attias and W. W. Cohen, “Bootstrapping biomedical ontologies for scientific text using NELL,” in BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pp. 11–19, Montreal, Canada, 2012.

[10] L. Z. Feng, Automatic Approaches to Develop Large-scale TCM Electronic Medical Record Corpus for Named Entity Recognition Tasks, Beijing Jiaotong University, 2015.

[11] E. Riloff and R. Jones, “Learning dictionaries for information extraction by multi-level bootstrapping,” in AAAI '99/IAAI '99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pp. 474–479, Menlo Park, CA, USA, 1999.

[12] G. D. Zhou, J. Zhang, J. Su, D. Shen, and C. L. Tan, “Recognizing names in biomedical texts: a machine learning approach,” Bioinformatics, vol. 20, no. 7, pp. 1178–1190, 2004.

[13] Y. Wang, Z. Yu, L. Chen et al., “Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study,” Journal of Biomedical Informatics, vol. 47, no. 2, pp. 91–104, 2014.

[14] Y. F. Lin, T. H. Tsai, W. C. Chou, K.-P. Wu, T.-Y. Sung, and W.-L. Hsu, “A maximum entropy approach to biomedical named entity recognition,” in BIOKDD04: Proceedings of the 4th International Conference on Data Mining in Bioinformatics, pp. 56–61, London, UK, 2004.

[15] Y. Wang and J. Patrick, “Cascading classifiers for named entity recognition in clinical notes,” in WBIE '09: Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria, 2009.

[16] B. Settles, “Biomedical named entity recognition using conditional random fields and rich feature sets,” in JNLPBA '04: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 104–107, Geneva, Switzerland, 2004.

[17] J. Liang, X. Xian, X. He et al., “A novel approach towards medical entity recognition in Chinese clinical text,” Journal of Healthcare Engineering, vol. 2017, Article ID 4898963, 16 pages, 2017.

[18] Y. Su, J. Liu, and Y. Huang, “Entity recognition research in online medical texts,” Acta Scientiarum Naturalium Universitatis Pekinensis, vol. 52, no. 1, pp. 1–9, 2016.

[19] J. Lei, B. Tang, X. Lu, K. Gao, M. Jiang, and H. Xu, “A comprehensive study of named entity recognition in Chinese clinical text,” Journal of the American Medical Informatics Association, vol. 21, no. 5, pp. 808–814, 2014.

[20] C. Y. Qu, Research of Named Entity Recognition for Chinese Electronic Medical Records, Harbin Institute of Technology, 2015.

[21] G. Glavaš, “TAKELAB: medical information extraction and linking with MINERAL,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 389–393, Denver, CO, USA, 2015.

[22] J. G. Zheng, D. Howsmon, B. Zhang et al., “Entity linking for biomedical literature,” BMC Medical Informatics and Decision Making, vol. 15, no. S1, 2015.

[23] H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji, “Language and domain independent entity linking with quantified collective validation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 695–704, Lisbon, Portugal, 2015.

[24] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[25] B. L. Li, E. S. Li, and Y. H. Wei, “Function design, implementation and applications of Chinese SNOMED 3.4,” in National Conference on Medical Informatics, 1999.

[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, https://arxiv.org/abs/1301.3781.

[27] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: issues, techniques, and solutions,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 443–460, 2015.


(3) Overall Recognition Performance We compare the overallperformance of our recognition approach (named ldquounMERrdquo)with the unsupervised and supervised methods describedabove in Figures 5ndash7

Figure 5 shows the experimental results of unMERcompared with the bubble-bootstrapping approach This isbecause we only acquired the seeds of the symptom categoryfor bubble-bootstrapping The results show that unMERsignificantly outperforms bubble-bootstrapping in terms ofrecall However unMERrsquos precision is slightly low Onepossible reason for this is that the symptomatic entity men-tions are diverse resulting in low coverage in the offlineLRs Therefore they are mainly recognized by the onlinedetection method However the combined mentions pro-duce diverse search results from which it is difficult to get acomplete term For example for the mention ldquo手脚无力rdquo(powerless hands and feet) the returned results containldquo手脚无力rdquo (powerless hands and feet) ldquo四肢无力rdquo (power-less limbs) and ldquo手脚发软rdquo (limp hands and feet) Afteronline detection the acquired entity is ldquo手脚rdquo (hands andfeet) or ldquo无力rdquo (powerless) In addition the low recall ofthe bubble-bootstrapping approach is because the onlineQampA text lacks normalization in its description reducingthe performance of pattern matching

Figure 6 shows the experimental results of unMERcompared with the BM-NER approach Obviously unMERoutperforms BM-NER in both precision and recall The valueof F1 of unMER increases 2612 2752 and 2578 onthree corpora The reasons are as follows (1) The BM-NERapproach uses a noun phrase chunker to extract candidateentities which does not consider the nested entities therebyreducing the recall In addition the chunker utilizes a com-mon NLP tool which had poor recognition performancefor the medical entity boundary (2) The IDF filter removesmany common medical entities (3) We exploit a distributedword embedding approach to acquire the word vector whichwell considers the semantic similarity between words thanthe TF-IDF algorithm of BM-NER (4) Our built dictionarycontains many incorrect seed categories and this resultedin semantic deviation for the BM-NER approach reducingthe classification performance

Figure 7 shows the experimental results of unMER com-pared with Dic-CRF on each corpus Note that for the bodycategory we do not have the features of Dic-CRF and hencedo not present its measuring result On three corpora the F1

value of unMER increases 1501 1368 and 1268 thanDic-CRF approach respectively By analyzing the experi-ments we find that the high recall of unMERmainly dependson the online detection process which demonstrates thevalidity of using a search engine for recognizing medical enti-ties However Dic-CRF uses a medical dictionary for wordsegmentation this can easily lead to incorrect segmentationespecially for the combined entities In addition the definedfeatures have low coverage in all entity types which is alsoa reason for the low recall Moreover the informal descrip-tion of the online medical text also reduces the recognitionperformance of the CRF model In terms of precisionunMER yields comparative results and even exceeds Dic-CRF in some categories This is due to our combination ofmultiple offline LRs thereby increasing the coverage ofmedical entities Moreover unMER has good recognitionperformance in the nested entities

432 Medical Entity Linking Figure 8 shows the linkingresults of our approach (named ldquounMELrdquo) compared withthe QCV approach on each corpus To evaluate the linking

Muzhi-doctor

Precision

Recall

F1

Family-doctor Qiuyi0

20406080

100120140160180200220240260

Figure 5 Experimental results of unMER versus bubble-bootstrapping on the symptom category only (note the cylinderwith bias represents unMER and the other cylinder representsbubble-bootstrapping)

Table 5 Entity classification results on the corpus ()

Entity categoryFamily-doctor Muzhi-doctor Qiuyi

P R F1 P R F1 P R F1All 8249 7913 8078 8091 7386 7722 8216 7435 7806

Body 8517 8120 8314 8362 8013 8184 8353 8021 8184

Disease 8045 8216 8130 8196 8267 8231 7926 8168 8045

Symptom 7819 6084 6843 7454 6173 6753 7626 6152 6810

Medicine 8231 7962 8094 8063 7584 7816 8457 7839 8136

Treatment 7686 6725 7173 7524 6359 6893 7661 6182 6842

Check 7514 6553 7001 7450 6726 7070 7348 6372 6825

10 Journal of Healthcare Engineering

approach on its own we conduct an experiment with thestandard entity boundaries for all medical entities in thecorpus Assume that all entities have been extracted correctlyfrom text and our task is only to link them to the correctentities in the KB Compared to the QCV approach the F1value of unMEL increases 639 667 and 581 and theaccuracy value increases 603 46 and 554 on eachcorpus respectively This is possibly due to the similarrelationship in the KB between the mentions within thespecific window QCV virtually uses the context similarityfor linking Therefore the noise and lack of information inthe context reduce the linking performance HoweverunMEL alleviates the restriction by extracting the relevantcontext information and using semantic correlation More-over in the recognition module we modify the misspelled

ME mentions which help link to the correct entitiesNevertheless unMEL utilizes the fuzzy string matchingto generate candidate entities which omits some targetentities that are fully different in the surface form reducingthe linking recall

433 Overall System Performance To evaluate the overallperformance of our framework (unMERL) Table 6 showsthe linking results by conducting an experiment with ourrecognized entities Compared to the above linking resultsboth the precision and recall show some decline The reasonis that unMERL obtains some inexact entities in the bound-ary detection step In addition unMERL removes somemedical entities when filtering the nonmedical terms in theclassification step

Family-doctor

PrecisionRecallF1

020406080

100120140160180200220240260

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(a)

PrecisionRecallF1

020406080

100120140160180200220240260

Muzhi-doctor

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(b)

PrecisionRecallF1

020406080

100120140160180200220240260

Qiuyi

Chec

k

Body

Dise

ase

Trea

tmen

t

Med

icin

e

Sym

ptom

(c)

Figure 6 Experimental results of unMER versus BM-NER on the corpus (note the cylinder with bias represents unMER and the othercylinder represents BM-NER)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Family-doctor

(a)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Muzhi-doctor

(b)

020406080

100120140160180200220240260

Body

Dise

ase

Chec

k

Med

icin

e

Sym

ptom

Trea

tmen

t

PrecisionRecallF1

Qiuyi

(c)

Figure 7 Experimental results of unMER versus Dic-CRF on the corpus (note the cylinder with bias represents unMER and the othercylinder represents Dic-CRF)

11Journal of Healthcare Engineering

5 Conclusions

Medical entity recognition and linking are challenging tasksin Chinese natural language processing In this paper wehave described an unsupervised framework for recognizingand linking medical entities from Chinese online medicaltext namely unMERL To the best of our knowledge thisis the first complete unsupervised solution for Chinese med-ical text with both medical entity recognition and linking Ithas considerable value in many applications such as medicalKB construction and expansion semantic comprehension ofmedical text and medical QampA systems Experimental evi-dences show that unMERL consistently outperforms currentapproaches In addition due to its unsupervised nature andlanguage independence unMERL has good generalizability

In the future we will improve unMERL in the followingways Firstly we will improve the online detection approachby adding in-depth textual analysis in extracting medicalterms from the search results Secondly we will improvethe linking approach by introducing semantic analysis

Disclosure

The authors alone are responsible for the content and writingof the paper

Conflicts of Interest

The authors report no conflicts of interest

Acknowledgments

This work is supported in part by the National BasicResearch and Development Program (2016YFB0800303)the National Key Fundamental Research and DevelopmentProgram of China (2016QY03D0601 2016QY03D0603) theNational Natural Science Foundation of China (61502517)

References

[1] C Friedman P O Alderson J H M Austin J J Cimino andS B Johnson ldquoA general natural-language text processor forclinical radiologyrdquo Journal of the American Medical Informat-ics Association vol 1 no 2 pp 161ndash174 1994

[2] T C Rindflesch L Tanabe J N Weinstein and L HunterldquoEdgar extraction of drugs genes and relations from thebiomedical literaturerdquo in Proceedings of the Pacific Symposiumpp 517ndash528 Honolulu Hawaii USA 2000

[3] A R Aronson ldquoEffective mapping of biomedical text to theUMLS Metathesaurus the MetaMap programrdquo ProceedingsAmia Symposium vol 2001 no 1 p 17 2001

[4] S Kraus C Blake and S L West ldquoInformation extractionfrom medical notesrdquo Opening Schools for All vol 13 pp 95ndash103 2007

[5] B L Humphreys D A B Lindberg H M Schoolman andG Octo Barnett ldquoThe unified medical language system aninformatics research collaborationrdquo Journal of the AmericanMedical Informatics Association vol 32 no 4 p 281 1993

[6] C J Mcdonald J M Overhage W M Tierney et al ldquoTheRegenstrief medical record system a quarter century experi-encerdquo International Journal of Medical Informatics vol 54no 3 pp 225ndash253 1999

[7] K Donnelly ldquoSNOMED-CT the advanced terminology andcoding system for eHealthrdquo Studies in Health Technology ampInformatics vol 121 no 121 p 279 2006

[8] S Zhang and N Elhadad ldquoUnsupervised biomedical namedentity recognition experiments with clinical and biologicaltextsrdquo Journal of Biomedical Informatics vol 46 no 6pp 1088ndash1098 2013

Table 6 Experimental results of unMERL on the corpus ()

CorpusunMERL

P R F1 A

Family-doctor 8264 7326 7767 8323

Muzhi-doctor 8337 7441 7864 8348

Qiuyi 8215 7279 7719 8205

Figure 8: Experimental results of unMEL versus QCV on the corpus. Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi each plot precision, recall, accuracy, and F1 on a scale from 50 to 100 (note: the cylinder with bias represents unMEL and the other cylinder represents QCV).


approach on its own, we conduct an experiment with the standard entity boundaries for all medical entities in the corpus. That is, we assume that all entities have been extracted correctly from the text, so that the only task is to link them to the correct entities in the KB. Compared to the QCV approach, the F1 value of unMEL increases by 6.39%, 6.67%, and 5.81%, and the accuracy increases by 6.03%, 4.6%, and 5.54% on each corpus, respectively. This is possibly due to the similar relationship in the KB between the mentions within the specific window. QCV essentially uses context similarity for linking; therefore, noise and missing information in the context reduce its linking performance. unMEL, however, alleviates this restriction by extracting the relevant context information and using semantic correlation. Moreover, in the recognition module, we correct misspelled ME mentions, which helps link them to the correct entities. Nevertheless, unMEL uses fuzzy string matching to generate candidate entities, which omits some target entities whose surface forms are entirely different, reducing the linking recall.
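The recall limitation of fuzzy string matching can be illustrated with a minimal sketch. It is not the authors' implementation: the paper does not specify the matching algorithm here, so Python's standard `difflib.SequenceMatcher` is used as a stand-in, and the toy KB entries, the function name `generate_candidates`, and the 0.6 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Toy KB of entity names (hypothetical): myelofibrosis, splenomegaly, osteoporosis.
KB_NAMES = ["骨髓纤维化", "脾肿大", "骨质疏松"]


def generate_candidates(mention, kb_names, threshold=0.6):
    """Return KB names whose character-level similarity to the mention
    reaches the threshold, most similar first."""
    scored = [
        (SequenceMatcher(None, mention, name).ratio(), name)
        for name in kb_names
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# A mention that shares most characters with its target is retrieved.
print(generate_candidates("骨髓纤维", KB_NAMES))

# An abbreviation with a completely different surface form gets no candidates,
# so the correct entity can never be linked: this is the recall loss.
print(generate_candidates("MF", KB_NAMES))
```

In this toy setting the abbreviation "MF" shares no characters with any KB name, so the candidate list is empty; an alias table or a semantic retrieval step would be needed to recover such mentions.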

4.3.3. Overall System Performance. To evaluate the overall performance of our framework (unMERL), Table 6 shows the linking results obtained in an experiment with our recognized entities. Compared to the above linking results, both the precision and recall show some decline. The reason is that unMERL obtains some inexact entities in the boundary detection step. In addition, unMERL removes some medical entities when filtering the nonmedical terms in the classification step.

Figure 6: Experimental results of unMER versus BM-NER on the corpus. Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi each plot precision, recall, and F1 for the Check, Body, Disease, Treatment, Medicine, and Symptom categories (note: the cylinder with bias represents unMER and the other cylinder represents BM-NER).

Figure 7: Experimental results of unMER versus Dic-CRF on the corpus. Panels (a) Family-doctor, (b) Muzhi-doctor, and (c) Qiuyi each plot precision, recall, and F1 for the Body, Disease, Check, Medicine, Symptom, and Treatment categories (note: the cylinder with bias represents unMER and the other cylinder represents Dic-CRF).


5. Conclusions

Medical entity recognition and linking are challenging tasks in Chinese natural language processing. In this paper, we have described unMERL, an unsupervised framework for recognizing and linking medical entities in Chinese online medical text. To the best of our knowledge, this is the first complete unsupervised solution for Chinese medical text that covers both medical entity recognition and linking. It has considerable value in many applications, such as medical KB construction and expansion, semantic comprehension of medical text, and medical Q&A systems. Experimental evidence shows that unMERL consistently outperforms current approaches. In addition, due to its unsupervised nature and language independence, unMERL has good generalizability.

In the future, we will improve unMERL in the following ways. Firstly, we will improve the online detection approach by adding in-depth textual analysis when extracting medical terms from the search results. Secondly, we will improve the linking approach by introducing semantic analysis.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.



