Resolving Healthcare Forum Posts via Similar Thread Retrieval

Jason H.D. Cho, Parikshit Sondhi, Chengxiang Zhai, Bruce R. Schatz

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801

{hcho33, sondhi1, czhai, schatz}@illinois.edu

ABSTRACT

Web communities such as healthcare web forums serve as popular platforms for users to get their complex medical queries resolved. A typical forum thread contains a query in its first post, and a discussion around it in subsequent posts. However, many users do not receive satisfactory responses from other members in the community, leaving them dissatisfied. We propose to help these users by exploiting an existing collection of discussion threads.

Often many users suffer from the same medical condition and start multiple discussion threads on very similar queries. In this paper we develop and evaluate several specialized search methods that treat an entire unresolved forum post as a query, and retrieve forum threads discussing similar problems to help resolve it. The task is more challenging than a traditional document retrieval problem, since forum posts can contain a lot of irrelevant background information. The discussion threads to be retrieved are also quite different from traditional unstructured text documents. We evaluate our results on a dataset comprising over 350K discussion threads and show that our proposed methods outperform state-of-the-art retrieval methods for the task. In particular, methods based on non-uniform weighting of thread posts and semantic analysis of the query text perform quite well.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

General Terms

Algorithms, Experimentation

Keywords

Medical case retrieval, recommender system, web forums, shallow information extraction, forum thread retrieval

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

’14, September 20–23, 2014, Newport Beach, CA, USA. Copyright 2014 ACM 978-1-4503-2894-4/14/09...$15.00. http://dx.doi.org/10.1145/2649387.2649399

1. INTRODUCTION

Users often consult online sources such as medical web pages, clinicians' blogs, or medical web forums when they are faced with medical problems. A typical session may consist of users browsing through the web to find answers, or posting medical questions on the web. These behaviors serve to reassure, confirm, or educate the users about the symptoms or treatments that they are curious about. Healthcare forums such as HealthBoards¹ and MedHelp² provide a platform to users for getting answers to their medical case queries. A typical forum thread contains a case query in its first post, and a discussion around it in subsequent posts. An example is shown in Figure 1.

Figure 1: Sample healthcare discussion thread

A user may prefer to post her medical query on a web forum for several reasons. These include difficulty in decomposing a complex medical case query into short keyword queries suitable for web search, lack of skill to accurately perform web searches, or simply the desire to obtain information from human experts. In related work [23], Liu and coauthors show that in many cases failed searches lead users to ask questions on web communities, and study this transition in detail. As a result, web forums have become extremely popular. For example, HealthBoards and MedHelp receive over 10 and 12 million visitors each month, respectively.³,⁴

However, prior research has also shown that a sizeable percentage of user queries aren't satisfactorily resolved [3]. One way to help these users is by exploiting an existing collection of discussion threads. Often many users suffer from the same medical condition and start multiple discussion threads on very similar case queries. As a result, many queries may be resolved by directing users to relevant existing threads in the collection.

¹ http://www.healthboards.com
² http://www.medhelp.org
³ http://www.healthboards.com/about.php
⁴ http://www.medhelp.org/aboutus.htm

In this paper we propose a novel information provision paradigm for web forums. We envision an autonomous agent that automatically responds to an unresolved user query by posting an automated response containing links to threads discussing similar medical problems. A sample response is shown in Figure 2.

The following threads discuss similar problems:

• Doritos Allergy Very Severe and New

• Certain Foods + Beer = Flushing and Head Pounding…Help!

• Peanut/Food Allergies

…

Figure 2: A medical query with its envisioned automatically generated response containing threads discussing similar problems.

From a high-level computational point of view, the main challenge in realizing our vision is to develop methods capable of finding threads similar to a given case query. In this paper we focus on this relevance challenge. We treat the first post of an unresolved thread as a medical case query, and retrieve similar threads from a collection. This setup differs from a traditional retrieval task in two important aspects. First, the case query, which is formulated primarily to elicit responses from other laypersons in the community, can contain a fair amount of background non-medical information such as emotional statements like “Help me!” or “I’m fed up with this”. It therefore becomes necessary to separate medical case related information from such background. This task becomes especially difficult since medical entity extractors do not always work well due to the noisy nature of forum data. The second difference stems from the nature of the documents to be retrieved, i.e., relevant forum threads. A forum thread is more than just a bag or sequence of words. It is subdivided into a sequence of well-defined posts, and this internal structure must be considered when formulating a representation.

In this paper, we raise several questions and propose hypotheses to better understand and tackle the case retrieval task. Our findings are: 1) Forum posts towards the end of the thread are less useful than those at the beginning. 2) Boosting medically relevant sentences helps with retrieval, while boosting medical entities does not. 3) Forum categories are useful signals in improving the retrieval task. We evaluate the methods on a collection of over 350K healthcare forum threads and report that our proposed methods improve performance over the baseline approach.

The rest of this paper is organized as follows. Related work is described in Section 2. Section 3 formally defines the problem of case retrieval. Our approach is described in detail in Section 4. Section 5 and Section 6 describe our experiments and show their results. In particular, we first analyze the effectiveness of each of the semantic types we introduce in this paper, and then analyze the impact of the parameters. We conclude and propose future work in Section 7.

2. RELATED WORKS

Medical experts base their diagnoses on a mixture of textbook knowledge and experience acquired through clinical trials [28]. This practice has motivated case-based reasoning (CBR) [2] aid systems, which help practitioners diagnose patients by comparing the current case with previous, similar ones. These systems [21, 20, 28] extract relevant cases from electronic medical records (EMR) to help professionals better diagnose current patients. EMRs, however, contain confidential information, and it is not possible for average users to retrieve these documents. Medical forums can serve as a source of cases for these users instead.

To the best of our knowledge, there has not been any work that retrieves medical cases from medical web forums. A close analogy to our work is recommender systems. These systems have been used to recommend products [31] or various web documents [19, 22, 33], and can be divided into two different approaches. Collaborative filtering utilizes users' previous behaviors, activities, or preferences to recommend what the users might want next [8], and is especially useful when it is possible to exploit similarities between users and products. Content-based filtering, on the other hand, leverages content such as text or structured data to aid the recommendation process. The latter approach has been explored in news [19, 22], micro-blogs [13] and forum thread recommendations [33]. Our approach is based on content-based filtering because users often have a limited scope of interest, and individual users tend to post a very limited number of threads.

Another line of related work is retrieving relevant documents from question-and-answer archives. These tasks range from extracting documents from question-and-answer communities such as Yahoo! Answers [15, 36] to utilizing frequently asked questions pages [17]. In these archives laypersons generate the answers to questions, which makes the task similar, to some degree, to retrieving forum posts. However, question-answer archives do not have thread structures whereas forum posts do, which allows forum retrieval systems to exploit complex user interactions.

There is also some existing work on forum retrieval tasks. Some studies have looked into forum search [32, 10, 4] and others into recommender systems [33]. These papers have shown that utilizing thread structures helps forum retrieval tasks. We have similarly exploited thread structures [27] to aid the retrieval task. What separates our work from previous work is that we utilize semantics that suit medical case retrieval, such as medical entities and the description type of each sentence, which none of the aforementioned forum-based systems have leveraged.

Medical forums have been used to generate medical hypotheses, such as predicting drug effectiveness [7, 9], clustering outcomes [16], or summarizing effects of addiction [14]. While we also use medical web forums as our source, our task is more general in that we retrieve medical cases rather than generate medical hypotheses.

3. FORMALIZING THE FORUM CASE RETRIEVAL PROBLEM

We treat the problem of similar thread finding as a specialized retrieval task with the first post of an unresolved thread as a query and each thread in our existing thread archive as a document. In this section, we start by providing some definitions to make the problem more precise and then discuss the various design objectives which guide the development of our methods.

Definition 1 (Forum Post): A forum post p is a sequence of words in a vocabulary set V.

Definition 2 (Forum Thread): A forum thread t is a sequence of posts, i.e., $t = [p_1, ..., p_L]$, where $p_i$ is the i-th post in the thread. In subsequent discussion we will also frequently refer to a forum thread as a document.

Definition 3 (Collection): A collection C is defined as a set of forum threads $C = \{t_1, t_2, ..., t_n\}$, where $t_i$ is a thread.

Definition 4 (Case Query): A case query q is defined as the sequence of words in a vocabulary set V which is input to the system for finding similar cases.

Our goal in forum case retrieval is, given a query q, to assign a relevance score $Score(q, t_i)$ to each thread $t_i$ in the collection C and return a list of the top 5 threads ranked by their relevance scores as output.
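The definitions above map onto simple data types. The following Python sketch is an illustration only: the type aliases, the `retrieve_top_k` helper, and the toy `overlap_score` function are hypothetical names that do not appear in the paper, and any function implementing Score(q, t) could be passed in instead.

```python
from typing import Callable, List, Tuple

Post = List[str]      # Definition 1: a post is a sequence of words
Thread = List[Post]   # Definition 2: a thread is a sequence of posts
Query = List[str]     # Definition 4: a case query is a sequence of words


def retrieve_top_k(
    query: Query,
    collection: List[Thread],
    score: Callable[[Query, Thread], float],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Score every thread in the collection and return the k best
    (thread index, score) pairs, highest score first."""
    scored = [(idx, score(query, thread)) for idx, thread in enumerate(collection)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap_score(query: Query, thread: Thread) -> float:
    """Toy placeholder scorer (an assumption, not the paper's method):
    the number of query tokens that occur anywhere in the thread."""
    thread_words = {w for post in thread for w in post}
    return float(sum(1 for w in query if w in thread_words))
```

Keeping the scoring function as a parameter means the same ranking loop can be reused for every retrieval variant discussed later.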

4. FORUM CASE RETRIEVAL METHODS

We start by evaluating the performance of a state-of-the-art baseline retrieval method and then extend it by incorporating various task-related characteristics to improve performance. Our main goal is to study five high-level questions.

1. How well do state-of-the-art traditional retrieval methods perform?
We study this question by directly applying state-of-the-art retrieval methods and evaluating performance. We treat these methods as our baseline.

2. Do better thread representations help improve performance?
To study this question we utilize thread representations that don't treat the content of all posts in a thread equally. We refer to these as Post Weighting methods.

3. Does forum category relevance help improve performance?
We investigate this question by weighting threads differently based on the number of forum categories that are retrieved. We refer to these as forum category weighting methods.

4. Does incorporating medical semantic information help improve performance?
It is important to separate crucial medical case related keywords from the sizable background information present in a query. We explore two different semantic methods for extracting important medical case related keywords, and show how we can incorporate this information into retrieval functions. We refer to these as semantic weighting methods.

5. Does combining different methods help?
We also study whether combination methods that combine two or more methods together outperform individual methods.

In subsequent sections we discuss these in detail.

4.1 Baseline Approaches

We use the widely used BM25 retrieval model as a baseline for the task. BM25 was originally developed by Robertson et al. [30] for the TREC ad-hoc filtering task and has since been extremely popular as a state-of-the-art general retrieval method. The function can be efficiently computed even on fairly large collections. The relevance score of a thread t to a query q is computed as

$$Score(q, t) = \sum_{w \in V} \log\frac{N - n(w) + 0.5}{n(w) + 0.5} \cdot \frac{c(w, t)\,(k_1 + 1)}{c(w, t) + k_1\left(1 - b + b\,\frac{|t|}{avgtl}\right)} \cdot \frac{(k_3 + 1)\,c(w, q)}{k_3 + c(w, q)} \tag{1}$$

where w ∈ V represents a word in vocabulary V. N is the total number of threads in the collection C. n(w) represents the number of threads in the collection that contain word w. c(w, t) and c(w, q) represent the frequency of w in the thread t and query q respectively. Finally, |t| is the length of a thread in terms of the total count of all words appearing in it, and avgtl is the average length of all the threads present in the collection. The parameter $k_1$ is generally set between 1 and 2, $k_3$ between 0 and 1000, and b to 0.75.
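As a rough illustration of Equation 1, the Python sketch below computes the score for one query-thread pair from precomputed statistics. The function name `bm25_score`, its argument layout, and the default parameter values are assumptions made for this example; the paper's experiments rely on Lucene rather than on code like this.

```python
import math
from typing import Dict


def bm25_score(
    query_counts: Dict[str, float],   # c(w, q)
    thread_counts: Dict[str, float],  # c(w, t)
    thread_len: float,                # |t|
    avg_len: float,                   # avgtl
    n_threads: int,                   # N
    doc_freq: Dict[str, int],         # n(w)
    k1: float = 1.2,                  # assumed default in the usual 1-2 range
    k3: float = 1000.0,               # assumed default in the usual 0-1000 range
    b: float = 0.75,
) -> float:
    """Sum the three BM25 factors of Equation 1 over the query words.

    Words with c(w, q) = 0 or c(w, t) = 0 contribute nothing to the sum,
    so iterating over the query words that occur in the thread is enough.
    """
    score = 0.0
    for w, cq in query_counts.items():
        ct = thread_counts.get(w, 0.0)
        if ct == 0 or cq == 0:
            continue
        n_w = doc_freq.get(w, 0)
        idf = math.log((n_threads - n_w + 0.5) / (n_w + 0.5))
        tf_thread = ct * (k1 + 1) / (ct + k1 * (1 - b + b * thread_len / avg_len))
        tf_query = (k3 + 1) * cq / (k3 + cq)
        score += idf * tf_thread * tf_query
    return score
```

The counts are typed as floats on purpose, so that fractional, re-weighted counts can be substituted for the raw frequencies. Note that with this form of the first factor, a word occurring in more than half of the threads receives a negative weight.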

We use two different baseline approaches based on how the content of a thread is represented.

4.1.1 Thread BM25 (T-BM25)

Under this method a thread is considered as a bag of words containing all its posts. We give equal importance to each post, and the thread keyword frequency c(w, t) of each word w is calculated by counting all its occurrences in all the posts.

4.1.2 First Post BM25 (FP-BM25)

Under this method the thread keyword frequencies are obtained by considering only the keywords in the first post of the thread. The content of all subsequent posts is ignored. This approach assumes that the first post in a thread is likely the most representative of its case and hence its keywords are most critical.

4.2 Position Based Post Weighting

So far, we have always treated each post in a thread equally. However, intuitively, not all posts in a thread are equally good in reflecting the problem being discussed in a thread. Indeed, the first post of a thread often defines the medical case to be discussed. The following posts possibly suggest solutions to the problem posed or veer into other lines of discussion. As a result, not all posts are equally representative of the thread and non-uniform weighting of posts is presumably beneficial.

The challenge then is to assign non-uniform weights to posts so that we can potentially improve the content representation of a thread. Below we propose using two different schemes: Monotonic Post Weighting and Parabolic Post Weighting.

[Figure 3 diagram: the three posts contribute $f(1,3)\,c(w, p_1)$ through $f(3,3)\,c(w, p_3)$, which are summed to give $c'(w, t) = \sum_{i=1}^{3} f(i,3)\,c(w, p_i)$.]

Figure 3: Sample post weighting for K = 3. f(i, K) gives the weight of post i in a thread with K posts.

The two schemes were initially introduced to cluster forum threads [27]; we summarize them here. It should be noted that instead of utilizing the technique to cluster forum threads, as is the case in [27], we take a step further and apply the techniques to retrieve relevant threads.

In both the Monotonic Post Weighting and Parabolic Post Weighting schemes, weights are assigned to posts based on their relative position in the thread. Specifically, let t be a thread with K posts $[p_1, p_2, ..., p_K]$. The weight $pw(p_i, t)$ of a post $p_i$ in t is then solely a function of the position variable i and the number of posts in the thread K, i.e.,

$$pw(p_i, t) = f(i, K), \quad 1 \le i \le K$$

Once defined, the weight is incorporated in the relevance scoring function by replacing the thread word count term c(w, t) in Equation 1 with an altered thread term count c′(w, t) defined as

$$c'(w, t) = \sum_{i=1}^{K} f(i, K)\, c(w, p_i) \tag{2}$$

where $c(w, p_i)$ is the count of word w in post $p_i$. See Figure 3 for an example. The exact definition of the function f() varies in the two schemes. We use $f_m()$ to represent the monotonic and $f_p()$ to represent the parabolic post weighting function.
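The position-weighted count of Equation 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `weighted_thread_counts` and `uniform_weight` are hypothetical names, posts are assumed to be pre-tokenized lists of words, and the weight function is passed in so that either weighting scheme can be supplied.

```python
from collections import Counter
from typing import Callable, Dict, List


def weighted_thread_counts(
    thread: List[List[str]],
    weight_fn: Callable[[int, int], float],
) -> Dict[str, float]:
    """Compute c'(w, t) = sum_i f(i, K) * c(w, p_i) for every word w.

    `thread` is a list of tokenized posts; `weight_fn(i, K)` plays the
    role of f(i, K) with 1-based post position i.
    """
    K = len(thread)
    counts: Dict[str, float] = {}
    for i, post in enumerate(thread, start=1):
        f = weight_fn(i, K)
        for w, c in Counter(post).items():
            counts[w] = counts.get(w, 0.0) + f * c
    return counts


def uniform_weight(i: int, K: int) -> float:
    """Equal weight for every post, normalized to sum to one."""
    return 1.0 / K
```

The resulting dictionary of fractional counts can stand in for c(w, t) wherever Equation 1 expects a thread word count.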

4.2.1 Monotonic Post Weighting

In this scheme, we hypothesize that the representativeness or the importance of a post in a thread reduces as its position in the thread increases, i.e., the later a person posts in a thread, the lower the weight of the post. This is best represented by a weighting scheme that monotonically decreases the weight assigned to posts as their positions increase. In our experiments, we use the following function:

$$f_m(i, K) = \frac{\left(\frac{1}{i}\right)^{\beta_m}}{\sum_{j=1}^{K} \left(\frac{1}{j}\right)^{\beta_m}} \tag{3}$$

where $\beta_m$ is a decay parameter that can be tuned to adjust the rate at which the weight of the post reduces as the post's position increases. The weight normalization included in the denominator ensures that the weights assigned to posts of a thread sum up to unity. When $\beta_m = 0$, all posts receive equal weights. As $\beta_m$ increases, the drop in post weights becomes more pronounced and less gradual.
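To make the behavior of Equation 3 concrete, the sketch below computes the monotonic weights for all K positions at once. The function name `monotonic_weights` is an assumption introduced for this example.

```python
from typing import List


def monotonic_weights(K: int, beta_m: float) -> List[float]:
    """Return [f_m(1, K), ..., f_m(K, K)] as defined in Equation 3.

    Each raw weight is (1 / i) ** beta_m; dividing by their sum makes
    the weights of one thread add up to one.
    """
    raw = [(1.0 / i) ** beta_m for i in range(1, K + 1)]
    total = sum(raw)
    return [r / total for r in raw]
```

For a three-post thread, `monotonic_weights(3, 0.0)` gives each post a weight of one third, while `monotonic_weights(3, 1.0)` gives 6/11, 3/11 and 2/11, so the first post already carries more than half of the total weight.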

4.2.2 Parabolic Post Weighting

The parabolic weighting scheme is based on the observation that, oftentimes, discussion in a thread stops once a solution to the problem has been posted. Hence, intuitively, while the initial posts tend to represent the problem well, the posts towards the end are more likely to be representative of the solution. Since the critical aspects of a thread are likely to depend both on the problem keywords as well as the solution keywords, we should assign higher weights to both the initial and the final posts of a thread. This is well modeled by the following skewed parabolic function:

$$g_p(i, K) = \frac{(i - \beta_p K)^2}{(1 - \beta_p K)^2}, \qquad f_p(i, K) = \frac{g_p(i, K)}{\sum_{j=1}^{K} g_p(j, K)} \tag{4}$$

where K is the total number of posts in the thread and $\beta_p \cdot K$ is the position at which the post weight is minimum. We first calculate a function $g_p(i, K)$, which is parabolic w.r.t. the position variable i, and obtain the final post weights $f_p(i, K)$ by normalizing to ensure a unity sum for the weights. Because the constant factor $(1 - \beta_p K)^2$ appears in every $g_p$ term, it cancels in the ratio, so this normalization also eliminates the divide-by-zero exception when $\beta_p = 1/K$.

It should be noted that $\beta_p = 0.5$ represents the case where the initial posts of a thread are weighted equally with the final posts. As $\beta_p$ increases, the initial posts are assigned higher and higher weights relative to the final posts. When $\beta_p = 1$, the weighting function transforms into a monotonically decaying function with the final posts assigned the least weight.
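The parabolic weights of Equation 4 can be sketched as follows. The function name `parabolic_weights` is hypothetical, and the code uses the algebraically simplified form in which the constant denominator of $g_p$ has already been cancelled, which is an implementation choice made for this example rather than something the paper specifies.

```python
from typing import List


def parabolic_weights(K: int, beta_p: float) -> List[float]:
    """Return [f_p(1, K), ..., f_p(K, K)] from Equation 4.

    The factor (1 - beta_p * K) ** 2 is shared by every g_p term and
    cancels during normalization, so only (i - beta_p * K) ** 2 is needed.
    """
    raw = [(i - beta_p * K) ** 2 for i in range(1, K + 1)]
    total = sum(raw)
    if total == 0.0:
        # Only possible for K == 1 with beta_p == 1: a single post gets all weight.
        return [1.0 / K] * K
    return [r / total for r in raw]
```

With K = 4 and $\beta_p = 1$ the raw values are 9, 4, 1 and 0, so the weights decay monotonically and the final post receives zero weight. Because positions run from 1 to K, the minimum at $\beta_p K$ sits slightly before the exact midpoint of the thread, so for short threads the balance between the first and last posts at $\beta_p = 0.5$ is only approximate.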

4.3 Forum Category Weighting

Medical forum threads are usually organized under high-level disease categories as shown in Figure 4. Forum category feedback is motivated by the idea that the cases users are interested in are likely to be limited to only a few forum categories. As an example, a breast cancer patient searching for treatment comparison is not likely to find threads from allergy forums very useful. By boosting forum categories that are likely to contain cases that users may find interesting, and de-emphasizing those that may not, it is possible to improve retrieval performance. This is a convenient extension of case retrieval because forums, by nature, are divided into many coarse and fine-grained categories.

4.3.1 Forum Category Uniform Weighting

In the forum category uniform weighting (FCUW) scheme, we boost by $\gamma_u$ the scores of documents whose categories are among the top k most frequent categories in the retrieved documents. The intuition behind this approach is that the top k forum categories most likely contain the cases that users are interested in. More formally, we set the new score for query q and thread t as

$$Score(q, t) = Score(q, t) + \gamma_u$$

for all threads whose categories are among the top k most frequent in the retrieved results. We set k = 5 for our experiments, and found $\gamma_u$ through cross validation.
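The uniform boost can be sketched as a re-scoring pass over an initial result list. In this hypothetical example each result is a `(thread_id, category, score)` tuple and `fcuw_rescore` is a name introduced here; how ties between equally frequent categories are broken is not stated in the text, so the sketch simply follows the order returned by `Counter.most_common`.

```python
from collections import Counter
from typing import List, Tuple

Result = Tuple[str, str, float]  # (thread_id, forum category, score)


def fcuw_rescore(results: List[Result], gamma_u: float, k: int = 5) -> List[Result]:
    """Add gamma_u to every thread whose category is among the k most
    frequent categories in the retrieved list, then re-rank by score."""
    freq = Counter(category for _, category, _ in results)
    top_categories = {category for category, _ in freq.most_common(k)}
    rescored = [
        (tid, category, score + gamma_u if category in top_categories else score)
        for tid, category, score in results
    ]
    rescored.sort(key=lambda r: r[2], reverse=True)
    return rescored
```

A thread from a dominant category can overtake a higher-scored thread from a rare category once the boost is applied, which is the intended effect of the scheme.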

Figure 4: Forum categories under the topic ‘mental health.’

4.3.2 Forum Category Feedback Weighting

We model forum category feedback weighting as follows. Based on the user's case query, forum threads are retrieved from the system. Each of the retrieved threads has a forum category associated with it. If the categories were labeled randomly, each category would appear $c(r)/|C|$ times, where C represents the set of all possible forum categories and c(r) represents the number of retrieved documents. On the other hand, if categories are not labeled randomly, we can define the probability as $p(ForumId = f \mid r) = c(t, f)/c(r)$, where c(r) is the number of retrieved documents and c(t, f) is the count, among the retrieved documents, of the forum category f that thread t is associated with. Forum categories that appear less often than random chance should be penalized, while those that appear more often should be boosted. Using this intuition, we model the feedback for thread t by using the log-likelihood ratio as follows:

$$FCFW(t) = \log \frac{p(ForumId = f \mid r)}{p(ForumId = f)} \tag{5}$$

The new score for each of the threads is now defined as

$$Score(q, t) = Score(q, t) + \gamma_f \cdot FCFW(t)$$

where $\gamma_f$ is a tunable parameter.

Forum category feedback weighting (FCFW) is analogous to relevance feedback [29] from the existing literature. In particular, ours is closely related to pseudo-relevance feedback [6, 24]. The difference between pseudo-relevance feedback and FCFW is that we directly exploit forum categories, whereas previous works focused more on modeling words or topics. Notice that unlike the FCUW scheme, the FCFW scheme does not need any hard constraint on the number of top k categories to boost, since the number of retrieved threads dictates the weight of the category of interest.
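A minimal Python sketch of the feedback re-scoring is given below. It assumes that the prior p(ForumId = f) is uniform over the categories, i.e. 1/|C|, which matches the random-labeling argument in the text but is not stated explicitly; the function name `fcfw_rescore` and the `(thread_id, category, score)` layout are likewise assumptions for this example.

```python
import math
from collections import Counter
from typing import List, Tuple

Result = Tuple[str, str, float]  # (thread_id, forum category, score)


def fcfw_rescore(results: List[Result], n_categories: int, gamma_f: float) -> List[Result]:
    """Add gamma_f * log(p(f | r) / p(f)) to each retrieved thread's score.

    p(f | r) is the share of retrieved threads in category f; p(f) is
    assumed to be the uniform prior 1 / n_categories.
    """
    n_retrieved = len(results)
    freq = Counter(category for _, category, _ in results)
    prior = 1.0 / n_categories
    rescored = []
    for tid, category, score in results:
        p_f_given_r = freq[category] / n_retrieved
        fcfw = math.log(p_f_given_r / prior)
        rescored.append((tid, category, score + gamma_f * fcfw))
    rescored.sort(key=lambda r: r[2], reverse=True)
    return rescored
```

Categories that occur more often than chance get a positive log ratio and are boosted, categories at chance level are left unchanged, and rarer ones are penalized. Every category passed in occurs at least once in the result list, so the logarithm is always defined.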

4.4 Semantic Weighting Approaches

The goal of semantic weighting approaches is to help identify query keywords that are most representative of the case, and weigh them separately from the background. This is achieved by perturbing the query frequency c(w, q) of certain keywords by introducing additional parameters. In particular, we wish to find out which granularity (entity level versus sentence level) of semantic weighting approaches works better. We introduce two methods that operate at different levels of granularity.

4.4.1 Medical Entity Extraction

Our first semantic weighting approach was to use an out-of-the-box medical entity extractor to identify individual medical case related entities from the query text. Various medical entity extractors are available for the purpose, but only ADEPT [25] has been specifically trained on medical forums. The algorithm is based on Conditional Random Fields, and the authors have shown that it achieved an F1 score of 0.84, while all the other algorithms that were trained on domains other than medical forums, including MetaMap [1], which is popularly used for literature data, achieved F1 scores below 0.5.

The ADEPT framework treats the problem of finding medical phrases as a named entity recognition problem. The authors utilized the Stanford Named Entity Recognizer [12] to train a 2-class (medical terms vs. non-medical terms) model. The training data were annotated by crowdsourcing the task to Amazon's Mechanical Turk [35]. Mechanical Turk workers were asked to find words or phrases that denote medical terms in forum posts. A reasonably high inter-rater reliability score of 0.707 on the Fleiss κ measure was achieved using this approach.

The extracted medical entities are incorporated into the relevance scoring function by altering word query counts. In the baseline method, the count of a word in a query c(w, q) is set to the number of occurrences of w in q. In this method, once we have applied a medical entity extractor, all word occurrences are either labeled as being a medical entity or not. Let #med(w, q) be the number of occurrences of word w that are labeled as a medical entity in query q and #nonmed(w, q) be the number that are not. We then replace the count c(w, q) in Equation 1 by an altered count with a tunable parameter $\alpha_m$:

$$c'(w, q) = \alpha_m \cdot \#med(w, q) + \#nonmed(w, q)$$

While the approach is easy to apply and can identify entities with a fairly high precision, it does not necessarily work well on forum text, where many keywords representing non-standard medical entities may also be very important in representing a medical case. For example, in the query shown in Figure 5 we find that even though terms like Tostitos and Doritos are crucial to representing the patient's case, they are not identified as medical entities. This issue is dealt with in our next approach, where we perform sentence level extraction.
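The entity-level re-weighting of query counts can be sketched as below. The sketch assumes that an external extractor has already produced one boolean label per query token; the function name `entity_weighted_query_counts` and this input format are hypothetical and are not ADEPT's actual interface.

```python
from typing import Dict, List


def entity_weighted_query_counts(
    tokens: List[str],
    is_medical: List[bool],
    alpha_m: float,
) -> Dict[str, float]:
    """Compute c'(w, q) = alpha_m * #med(w, q) + #nonmed(w, q).

    `is_medical[i]` says whether the i-th token occurrence was labeled
    as (part of) a medical entity by some upstream extractor.
    """
    counts: Dict[str, float] = {}
    for word, medical in zip(tokens, is_medical):
        counts[word] = counts.get(word, 0.0) + (alpha_m if medical else 1.0)
    return counts
```

Because labels attach to occurrences rather than to word types, the same word can contribute both boosted and unboosted counts within one query.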

Figure 5: Sample medical entity extraction using ADEPT. allergic, stomach cramps and sleep are identified as medical entities. Tostitos and Doritos are not identified.

4.4.2 Shallow Medical Information Extraction

I am severly allergic to some product that is found in both Tostitos and Doritos, as well as random other types of chips. I know the solution is "don't eat chips" but what could the product be? I don't want to accidentally consume it. When I eat this, I get very bad stomach cramps and it ruins the rest of my day/night - the only solution is to go to sleep so I can't feel it. Help! Any ideas on this?

Legend: Physical Examination (PE): Disease, Symptoms. Medication (MED): Treatment, Prevention. Background (BKG): Neither PE nor MED.

Figure 6: An example of PE (green), MED (red) and BKG (brown) sentences.

The second approach operates at a coarser level of granularity. In this case we label entire sentences as being representative of medical information, rather than individual keywords. Each sentence in the query text is assigned one of the following three categories. An example of this kind of labeling is shown in Figure 6.

1. Physical Examination (PE): The sentence contains the description of diseases, symptoms etc.

2. Medication (MED): The sentence provides a description of treatment, medications or other measures taken to resolve the disease.

3. Background (BKG): The sentence is not covered by either PE or MED. This category often covers sentences exhibiting an emotional response.

The main intuition behind such a labeling is to separate the critical case related sentences from the background. We restrict ourselves to only three classes, since it is possible to train reasonably accurate classifiers for them and at the same time they are sufficient to represent the most prominent types of sentences appearing in medical forum texts.

The extraction was performed using a support vector machine based classifier, which has been found to be the most suitable for the task. More details regarding shallow information extraction may be found in [34].

Once the labeling of sentences is complete, we employ a weighting technique similar to that in the previous section, only this time two parameters are introduced. Let #pe(w, q), #med(w, q) and #bkg(w, q) be the number of times word w appears in PE, MED and BKG labeled sentences of query q. The modified relevance scoring function is obtained by replacing the query count c(w, q) in Equation 1 by c′(w, q) defined as

$$c'(w, q) = \alpha_{pe} \cdot \#pe(w, q) + \alpha_{med} \cdot \#med(w, q) + \#bkg(w, q)$$

where $\alpha_{pe}$ and $\alpha_{med}$ are tunable parameters.
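The sentence-level counterpart can be sketched in the same way. Here the input is assumed to be a list of `(label, tokens)` pairs produced by some upstream sentence classifier; the function name `sentence_weighted_query_counts` and this data layout are hypothetical choices for the example.

```python
from typing import Dict, List, Tuple


def sentence_weighted_query_counts(
    labeled_sentences: List[Tuple[str, List[str]]],
    alpha_pe: float,
    alpha_med: float,
) -> Dict[str, float]:
    """Compute c'(w, q) = alpha_pe * #pe + alpha_med * #med + #bkg.

    Each sentence carries one label in {"PE", "MED", "BKG"}; every word
    occurrence inherits the weight of the sentence it appears in.
    """
    weights = {"PE": alpha_pe, "MED": alpha_med, "BKG": 1.0}
    counts: Dict[str, float] = {}
    for label, tokens in labeled_sentences:
        weight = weights[label]
        for word in tokens:
            counts[word] = counts.get(word, 0.0) + weight
    return counts
```

Unlike the entity-level variant, a word such as a product name is boosted whenever it appears in a PE or MED sentence, even if no extractor would recognize it as a medical term.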

4.5 Combination Methods

From the method details discussed above it is clear that while semantic weighting techniques alter query word counts, the post weighting techniques alter thread word counts. Thus it is also possible to combine these methods together. More specifically, c′(w, q) is generated using one of the semantic weighting methods and c′(w, t) is generated using one of the post weighting methods. These are then plugged into Equation 1 for generating relevance scores. Since each technique represents a different heuristic, we expect the performance of combination methods to be better than that of their constituent methods.

5. EXPERIMENTS

5.1 Evaluation Set Construction

Our document collection comprised 350K threads crawled from HealthBoards⁵, which is the largest healthcare forum on the web. We stored all of the threads in XML format, and then indexed the documents using the Apache Lucene Java search library [26].

The evaluation was done using 20 queries. Judgments were created via pooling [18], a strategy commonly used in information retrieval evaluation. For each query, the top 10 retrieved threads from all our methods were pooled together and judged as either relevant or non-relevant by a human expert. In all, over 730 query-thread pairs were judged by two judges. Of these, 324 threads were found to be relevant and 406 irrelevant.

In order to ensure consistency in judgments, the two judges first both labeled the same set of 100 query-thread pairs to check for inter-annotator agreement. The only annotation guideline was to consider the similarity between the symptoms and intent of the query and the retrieved thread while making the judgment. 88% of the judgments were found to be in agreement and Cohen's kappa was found to be 0.76. This suggested a reasonably high agreement, implying that the annotation task was fairly well defined. This data set will be made available at [retracted link].
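For readers unfamiliar with the agreement statistic, the sketch below shows how raw agreement and Cohen's kappa are computed from two judges' labels. The function name `cohen_kappa` is introduced for this example and any labels passed to it here are made up for illustration; they are not the judgments collected in this study.

```python
from collections import Counter
from typing import Hashable, List


def cohen_kappa(labels_a: List[Hashable], labels_b: List[Hashable]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e).

    p_o is the observed fraction of items on which the two judges agree;
    p_e is the agreement expected by chance from each judge's label
    frequencies. Assumes equal-length, non-empty lists and p_e < 1.
    """
    n = len(labels_a)
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```

Kappa is lower than raw agreement because it discounts the matches that two judges would produce by chance alone.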

5.2 Experiment Design

All evaluations are conducted using the Apache Lucene Java search library [26]. Lucene is a highly scalable search engine that is widely used both in industry and in academic settings. The framework supports structured search and customized weighting schemes, both of which we have used to weigh different medical case descriptions, medical terms, and posts in forums.

Our first goal was to address the questions raised in Section 4. In particular, we started from how well traditional retrieval methods perform on the case retrieval task. Based on the intuition we gained from this task, we investigated utilizing thread and forum structures to improve retrieval performance. We then investigated which types of semantic information help with retrieval, and combined the proposed techniques to improve retrieval performance. All methods were evaluated using 5-fold cross validation.

⁵ http://healthboards.com

The second goal was to look into the sensitivity and stability of parameters. We have introduced a number of parameters through our various methods. We wanted to study how robust the performance is to some of the important parameters. Such an analysis can provide us with valuable insights on the stability of our proposed techniques.

All of our results are tested against FP-BM25 using the Wilcoxon signed-rank test at the 0.05 level. Unless otherwise noted, significance is denoted by ∗, and values in parentheses are performance improvements over FP-BM25.

5.3 Evaluation Criteria

Performance of each method is measured using Precision at 5 (P@5), Precision at 10 (P@10), Recall at 30 (R@30) and Mean Average Precision (MAP). P@5 and P@10 represent the percentage of relevant documents in the top 5 and top 10 results. R@30 represents the percentage of all the relevant documents in our collection that are present in the top 30 results. Mean Average Precision is the arithmetic mean of average precision values over a set of queries. Suppose, for some ranking $r_i \in R$, there are $k_i$ relevant documents in the whole collection. Further, let rank(j) be the rank of the j-th relevant document and P(rank(j)) be the precision at the rank of the j-th relevant document. Then

$$P(rank(j)) = \frac{\#\,\text{relevant documents up to } rank(j)}{\#\,\text{documents up to } rank(j)} = \frac{j}{rank(j)}$$

The average precision of a ranking $r_i \in R$ and the mean average precision over a set of rankings $R$ are then given by:

$$AP(r_i) = \frac{\sum_{j=1}^{k_i} P(\mathrm{rank}(j))}{k_i}, \qquad MAP(R) = \frac{\sum_{r_i \in R} AP(r_i)}{|R|}$$

AP intuitively captures the average of the precision values at every point where a new relevant document is retrieved.
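The four measures can be written down directly from these definitions. The sketch below is illustrative rather than the evaluation code used in the paper. It assumes each ranking is a list of unique thread identifiers and that relevant threads missing from the ranking contribute zero precision, a common convention that the text does not state explicitly.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of P(rank(j)) = j / rank(j) over retrieved relevant docs, divided by k_i."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `["a", "b", "c", "d", "e"]` with relevant set `{"a", "c", "x"}`, the relevant documents are found at ranks 1 and 3, so AP is (1/1 + 2/3) / 3, with the unretrieved document `"x"` contributing nothing to the numerator.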

6. RESULTS

6.1 How well do state of the art traditional retrieval methods perform?

In this section, we compared Thread BM25 (T-BM25) and First Post BM25 (FP-BM25). If FP-BM25 performs better than T-BM25, it implies that first posts are more helpful than the rest of the posts in retrieving relevant posts. On the other hand, first posts have low retrieval utility if T-BM25 performs better than FP-BM25. Table 1 shows the retrieval results for T-BM25 and FP-BM25.

        T-BM25             FP-BM25
P@5     0.3000 (−36.2%)    0.4700
P@10    0.2200 (−43.6%)    0.3900
R@30    0.2846 (−42.8%)    0.4975
MAP     0.1977 (−40.4%)    0.3316

Table 1: Retrieval performance for T-BM25 and FP-BM25.

FP-BM25 performed significantly better than T-BM25 on all four metrics. This indicates that the first post in a thread is very informative for retrieving relevant cases, and that techniques leveraging thread structure should incorporate this information. Based on this intuition, we next explored monotonic and parabolic post weighting approaches to investigate how much weight should be given to the first post and to the subsequent posts in a thread.

6.2 Do better thread representations help improve performance?

We compared monotonic post weighting and parabolic post weighting against the baseline method, FP-BM25. Depending on the characteristics of the forum, monotonic post weighting may outperform parabolic post weighting or vice versa [27]. Here, we have two hypotheses. If posts at the end of a thread tend to provide good medical advice to users, the values of βm and βp would be small. In such cases, the parabolic post weighting approach will perform better than monotonic post weighting, since the latter cannot assign higher weights to the posts that end a thread. On the other hand, if users derive immediate help from the posts closer to the top, both βm and βp will be high. Our results are shown in Table 2.

We observed improvements of 8.5% and 9.5% on P@5 and MAP, respectively, over FP-BM25 by using the monotonic post weighting scheme (at βm = 3). For parabolic post weighting, optimal performance was observed at βp = 0.8, suggesting that the posts at the end of the thread were not as important as those closer to the first post. Results from these two weighting schemes suggest that, from the perspective of medical case retrieval, forum posts towards the end of a thread are less useful than those at the beginning.

        FP-BM25    MonotonicBM25     ParabolicBM25
P@5     0.4700     0.5100∗ (8.5%)    0.5100∗ (8.5%)
P@10    0.3900     0.3950 (1.3%)     0.4100 (5.1%)
R@30    0.4975     0.5240 (5.3%)     0.5040 (1.3%)
MAP     0.3316     0.3631∗ (9.5%)    0.3494 (5.4%)

Table 2: Retrieval performance for FP-BM25 (baseline), MonotonicBM25 and ParabolicBM25.

In the next section, we further investigate forum structure by analyzing forum category relevance feedback.

6.3 Does forum category relevance feedback help improve performance?

Forums are categorized into many sub-forums. On Healthboards.com, there are over two hundred sub-forums, each based on different medical symptoms. The intuition here is that, by boosting categories that appear often in the retrieved results, we can remove posts that may have appeared by chance. We exploited this characteristic using the proposed forum category relevance feedback approaches, FCUW and FCFW. Our results are shown in Table 3.
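The intuition of boosting frequent categories can be illustrated with a small re-ranking sketch. This is not the paper's definition of FCUW or FCFW, which is given in an earlier section. It assumes a multiplicative boost controlled by a weight `gamma`, with one hypothetical variant that boosts every category seen in the top results equally and another that boosts in proportion to how often the category occurs there. All names and defaults below are illustrative.

```python
from collections import Counter

def category_feedback_rerank(results, top_n=10, gamma=5.0, by_frequency=True):
    """Re-rank results by boosting sub-forum categories seen in the top results.

    results: list of (thread_id, score, category), sorted by score descending.
    by_frequency=False applies the same boost to every category seen in the
    top_n results; by_frequency=True scales the boost by the category's share.
    """
    counts = Counter(category for _, _, category in results[:top_n])
    seen = sum(counts.values())
    rescored = []
    for thread_id, score, category in results:
        if category in counts:
            share = counts[category] / seen if by_frequency else 1.0
            score = score * (1.0 + gamma * share)
        rescored.append((thread_id, score, category))
    return sorted(rescored, key=lambda r: r[1], reverse=True)
```

Under this sketch, a thread from a sub-forum that appears only once among the top results is pushed down relative to threads from a sub-forum that dominates them, which is the "appeared by chance" effect described above.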

Both FCUW and FCFW perform better than FP-BM25 in terms of P@5, and show a trend towards better MAP. On the other hand, FP-BM25 performs better on P@10 and R@30. This indicates that whether to use forum category relevance feedback depends on the metric one wants to improve.

        FP-BM25    +FCUW              +FCFW
P@5     0.4700     0.5200∗ (10.6%)    0.5100∗ (8.5%)
P@10    0.3900     0.3600 (−7.7%)     0.3700 (−5.1%)
R@30    0.4975     0.4678 (−7.0%)     0.4610 (−7.3%)
MAP     0.3316     0.3334 (0.5%)      0.3389 (2.2%)

Table 3: Retrieval performance for FP-BM25 (baseline), FP-BM25 + FCUW and FP-BM25 + FCFW.

6.4 Does incorporating medical semantic information help improve performance?

In the previous three sections, we proposed methods to exploit forum post and category structures. The next question we asked is how semantics can help performance. In particular, on informal text sources such as web forums, does boosting relevant entities yield better performance, or is it preferable to boost medically relevant sentences?

Among the semantic weighting methods, we observe that keyword weighting based only on medical entity extraction (FP-BM25 + MedEx) performs slightly worse than the baseline. On the other hand, the sentence level shallow extraction method (FP-BM25 + ShallowEx) performs the best among all three methods in terms of precision. This is because sentence level extraction also allows us to capture the importance of non-medical keywords which, in the non-technical language of forums, are useful in representing a medical case. Examples of this can be seen in Figure 5 and Figure 6. The ADEPT framework boosts only the medical entities and hence may not capture non-medical terms that are relevant to understanding the user’s query. Shallow extraction, on the other hand, boosts not only the medical terms but also their context, which improves retrieval performance.
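The difference between entity level and sentence level boosting can be made concrete with a sketch of sentence level query term weighting. It assumes that each sentence of the query post has already been labeled by a classifier as PE, MED or BKG, and that every term in a PE or MED sentence receives an additive boost (`alpha_pe`, `alpha_med`) on top of a base weight of 1. The exact weighting formula is defined earlier in the paper and may differ. The output uses the `term^boost` syntax of Lucene's classic query parser, and the function names are hypothetical.

```python
import re
from collections import defaultdict

def weighted_query_terms(labeled_sentences, alpha_pe=0.8, alpha_med=1.7):
    """Accumulate a weight per term from (sentence, label) pairs.

    Every term in a labeled sentence is boosted, not only medical entities,
    so context words in case related sentences also gain weight.
    """
    boost = {"PE": alpha_pe, "MED": alpha_med, "BKG": 0.0}
    weights = defaultdict(float)
    for sentence, label in labeled_sentences:
        for term in re.findall(r"[a-z0-9]+", sentence.lower()):
            weights[term] += 1.0 + boost[label]
    return dict(weights)

def to_boosted_query(weights):
    """Render the weights as a query string with per-term boosts."""
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(f"{term}^{weight:.2f}" for term, weight in ordered)
```

In this sketch a non-medical word such as "daily" in a MED sentence is boosted along with the drug name, whereas an entity-only approach would leave it at the base weight.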

        FP-BM25    +MedEx              +ShallowEx
P@5     0.4700     0.4600 (−2.1%)      0.5300∗ (12.7%)
P@10    0.3900     0.3100 (−20.5%)     0.4000 (2.6%)
R@30    0.4975     0.4283 (−13.9%)     0.4847 (−2.5%)
MAP     0.3316     0.2918 (−12.0%)     0.3481 (4.9%)

Table 4: Retrieval performance for FP-BM25 (baseline), FP-BM25 + MedEx and FP-BM25 + ShallowEx.

6.5 Does combining different methods help?

We further investigated combinations of methods to see if the proposed techniques are compatible with each other. In conducting this study, we chose the best performing approach for each of the hypotheses we raised. We initially combined MonotonicBM25 with the ShallowEx method. All four metrics improved over the baseline, although not all of the improvements were statistically significant. Next, we added forum category relevance feedback (FCFW) to see if the combined method performed significantly better than the baseline on all four metrics. Our results indicate that performance was indeed significantly better than the baseline on all four metrics when combining MonotonicBM25, ShallowEx, and FCFW. These results indicate that the different methods are compatible with one another. Results are shown in Table 5.

6.6 Parameter Analysis

In the previous sections, we looked at how different techniques can be used to improve retrieval performance. In this section, we analyze how sensitive performance is to the parameters. If a method is sensitive to its parameters, setting optimal values is difficult, and the method may not be very robust.

First we analyze the sensitivity of the two post weighting parameters βm and βp. For these experiments, we vary the value of the parameter being analyzed and evaluate performance on all 20 queries. The results are shown in Figure 7. We observe that when varying the two parameters in the expected range of values, the performance varies gradually rather than fluctuating. This suggests that the parameters are easy to set and that minor perturbations in their values are unlikely to significantly hurt performance.
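A sensitivity sweep of this kind reduces to a short loop. The sketch below assumes a hypothetical `metric_fn(value, query_id)` that runs retrieval with the parameter set to `value` and returns a per-query score such as average precision. `max_step` is an illustrative way to quantify "gradual" variation as the largest change between neighbouring parameter values.

```python
def sweep_parameter(values, query_ids, metric_fn):
    """Return [(value, mean metric over all queries)] for each parameter value."""
    curve = []
    for value in values:
        scores = [metric_fn(value, q) for q in query_ids]
        curve.append((value, sum(scores) / len(scores)))
    return curve

def max_step(curve):
    """Largest absolute change in the metric between adjacent parameter values."""
    return max(abs(b[1] - a[1]) for a, b in zip(curve, curve[1:]))

def best_value(curve):
    """Parameter value with the highest mean metric."""
    return max(curve, key=lambda point: point[1])[0]
```

A small `max_step` relative to the overall metric range corresponds to the smooth curves described here, while a large one would indicate a parameter that is hard to set.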

We also investigate the parameter sensitivity of the shallow information extraction approach, for αmed and αpe. The results are shown in Figure 8. The graphs show that these parameters are more sensitive than βm and βp. Both figures show an interesting trend: before the optimal value is reached (approximately 0.8 for αpe and 1.7 for αmed), performance improves drastically as more weight is assigned to these semantics. After this point, however, performance does not fall as rapidly.

Finally, we analyze the sensitivity of the forum category feedback weighting parameter γf. Similar to βm and βp, this parameter is not very sensitive. In particular, performance reaches a plateau at around γf = 5 and falls slowly afterwards. The performance graph is shown in Figure 9.

Figure 7: Performance variation for parameter βm (top) and βp (bottom) on all 20 queries.

Method                              P@5                P@10              R@30               MAP
FP-BM25                             0.4700             0.3900            0.4975             0.3316
FP-BM25 + ShallowEx                 0.5300∗ (12.7%)    0.4000 (2.6%)     0.4847 (−2.5%)     0.3481 (4.9%)
MonotonicBM25 + ShallowEx           0.5400∗ (14.9%)    0.4050 (3.8%)     0.5354∗ (7.6%)     0.3745∗ (12.9%)
MonotonicBM25 + ShallowEx + FCFW    0.5200∗ (10.6%)    0.4200∗ (7.7%)    0.5625∗ (13.1%)    0.3702∗ (11.6%)

Table 5: Results for multiple method combinations.

Figure 8: Performance variation for parameter αmed (top) and αpe (bottom) on all 20 queries.

Figure 9: Performance variation for parameter γf on all 20 queries.

7. CONCLUSION

In this paper we presented our work on an autonomous agent for healthcare forums. It resolves medical case-based queries present in the first post of unresolved threads by generating a response containing the top 5 threads that discuss medical cases most similar to the unresolved case. We explored a number of approaches to the problem that extend state of the art general retrieval methods for the task by incorporating semantic and thread structure based information.

The observations suggest that the task is feasible. Forum queries are formulated mainly by laypersons and hence contain non-technical language and a lot of background information unrelated to the case. As a result, even a state of the art medical entity extractor is unable to identify all of the important case related keywords. Instead, we need sentence level extraction of important case related sentences to separate them from background sentences. For our work, three classes (PE, MED and BKG) seemed sufficient for the task. The resulting method, when combined with the rank based monotonic post weighting scheme and forum category relevance feedback, achieved the best performance.

Although we explored medical forums in this paper, we believe that the idea of automatically resolving forum questions is general and potentially useful in many other domains. Utilizing both thread and forum structure is a general technique that is applicable to other types of web forums. Furthermore, for the purpose of retrieving relevant forum threads, focusing on classifying sentence types rather than on identifying entities may be useful in other domains as well.

Our work can be extended by investigating how modeling intents can help improve case retrieval performance. There has been research on analyzing the types of questions that clinicians [11] and patients [5] often ask in clinical settings. Extending these lines of work to model what types of medical questions are asked in medical forums could help retrieval performance and better aid users. Studying frequent medical forum post patterns may be another interesting future direction. We have exploited only the position of posts to determine their weight. An in-depth analysis that models user interactions may further improve retrieval performance.

8. ACKNOWLEDGMENTS

This work is supported in part by the National Science Foundation under Grant Number CNS-1027965. We would also like to thank the anonymous reviewers for their invaluable feedback.

9. REFERENCES

[1] The metamap transfer toolkit. http://mmtx.nlm.nih.gov/.

[2] A. Aamodt and E. Plaza. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun., 7(1):39–59, Mar. 1994.

[3] E. Agichtein, Y. Liu, and J. Bian. Modeling information-seeker satisfaction in community question answering. ACM Trans. Knowl. Discov. Data, 3(2):10:1–10:27, Apr. 2009.

[4] S. Bhatia and P. Mitra. Adopting inference networks for online thread retrieval. Proceedings of the 24th AAAI, pages 1300–1305, 2010.

[5] C. R. Boot and F. J. Meijman. Classifying health questions asked by the public using the ICPC-2 classification and a taxonomy of generic clinical questions: an empirical exploration of the feasibility. Health Commun, 25(2):175–181, Mar 2010.

[6] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 243–250, New York, NY, USA, 2008. ACM.

[7] B. W. Chee, R. Berlin, and B. Schatz. Predicting adverse drug events from personal health messages. AMIA Annu Symp Proc, 2011:217–226, 2011.

[8] K. Chen, T. Chen, G. Zheng, O. Jin, E. Yao, and Y. Yu. Collaborative personalized tweet recommendation. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’12, pages 661–670, New York, NY, USA, 2012. ACM.

[9] J. H. D. Cho, V. Q. Liao, Y. Jiang, and B. R. Schatz. Aggregating Personal Health Messages for Scalable Comparative Effectiveness Research. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB ’13. ACM, 2013.

[10] J. L. Elsas and J. G. Carbonell. It pays to be picky: an evaluation of thread retrieval in online forums. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09, pages 714–715, New York, NY, USA, 2009. ACM.

[11] J. W. Ely, J. A. Osheroff, P. N. Gorman, M. H. Ebell, M. L. Chambliss, E. A. Pifer, and P. Z. Stavri. A taxonomy of generic clinical questions: classification study. BMJ, 321(7258):429–432, Aug 2000.

[12] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 363–370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[13] J. Hannon, M. Bennett, and B. Smyth. Recommending twitter users to follow using content and collaborative filtering approaches. In Proceedings of the fourth ACM conference on Recommender systems, RecSys ’10, pages 199–206, New York, NY, USA, 2010. ACM.

[14] M. Hua, M. Alfi, and P. Talbot. Health-related effects reported by electronic cigarette users in online forums. J. Med. Internet Res., 15(4):e59, 2013.

[15] J. Jeon, W. B. Croft, and J. H. Lee. Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM ’05, pages 84–90, New York, NY, USA, 2005. ACM.

[16] Y. Jiang, Q. V. Liao, Q. Cheng, R. B. Berlin, and B. R. Schatz. Designing and evaluating a clustering system for organizing and integrating patient drug outcomes in personal health messages. AMIA Annu Symp Proc, 2012:417–426, 2012.

[17] V. Jijkoun and M. de Rijke. Retrieving answers from frequently asked questions pages on the web. In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM ’05, pages 76–83, New York, NY, USA, 2005. ACM.

[18] K. Sparck Jones and C. J. van Rijsbergen. Report on the need for and provision of an “ideal” information retrieval test collection. Technical Report British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge, 1975.

[19] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 661–670, New York, NY, USA, 2010. ACM.

[20] N. Limsopatham, C. Macdonald, R. McCreadie, and I. Ounis. Exploiting term dependence while handling negation in medical search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’12, pages 1065–1066, New York, NY, USA, 2012. ACM.

[21] N. Limsopatham, C. Macdonald, and I. Ounis. Inferring conceptual relationships to improve medical records search. In Proceedings of the 10th Conference on Open Research Areas in Information Retrieval, OAIR ’13, pages 1–8, Paris, France, 2013.

[22] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th international conference on Intelligent user interfaces, IUI ’10, pages 31–40, New York, NY, USA, 2010. ACM.

[23] Q. Liu, E. Agichtein, G. Dror, Y. Maarek, and I. Szpektor. When web search fails, searchers become askers: understanding the transition. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’12, pages 801–810, New York, NY, USA, 2012. ACM.

[24] Y. Lv and C. Zhai. A comparative study of methods for estimating query language models with pseudo feedback. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1895–1898, New York, NY, USA, 2009. ACM.

[25] D. L. Maclean and J. Heer. Identifying medical terms in patient-authored text: a crowdsourcing-based approach. J Am Med Inform Assoc, May 2013.

[26] M. McCandless, E. Hatcher, and O. Gospodnetic. Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT, USA, 2010.

[27] K. Pattabiraman, P. Sondhi, and C. Zhai. Exploiting forum thread structures to improve thread clustering. ICTIR ’13, 2013.

[28] G. Quellec, M. Lamard, L. Bekri, G. Cazuguel, C. Roux, and B. Cochener. Medical case retrieval from a committee of decision trees. IEEE Trans Inf Technol Biomed, 14(5):1227–1235, Sep 2010.

[29] S. E. Robertson and K. Sparck Jones. Document retrieval systems. chapter Relevance weighting of search terms, pages 143–160. Taylor Graham Publishing, London, UK, 1988.

[30] S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track.

[31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. ACM.

[32] J. Seo, W. B. Croft, and D. A. Smith. Online community search using thread structure. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 1907–1910, New York, NY, USA, 2009. ACM.

[33] A. Singh, D. P, and D. Raghu. Retrieving similar discussion forum threads: a structure based approach. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’12, pages 135–144, New York, NY, USA, 2012. ACM.

[34] P. Sondhi, M. Gupta, C. Zhai, and J. Hockenmaier. Shallow information extraction from medical forum data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, pages 1158–1166, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[35] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW 08. IEEE Computer Society Conference on, pages 1–8. IEEE, June 2008.

[36] X. Xue, J. Jeon, and W. B. Croft. Retrieval models for question and answer archives. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, pages 475–482, New York, NY, USA, 2008. ACM.