a method for mining infrequent causal associationsand its application in finding adverse drug...

14
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 A Method for Mining Infrequent Causal Associations and Its Application in Finding Adverse Drug Reaction Signal Pairs Yanqing Ji, Hao Ying, Fellow, IEEE, John Tran, Peter Dews, Ayman Mansour, R. Michael Massanari Abstract—In many real-world applications, it is important to mine causal relationships where an event or event pattern causes certain outcomes with low probability. Discovering this kind of causal relationships can help us prevent or correct negative outcomes caused by their antecedents. In this paper, we propose an innovative data mining framework and apply it to mine potential causal associations in electronic patient datasets where the drug-related events of interest occur infrequently. Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was employed to rank the potential causal associations between each of the three selected drugs (i.e., enalapril, pravastatin and rosuvastatin) and 3,954 recorded symptoms, each of which corresponded to a potential ADR. The top 10 drug-symptom pairs for each drug were evaluated by the physicians on our project team. The numbers of symptoms considered as likely real ADRs for enalapril, pravastatin and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of our method in finding potential ADR signal pairs for further analysis (e.g., epidemiology study) and investigation (e.g., case review) by drug safety professionals. Index Terms— Adverse drug reactions, association rules, data mining algorithms, interestingness measure, recognition primed decision model. —————————— —————————— 1 INTRODUCTION INDING causal associations between two events or sets of events with relatively low frequency is very useful for various realȬworld applications. For examȬ ple, a drug used at an appropriate dose may cause one or more adverse drug reactions (ADRs), although the probȬ ability is low. Discovering this kind of causal relationȬ ships can help us prevent or correct negative outcomes caused by its antecedents. However, mining these relaȬ tionships is challenging due to the difficulty of capturing causality among events and the infrequent nature of the events of interest in these applications. In this paper, we try to employ a knowledgeȬbased approach to capture the degree of causality of an event pair within each sequence since the determination of cauȬ sality is often ultimately application or domain depenȬ dent. We then develop an interestingness measure that incorporates the causalities across all the sequences in a database. Our study was motivated by the need of discoȬ vering ADR signals in postȬmarketing surveillance, even though the proposed framework can be applied to many different applications. ADRs represent a serious worldȬ wide problem [1, 2]. They can complicate a patient’s medȬ ical condition or contribute to increased morbidity, even death. Studies have shown that ADRs contribute to about 5% of all hospital admissions and represent the 5 th comȬ monest cause of death in hospitals [3]. Even though preȬmarketing clinical trials are required for all new drugs before they are approved for marketing, these trials are necessarily limited in sampleȬsize and duȬ ration, and thus are not capable of detecting rare ADRs. In general, an ADR cannot be recognized by these trials if its occurrence rate is less than 0.1% [1].Therefore, drug safety depends heavily on postȬmarketing surveillance; that is, the monitoring of impacts of medicines once they have been made available to consumers. In the U.S., curȬ rent postȬmarketing surveillance methods primarily rely on the FDA’s spontaneous reporting system MedȬ Watch TM . Because ADR reports are filed at the discretion of the users of the system, there is gross underreporting xxxx-xxxx/0x/$xx.00 © 200x IEEE ———————————————— x Y. Ji is with the Department of Electrical and Computer Engineering, Gonzaga University, Spokane, WA 99258 USA Email: [email protected]. x H. Ying is with the Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, USA Email: [email protected]. x J. Tran is with the Spokane Mental Health, Spokane, WA 99202, USA Email: [email protected]. x P. Dews is with the Department of Medicine, St. Mary Mercy Hospital (Trinity Health), Livonia, MI 48154, USA Email:dewsp@trinityȬealth.org. x A. Mansour is with Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, USA Email: [email protected]. x R. Michael Massanari is with the Research for the Critical Junctures InstiȬ tute, Bellingham, WA 98225, USA Email: [email protected]. Manuscript received (insert date of submission if desired). F Digital Object Indentifier 10.1109/TKDE.2012.28 1041-4347/12/$31.00 © 2012 IEEE This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Upload: rubika-shanmugham

Post on 20-Oct-2015

52 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

A Method for Mining Infrequent Causal Associations and Its Application in Finding

Adverse Drug Reaction Signal PairsYanqing Ji, Hao Ying, Fellow, IEEE, John Tran, Peter Dews, Ayman Mansour, R. Michael

Massanari

Abstract—In many real-world applications, it is important to mine causal relationships where an event or event pattern causes certain outcomes with low probability. Discovering this kind of causal relationships can help us prevent or correct negative outcomes caused by their antecedents. In this paper, we propose an innovative data mining framework and apply it to mine potential causal associations in electronic patient datasets where the drug-related events of interest occur infrequently. Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was employed to rank the potential causal associations between each of the three selected drugs (i.e., enalapril, pravastatin and rosuvastatin) and 3,954recorded symptoms, each of which corresponded to a potential ADR. The top 10 drug-symptom pairs for each drug were evaluated by the physicians on our project team. The numbers of symptoms considered as likely real ADRs for enalapril, pravastatin and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of our method infinding potential ADR signal pairs for further analysis (e.g., epidemiology study) and investigation (e.g., case review) by drugsafety professionals.

Index Terms— Adverse drug reactions, association rules, data mining algorithms, interestingness measure, recognition primed decision model.

—————————— ——————————

1 INTRODUCTION

INDING causal associations between two events orsets of events with relatively low frequency is veryuseful for various real world applications. For exam

ple, a drug used at an appropriate dose may cause one ormore adverse drug reactions (ADRs), although the probability is low. Discovering this kind of causal relationships can help us prevent or correct negative outcomescaused by its antecedents. However, mining these relationships is challenging due to the difficulty of capturingcausality among events and the infrequent nature of theevents of interest in these applications.

In this paper, we try to employ a knowledge basedapproach to capture the degree of causality of an event

pair within each sequence since the determination of causality is often ultimately application or domain dependent. We then develop an interestingness measure thatincorporates the causalities across all the sequences in adatabase. Our study was motivated by the need of discovering ADR signals in post marketing surveillance, eventhough the proposed framework can be applied to manydifferent applications. ADRs represent a serious worldwide problem [1, 2]. They can complicate a patient’s medical condition or contribute to increased morbidity, evendeath. Studies have shown that ADRs contribute to about5% of all hospital admissions and represent the 5th commonest cause of death in hospitals [3].

Even though pre marketing clinical trials are requiredfor all new drugs before they are approved for marketing,these trials are necessarily limited in sample size and duration, and thus are not capable of detecting rare ADRs.In general, an ADR cannot be recognized by these trials ifits occurrence rate is less than 0.1% [1].Therefore, drugsafety depends heavily on post marketing surveillance;that is, the monitoring of impacts of medicines once theyhave been made available to consumers. In the U.S., current post marketing surveillance methods primarily relyon the FDA’s spontaneous reporting system MedWatchTM. Because ADR reports are filed at the discretionof the users of the system, there is gross underreporting

xxxx-xxxx/0x/$xx.00 © 200x IEEE

————————————————

Y. Ji is with the Department of Electrical and Computer Engineering,Gonzaga University, Spokane, WA 99258 USA Email: [email protected]. Ying is with the Department of Electrical and Computer Engineering,Wayne State University, Detroit, MI 48202, USA Email:[email protected]. Tran is with the Spokane Mental Health, Spokane, WA 99202, USAEmail: [email protected]. Dews is with the Department of Medicine, St. Mary Mercy Hospital(Trinity Health), Livonia, MI 48154, USA Email:dewsp@trinity ealth.org.A. Mansour is with Department of Electrical and Computer Engineering,Wayne State University, Detroit, MI 48202, USA Email:[email protected]. Michael Massanari is with the Research for the Critical Junctures Institute, Bellingham, WA 98225, USA Email: [email protected].

Manuscript received (insert date of submission if desired).

F

Digital Object Indentifier 10.1109/TKDE.2012.28 1041-4347/12/$31.00 © 2012 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 2: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

[4, 5]. Consequently, the current approach may requireyears to identify and withdraw problematic drugs fromthe market, and result in unnecessary mortality, morbidity, and cost of healthcare. Studies have shown that onlyhalf of newly discovered serious ADRs are detected anddocumented in the Physician’s Desk Reference within 7years after drug approval [6].

Systematic methods for the detection of suspectedsafety problems from spontaneous reports have been studied and practically implemented. For example, the FDAcurrently adopts a data mining algorithm called Multiitem Gamma Poissson Shrinker [7] for detecting potentialsignals from its spontaneous reports. Another importantsignal detection strategy is known as the Bayesian Confidence Propagation Neural Network that has been used bythe Uppsala Monitoring Center in routine pharmacovigilance with its World Health Organization database [8].Various other methods such as proportional reportingratios [9], empirical Bayes screening [10], and reportingodds ratios [11] have been used in the spontaneous reporting centers of other nations (e.g., England and Australian). These methods have shown better performancethan traditional methods. However, the performance ofthese techniques could be highly situation dependent dueto the weaknesses and potential biases inherent in spontaneous reporting [12].

As electronic patient records become more and moreeasily accessible in various health organizations such ashospitals, medical centers, and insurance companies, theyprovide a new source of information that has great potential to generate ADR signals much earlier [13, 14]. Notethat each patient case can be considered as an event sequence where various events such as drug prescription,occurrence of a symptom and lab test occur at differenttimes. In the literature, there exist a couple of studies [14,15] that attempted to find the associations between drugsand potential ADRs by mining their temporal relationships. That is, they tried to mine temporal associationrules (represented as TX Y) where Y occurs after Xwithin a time window of length T . These studies obtained promising results based on administrative healthdata. However, temporal association was the only parameter used for linking a symptom with a drug withineach patient case in their work. Temporal association assumes that cause precedes effect. Other parameters suchas dechallenge and rechallenge can also give direct orindirect cues of the potential causal association of a drugsymptom pair. Dechallenge is defined as the relationshipbetween withdrawal of the drug and abatement of theadverse effect. Rechallenge describes the relationship between re introduction of the drug followed by recurrenceof the adverse event. In addition, their approaches sufferfrom the sharp boundary problem. On the one hand, thesymptom events near the time boundaries are either ignored or overemphasized. On the other hand, two symptom events contribute equally to the interestingnessmeasure as long as they occur within the hazard period T.That is, the length of the time duration between exposureto the drug and occurrence of the symptom has no effecton the interestingness measure. This is not true in reality

because if an ADR symptom occurs within a shorter period, it is usually more likely to be caused by the drug.

To more effectively mine infrequent causal associations, it is necessary to develop a new data miningframework. The current paper is a substantial extensionof our previous work [16, 17] where an interestingnessmeasure called causal leverage was developed on the basisof a computational fuzzy recognition primed decision(RPD) model [18] we previously developed . In thepresent paper, we focus on mining infrequent causal associations. We significantly extended our previous work inboth theories and experiments:

(1) We developed and incorporated an exclusion mechanism that can effectively reduce the undesirable effectscaused by frequent events. Our new measure is namedexclusive causal leveragemeasure.

(2) We proposed a data mining algorithm to mine ADRsignal pairs from electronic patient database based on thenew measure. The algorithm’s computational complexityis analyzed.

(3) We compared our new exclusive causal leveragemeasure with our previously proposed causal leveragemeasure as well as two traditional measures in the literature: leverage and risk ratio.

(4) To establish the superiority of our new measure, wedid extensive experiments. In our previous work, wetested the effectiveness of the causal leverage measure using a single drug in the experiment. In the current paper,we selected three drugs and evaluated the top 10 ICD 9(International Classification of Diseases, 9th Revision)codes ranked by the exclusive causal leverage measure foreach drug. We also tested how the length of hazard period T affects the performance of the exclusive causalleveragemeasure.

2 RELATED WORK IN THE LITERATURE

Discovering the association between two event sets is anactive and important area of data mining research [19 22].Many measures have been proposed to mine associationrules in the form of X Y , where X and Y are twoevent sets and they are disjoint (i.e., X Y ). For instance, the support and confidence measures were the original interestingness measures proposed for associationrules [23]. The support of an association rule supp( X Y )is the proportion of sequences in which both X and Y occur at least once, among all the event sequences. The confidence of an association rule is defined as conf( X Y ) =supp( X Y ) /supp( X ), where supp( X ) is the proportion of sequences that contain X . Given these twomeasures, the association rule mining problem can beformalized as finding those rules whose support and confidence are greater than pre specified thresholds minsuppand minconf, respectively. In the literature, researchershave also proposed various other interestingness measures such as IS [24], Klosgen’s measure [25], weightedrelative accuracy [26], and so on. Originated from areassuch as probability, statistics or information theory, thesetraditional interestingness measures are primarily designed for finding frequent association rules. Thus, they

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 3: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 3

have limited capability in discovering patterns that occurinfrequently.

There exist some works on rare eventset (or itemset)mining in the literature [27 35]. A straightforward approach to discovering infrequent eventsets is to relax theuniform minimal support criterion (i.e. minsupp) as indicated in [27]. One drawback of this mining approach isthe extremely high computational cost [28, 29, 36]. In addition, if this approach was used to discover rare eventslike ADRs, thresholds minsupp and minconf would have tobe set very small, possibly leading to a lot of false associations [14].

A couple of studies attempted to use a less uniformsupport criteria [28 30]. Yun et al. adopted a relative support (a rate against the relative frequency of the data existing in a database) approach [29], while Liu and his colleagues proposed a method that relied on multiple minimal support thresholds specified by users [28]. Theseapproaches output all frequent eventsets and associationrules together with a subset of all infrequent ones.

Several other studies explored rare eventsets (rules)with support lower than the threshold [31 35]. These studies implemented different strategies to traverse thepower set lattice of a dataset. For example, some of themtake a bottom up approach to moving across the lattice[31, 32, 35], while another algorithm named Rarity (proposed by Troiano et al.) takes a top down method [33].Koh and Rountree proposed an algorithm called AprioriInverse where sporadic rules are discovered by simplydiscarding all eventsets above a support threshold [34].

The above approaches to detection of rare associationrules are based on traditional interestingness measureslike support and confidence. One pitfall of these measuresis that they simply find the statistical correlation betweenX and Y . For example, the confidence of an associationrule X Y determines how frequently Y appears inthose event sequences that contain X . That is, traditionalmeasures like confidence try to find the statistical significance of the co existence of two eventsets. They do notindicate any temporal relationship between X and Y . Inaddition, they are not able to capture the causal relationships between two event sets. They do not specify whether X causes Y , or vice versa, or a third event causes theco existence of X and Y . Hence, statistical associationsestablished by traditional measures may or may notrepresent temporal or causal associations.

Causal modeling and inference have been widely studied in the field of machine learning and traditionalprobability and statistics theory [37 39]. Compared withstatistical co existence, causal relationships have a coupleof unique properties. First, there is an intrinsic asymmetryin a cause effect relationship. That is, when saying eventX causes event Y , one would not expect X to be influenced by Y . Second, the concept of causality is linked totime dependencies. That is, the causes must precede theireffects. Researchers have proposed various models tomodel causal relationships for the purpose of predictionor data modeling (i.e. estimation of a probability law thathas generated the data). These models include artificialneural networks [40], Markov models [41], various graph

ical models [38, 39], etc. One of the most famous models isPearl’s causal model where causal modeling and inferences are based on representations by means of DirectedAcyclic Graphs (DAGs) [38]. For a more extensive anddetailed review about the state of the art of causality,readers are referred to a recent survey paper [37].

With a surge of interest in association rule mining inrecent years, some of the above models have been borrowed to mine the causal relationships between twoevents or eventsets. For instance, Bayesian network andits variants have been adapted to mine causal structuresin a couple of studies [42, 43]. One criticism of Bayesiananalysis is its requirement to specify prior distributionsfor all variables, which can be very difficult since, inmany cases, prior knowledge is vague or non existent, orcan be tedious if the number of variables is large [37, 38].Silverstein, et al., proposed a constraint based algorithmthat narrowed down the search space and thus made themining process more scalable [39]. In addition, HiddenMarkov Models (HMMs) have been utilized to predictprotein structures and analyze genome sequences basedon massive amounts of observed data [44, 45]. However,these approaches are inherently probability based anddesigned to find frequent patterns. Thus, they have limited capability in discovering infrequent associations.

3 BACKGROUND ON FUZZY RPD MODEL

The RPD model [40, 46] represents a popular cognitivedecision model. It is particularly useful for modeling howhuman experts make decisions based on their prior experiences. It was found that about 50% to 80% of all decisions were made in this way [41]. The original RPD modelis descriptive and is not directly implementable on acomputer. Hence, we developed a fuzzy RPD model [18],which is not only computational but also capable of handling vague and subjective information using fuzzy logic[47, 48].

Experiences play a key role in the RPD model. TheADR detection experiences were acquired through thejoint efforts of our engineering and medical team members after careful analysis of the relevant literature. According to the classification scheme in [49], a particularpattern of cue values characterizes a specific degree ofcausality which may require certain courses of action tohandle the ADR. Therefore, we can define four experiences, each of which is associated with a degree of causality (i.e., very likely, probable, possible and unlikely).These experiences were stored in an experience knowledge base. Fig. 1 presents a sample experience illustratedin natural language for easier understanding.

As shown, an experience consists of four componentscues, goals, actions, and expectancies. Cues represent thehigher level information (synthesized from elementary orenvironmental data) that a decision maker must pay attention to. Expectancies describe what will happen nextas the current situation continues to evolve in a changingcontext. Goals represent an end state that the decisionmaker is trying to achieve. Actions represent a set of potential decisions that the decision maker can take in the

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 4: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

4 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

current situation. Cues are used to match the current situation with prior experiences and determine which experience can be reused to solve a new problem. This sampleexperience has four cues: temporal association, dechallenge, rechallenge, and other explanation. The first threecues have been explained in Introduction. Other explanations denote alternative explanations by concurrent disease or other drugs. The type of cue could be quantitative,nominal or fuzzy in the proposed computational fuzzyRPD model. For instance, the cue ‘temporal association’may have fuzzy values (e.g. unlikely, possible, likely).The weights for these cues are design parameters and areassigned by domain experts. Table 1 shows how the fourcues are related to degree of causality of a signal pair inthe four experiences. These mappings were given by thephysicians in our research project. For instance, whencues temporal association, rechallenge and dechallengehave fuzzy cue values possible, unlikely, and possible, respectively, and there is no other explanations, this cuevalue pattern represents a possible causal association between the drug of interest and the suspected ADR fromthe perspective of a physician. In general, these mappingsare determined by domain experts.

After representing the experiences, the next step is toextract the cue values from elementary data for the current situation (e.g. a new patient case in which a drugsymptom pair need to be evaluated). Fuzzy rules andfuzzy reasoning are used to achieve this task. For example, to obtain the fuzzy value of temporal association, oneof the fuzzy rules of the type “if the time duration between taking the drug and the occurrence of the adverseevent is short, then temporal association is likely” is used.An embedded fuzzy inference engine is employed to perform fuzzy reasoning. The inference engine is what drivesthe RPD model, updating cues once new information isdetected, and monitoring expectancies and goals. Afterthe cue values of the current situation are known, similarity measures are applied to measure the degree of matching between the current situation and prior experiences.The actions in the most matching experience are thenused to solve the current problem. For more details of thefuzzy RPD model as well as concrete examples, the readeris referred to our previous work [18].

The fuzzy RPD model was preliminarily validated inour previous study [18] by using it to calculate the extent

of causality between cisapride and some of its adverseeffects for 100 simulated patients created based on theprofiles of 1,015 patients of the VA Medical Center. Thesimulated patients were used because we found that thenumber of apparent signal pairs of potential interestingsymptoms (i.e., cisapride and Torsades de Pointe and/orsyncope) in the real patient data was very limited. Therefore, we used the real patients to create simulated patientcases, all of which containing drug symptom pairs of interest with various degrees of causality. The model’s validity was then established by comparing the decisionsmade by the model and those by two independent experienced physicians for the 100 simulated patients. Thelevels of agreements were measured by the weightedKappa statistic, which is a measure of agreement betweentwo raters after chance agreement is controlled. The results suggested good to excellent agreements [50].

4 DEVELOPING A NEW MEASURE AND APPLYING IT TO SEARCHING ADR SIGNAL PAIRS

In this section, we first introduce a new measure we recently developed. We then discuss how to mine potentialADRs from electronic patient records using this measure.

4.1 A New Exclusive Causal-Leverage Measure We use <X,Y> and C<X,Y> to represent a pair of events andthe degree of causality of the pair in a sequence, respectively. C<X,Y> is calculated by matching the cue values extracted from a sequence with the cue values of the defined experiences, each of which corresponds to a uniquecategory of causality (i.e., very likely, probable, possible,and unlikely). Specifically, we use a vectorV=(c1,c2,…,ci,…,cm) to represent the set of cue values extracted from an event sequence, where ci represents thevalue of ith cue and m is the total number cues. Similarly,we use V’ = (c1’,c2’,…,ci’,…,cm’) to represent the set of cuevalues in an experience. We define the similarity betweena pair of cue values ci and ci’ as local similarity SL(ci, ci’)whose calculation depend on the type of the cue (i.e.quantitative, nominal or fuzzy); that is,

Fig. 1. A sample experience.

TABLE 1RELATING CUES TO CAUSALITY CATEGORY OF A SINGLE PAIR

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 5: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 5

,1

m

X Y i ii

C w

'

''

'

'

0,

( , ),( , )

1 _ ( , ),

_ ( , ),

i i

i iL i i

i i

i i

if c or c is unknown

overlap c c if cue i is nominalS c c

normalized diff c c if cue i is quantitative

fuzzy sim c c if cue i is fuzzy value

(1)

In the above definition, the cue similarity is set 0 if either of the cue values is unknown. The function overlap()is defined as:

'' 1,

( , )0,

i ii i

if c coverlap c c

otherwise (2)

That is, for nominal cue values, the similarity is 1 if thecue values are equal; otherwise it is 0. Computing thesimilarity for quantitative cue values involves distancemeasure normalized_diff :

''_ ( , ) i i

i ii

c cnormalized diff c c (3)

where i i ia b is employed to normalize the cue difference, and ia and ib are the maximum and minimumvalues for cue i, respectively.

The last expression in the above formula,'_ ( , )i ifuzzy sim c c , deals with cues represented by fuzzy

sets. Let cue i be defined on the universe of discourse Xand x be an element of X. In this case, fuzzy cue values

ic and ic are two fuzzy sets whose membership functions are defined in terms of x . Their similarity could bedefined on the basis of possibility measures on the [0, 1]interval [51]:

', , , 0.5

_ ( , )1.5 , , ,

i i i i

i i

i i i i

poss c c if poss c cfuzzy sim c c

poss c c poss c c otherwiese(4)

where

', max min ( ), ( )i i

i i c cxposs c c x x x X (5)

Here, ( )ic x and ' ( )

icx are the membership functions of

fuzzy set ic and ic , respectively. The min() and max()are standard fuzzy operators, and hence

'max min ( ), ( )i i

c cxx x computes the maximum of all the

minimums between pairs ( )ic x and ' ( )

icx for all the

elements x . ic is the complement of ic (i.e.,( ) 1 ( )

iicc

x x ).The similarity between two sets of cue values V and V’

is named as global similarity SG(V, V’). It is defined as theweighted sum of all the local similarities with respect toeach pair of cue values. That is,

'

1

1

( , ),

n

i L i ii

G n

ii

w S c cS V V

w(6)

where 0,1iw is the weight for cue i, which representsthe relative significance of the cue and is assigned by theuser.

The above global similarity is used to find the mostmatching experience whose associated causality categorycan characterize the causal association of a pair of interestin a particular event sequence. We use this approach toobtain a similarity value between the current pair andeach of the experiences. After that, these similarity valuesare normalized so that their sum is equal to 1. These normalized values are then used to represent the membership values of corresponding categories for the pair ofinterest in a particular sequence. For example, if the normalized similarity value between a pair and the experience exhibiting “possible” association is 0.6, we wouldsay the causality of the pair is “possible” with a membership value of 0.6 in that sequence. If we use d and s torepresent a drug and a symptom, respectively, Table 2gives a couple of sample drug symptom pairs as well astheir membership values related to each causality category for the patient cases that contain the pairs. Note that,given a particular drug symptom pair, there may existnone, one, or multiple patient cases that contain this pair.On the other hand, the same patient case may containmultiple different pairs. Patient 4 in the table containsboth <d1, s2> and < d3, s5>. C<X,Y> within one patient case isequal to the weighted sum of the membership values ofall causality categories.

In general, if there exist m experiences that classifiesthe causality between X and Y into m distinctive categories, the degree of causality is defined as

(7)

where i is the membership of the thi causality categoryfor the pair, and wi represent the corresponding weightwhen converging all the causality categories into one.Moreover,

The selection of wi is based on two considerations. First,causality categories representing stronger causal associations should have higher weights. That is, if we assumethe causality levels represented by m experience are in a

TABLE 2MEMBERSHIP OF EACH CAUSALITY CATEGORY FOR SAMPLE

SIGNAL PAIRS (PID MEANS PATIENT IDENTIFICATION NUMBER)

11

m

ii

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 6: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

6 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

decreasing order, then wm > > w2 > w1 must be satisfied,where wm and w1 correspond to the highest and lowestcausality levels, respectively. Second, the range of C<X,Y>should be [0, 1]. That is, C<X,Y> should be 0 for the extremesituation ={0, 0,1} where the evidence in a sequencestrongly shows “unlikely” association of the pair. If all theevidence in the patient supports “very likely” association(i.e. ={1,0 0}), C<X,Y> should be 1. Otherwise, C<X,Y> isbetween 0 and 1. In general, the weight for the thi causality category is calculated by

(8)

where 1m i . For example, if there are 4 causality categories (i.e. m=4), then the set of weights for them wouldbe 1.0, 0.667, 0.333, 0.0 . One can see that the computedC<X,Y> using these weights satisfies the above two criterions.

We introduce causal association rule (CAR), denotedby CX Y , meaning that X potentially causes Y . Wedefine the support of a CAR, supp( CX Y ), as the accumulated votes over all sequences. The vote of the thj sequence is represented by C j<X,Y>. That is

supp( CX Y ) = (9)

where n is the number of sequences that contain the pair.N is the total number of sequences in the database. Theadvantage of this definition is that the degree of causalityis incorporated into the measure such that sequences exhibiting higher causality will contribute more to themeasure.

Based on the above definition of the support of a CAR,we define an interestingness measure called causalleverage:

causal leverage( CX Y )= supp ( CX Y ) supp ( CX ) supp ( Y ) (10)

where supp ( CX ) is the proportion of sequences whosevotes are greater than 0 with respect to the pair <X,Y>.supp( Y ) is the proportion of sequences that containY .This measure extends the traditional leverage measure(i.e., leverage( X Y )=supp( X Y ) supp( X ) supp( Y))by incorporating the degree of causality of the pairamong all sequences into it.

With better capability of capturing causal associations,the causal leverage measure can be employed to mine thecausal association of pair <X,Y> given a collection of eventsequences. However, its ability of discovering the causalassociations between drugs and their potential ADRs isstill limited because of the complexity and special characteristics of electronic patient data. Recall that the degreeof causal association of a pair within a particular patientcase is determined by four cues: temporal association,rechallenge dechallenge, and other explanations. One cansee that three of them are time related. For a particular

pair, their values can be abstracted from a specific patientcase using fuzzy rules and these values collectively determine the degree of causal association of the pair in thepatient case. However, the obtained cue values may notreally represent the actual causal association of the pairdue to the complexity of the electronic patient data. Oneof the complexities is that frequently prescribed drugs areeasily temporally associated with any symptom by coincidence. For example, if the hazard period T is equal to 90days and the start date of a drug appears once every twomonths within the same case, then the drug has temporalassociation with any symptom existed in the patient case.Similarly, a particular drug tends to coincidently formtemporal association with symptoms (e.g. fever) causedby common diseases. Note that symptoms of a potentialADR cannot be differentiated from those of an underlyingdisease, both of which are encoded using the same ICD 9codes in electronic patient databases. For a drugsymptom pair, the associated symptom is possibly causedeither by a drug or by a disease after the drug is started. Ifboth a drug and a symptom have high frequencies in ahealth database, they can even easily form rechallenge bycoincidence. As a result, the obtained causal leverage valuecan be high for the pair even though this value does notimply any degree of causality. Therefore, it is necessary tofind an exclusion mechanism to reduce the undesirableeffects caused by high frequent events (e.g., either drugsor symptoms).

Intuitively, if a high frequent event Y randomly occursafter another event X within time period T, then itshould have the same chance to also occur before Xwithin the same period T among many event sequences.Case1 in Fig. 2 demonstrates a pattern where the symptom s1 occurs before the drug d1 within time period T. Areasonable explanation of this pattern is that the underlying disease of the symptom s1 causes the prescription ofthe drug d1. Case2 indicates a pattern where the symptoms1 occurs after the drug d1 within time period T. If we lookat Case2 only, a reasonable explanation could be that thesymptom s1 represents a possible ADR caused by thedrug d1 since they exhibit reasonable temporal association. However, if we look at these two patient cases to

Fig. 2. Illustration of patient cases that exhibit different types of event patterns.

11i

iwm

,1

nj

X Yj

C

N

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 7: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 7

gether, it is more likely that the symptom s1 simply occursrandomly. In this case, both the pair <d1, s1> and its reversed pair <s1, d1> exhibit temporal associations in various patient cases. In some cases, the above two patternscan coexist in the same patient case as shown in Case3. Forsuch a pattern, there exist several possible explanations.First, d1 is caused by the first s1 and the second s1 occurscoincidently. Second, the second s1 is caused by d1 and thefirst s1 occurs coincidently. Third, s1 and d1 simply occurrandomly without any causal association. A more realisticpatient case can even exhibit a more complex pattern asshown in Case4. It denotes that d1 has temporal and rechallenge relationship with s1 since their temporal association occurs twice along the time. However, this patterncan be explained differently. For example, the second d1occurs after the first s1, which can be explained as the underlying disease of the symptom causing the prescriptionof the drug. The temporal association between s1 and d1also occurs twice and they also exhibit rechallenge relationship. Considering all these four types of patterns together, if drugs and/or symptoms occur randomly andfrequently, they may falsely form temporal associationand rechallenge relationships. However, the number ofcases exhibiting these relationships for a pair <drug, symptom> and that for its reversed pair (i.e., <symptom, drug>)should be similar.

Given the above observations, we define a reverse causal leverage of CAR, denoted as reverse causalleverage( CX Y ), with respect to pair <X,Y>.

reverse causal leverage( CX Y )= causal leverage( CY X )= supp ( CY X) supp ( CY ) supp ( X ) (11)

where supp ( CY X ) is the support of the CAR CY X .supp ( CY ) is the proportion of sequences whose votesare greater than 0 with respect to the pair <Y,X>.supp( X ) is the proportion of sequences that contain X .The measure represents the degree of reverse causalitybetween X and Y . That is, it denotes the strength that Ycauses X . Intuitively, if X and Y are randomly distributed among the patient cases in a database and do nothave any causal relationships, causal leverage( CX Y )should have same or similar value as reverse causalleverage( CX Y ), no matter how high the frequency of thepair is. Otherwise, if X causes Y , the former should begreater than the latter. On the other hand, if Y causes X ,then the former should be smaller than the latter. Hence,we define an exclusive causal leveragemeasure as below:

exclusive causal leverage( CX Y )= causal leverage( CX Y ) reverse causal leverage( CX Y )

= supp ( CX Y ) supp ( CX ) supp ( Y )[supp ( CY X) supp ( CY ) supp ( X )] (12)

If this measure has a positive value, it denotes that X hascausality with Y . The bigger the value, the stronger thecausal association. If its value is negative, it indicates areverse causality. Otherwise, a value close to zero is a

sign of no causal association between X causes Y . Thus,this measure not only provides a way to quantify the degree of causal association but also excludes undesirableeffects caused by high frequency events that occur randomly.

4.2 Mining ADR Signal Pairs from Electronic Pa-tient Databases

Drugs and their associated ADRs have causal relationships. In this subsection, we examine how to search forpotential ADR signal pairs from an electronic patient database using the above exclusive causal leverage measure.We assume that patient data are stored in relational tablesin a database and can be retrieved using database language like Structured Query Language (SQL). Thesetables are linked through patient identification numbers(PIDs). We also assume that the drug related data andsymptom related data are stored in two tables called Patient Drug Table and Patient Symptom Table, respectively.

Fig. 3 shows an overall picture of the data mining algorithm. First, preprocessing is needed to get two types ofinformation: 1) a list of all drugs D(d1, d2, …, dm) in thedatabase and the support count for each drug; 2) a list ofall symptoms S(s1, s2, … sn) in the database and the support count for each symptom. The lists of drugs andsymptoms are needed to form all possible drug symptompairs whose causal strengths will be assessed. Since a patient database normally only contains a subset of all drugson the market and a subset of all symptoms, it is necessary to search the Patient Drug Table and the PatientSymptom Table to get the drugs and symptoms coveredby the database. Using the discovered drugs and symptoms in the database (instead of all drugs on the marketand all available symptoms) can avoid forming unnecessary pairs, which will reduce the computational complexity of the data mining algorithm. In addition, the support

Fig. 3. Algorithm for mining causal association rules

Fig. 4. Data structures for storing drug/symptom support counts. (a) drug hash table, (b) symptom hash table

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 8: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

8 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

count for each drug or symptom will be used to calculatethe exclusive causal leverage value for related pairs. Hencetheir values also need to be computed.

Algorithm 1 shows how to search a database (DB) forthe list of drugs and the support count for each drug. Thediscovered drugs and their support counts are stored in ahash table as shown in Fig 4a. Initially, the hash table isempty. We use Dk to represent the set of drugs taken bythe kth patient pk in the database. For each patient, we firstretrieve all the drugs taken the patient. For each of thesedrugs, we then check whether the hash table contains thedrug. If the hash table does not contain the drug, thedrug’s support count is set as 1, and both the drug and itssupport count are added to the hash table. Otherwise, thedrug’s support count is increased by 1. After all the patient cases are searched, the hash table is returned. Notethat, when calculating the support count for a drug, it iscounted only once for one patient case even if the drugappears several times in that patient case. One can seethat the computational complexity of this searchingprocess can be affected by the total number of patients Nin the database and the average number of drugs takenby a patient. The later is normally determined by the typeof patients in the database. For instance, old patients oftenhave multiple diseases and thus may take multiple drugseither at the same time or different times.

The algorithm used to search the Patient Symptom Table for a list of symptoms covered by the database as wellas the support count for every symptom (Fig 4b) is similarto Algorithm 1. If users are only interested in mining thepotential ADRs of a particular drug or a couple of drugs,the users can specify the drugs of interest. Similarly, theusers can also specify the list of symptoms if they want toanalyze which drugs can cause the symptoms of interest.In both cases, however, the Patient Drug Table and thePatient Symptom Table still need to be searched in orderto get the support count for each drug or symptom.

After getting the list of drugs D(d1, d2, …, dm) and thelist of symptoms S(s1, s2, … sn), the next step is to generateall the possible pairs, each of which represents a CAR.Algorithm 2 shows the process for pair generation andevaluation. Please note that most existing data miningmethods mine all interesting association rules that combine all possible events or items in a database. We areonly interested in mining drug symptom pairs that can be

easily generated, given D and S. That is, we are not interested in other patterns like drug drug pairs, symptomsymptom pairs or combinations of multiple drugs andsymptoms. Thus, our algorithm generates a much fewernumber of candidate rules, which implies much lesscomplexity. The complexity of this pair generationprocess is O(m×n) where m and n are the number of drugsand symptoms, respectively. In addition, since we areinterested in mining infrequent patterns, it is inappropriateto prune pairs using the support measure (i.e. support >minsupp). However, in post marketing surveillance, a signal pair generated by a data mining method is generallynot considered as valid if only one or two patient casescontain the pair [9]. Therefore, we utilize a minimumsupport count mincount=3 (instead of minsupp) to furtherreduce the number pairs that will be evaluated. Specifically, we retrieve the PIDs of those patient cases that containboth the drug and the symptom of each pair. This is doneby sending a SQL statement to query the database. Modern database management systems can utilize optimization techniques like index to speed up this process. If thenumber of PIDs that support a pair is greater than orequal to mincount (Line 4 of Algorithm 2), the pair is evaluated. To evaluate the strength of the causal associationof the pair, its causal leverage value is first computed.Then, the pair is reversed. That is, the pair < di, sj > becomes < sj, di >. The reverse causal leverage value of pair < di,sj > is equal to the causal leverage value of its reversed pair< sj, di >. After that, the exclusive causal leverage value of thepair is computed by subtracting its reverse causal leveragevalue from its normal causal leverage value.

Algorithm 3 shows how to compute the causalleverage value of a general pair between event X and Y .

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 9: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 9

Both X and Y could be either drug event or symptomevent. First, the drug or symptom hash table is searchedin order to get the support count for event Y . Then, foreach PID that supports the pair, a process called cue abstraction is used to extract a set cue values V from therelated patient case. Specifically, a list of drug start datesand a list of symptom dates are retrieved from the PatientDrug Table and the Patient Symptom Table, respectively.Note that, since each patient case records the patient’shistory for a long period of time, the same drug may beprescribed many times and thus multiple drug start datesmay exist within one patient case. Similarly, there mayexist multiple symptom dates for the symptom within thesame patient case. Therefore, we must compare each startdate of the drug with each symptom date of the symptomin order to obtain interested temporal patterns for the pairwithin the case. The complexity of this process is O(Ld×Ls),where Ld and Ls are the length of the list of drug startdates and the length of the symptom dates, respectively.Ld and Ls often depend on the characteristics of the patient. For example, a patient with chronic diseases tendsto take the same drug for many times and have repeatedsymptoms. After getting the temporal patterns, cue values for temporal association, rechallenge and dechallenge arederived from these patterns using fuzzy rules. Note that,in order to make our algorithm more generic, the cue other explanations was not utilized in this study since it isdrug dependent. Readers are referred to our previouspaper for specific fuzzy rules that were used in thisprocess [16].

After the set of cue values V of the pair are extractedfrom the related patient case, similarity values are computed between V and each set of cue values V’ in the experience knowledge base (EKB). This step is also calledfeature matching in the Fuzzy RPD model. These similarity values are then normalized so that they are transformed to the format shown in Table 2. After that, thedegree of causality C<X, Y> within the current patient iscomputed. If C<X, Y> is greater than 0, it is added to the accumulated votes and the number of cases whose votes aregreater than 0 is increased by one. After the above computation is done for all the supporting cases of the pair,the causal leverage value of the pair is computed (refer toEq. (10)) and returned.

Finally, we rank all the pairs in a decreasing order according to their exclusive causal leverage values after allthese values are computed. The higher the value, themore likely a causal association exists in the pair or CAR.

5 EXPERIMENTS5.1 Experiment data Our experiment data were from Veterans Affairs MedicalCenter in Detroit, Michigan. We were interested in twoclasses of drugs: statin drugs and ACEI (angiotensinconverting enzyme inhibitors) drugs. De identified electronic data of all the patients who received at least one ofthe drugs of our interest during the time period from September 4, 2007 to September 16, 2009 were retrieved.“Event” data such as dispensing of drug, office visits, and

certain laboratory tests were retrieved for all the patients.For each event certain details were obtained. For example,the data for dispensing of drug include the name of thedrug, quantity of the drug dispensed, dose of the drug,drug start date, drug schedule, and the number of refills.

The retrieved data included 16,206 patients. 15,605(96.3%) were male, and 601 (3.7%) were female. Their average age was 68.0. The drugs enalapril, pravastatin androsuvastatin were selected to test the proposed data mining framework. Pravastatin and rosuvastatin are bothstatin drugs, and enalapril is an ACEI drug. These threedrugs allow us to examine the effectiveness of our datamining framework in identifying potential ADRs fordrugs both in the same drug class and in different drugclasses. The total number of patients who took one of thethree drugs enalapril, pravastatin and rosuvastatin atleast once is 1021, 1383, and 882, respectively. Note thatmultiple drugs might be prescribed for the same patienteither at the same date or different dates along time. Thetotal number of distinctive ICD 9 codes was 3,954 in thedata. Each code represented a potential ADR that mightbe causally associated with any drug.

The data were stored in five relational tables in a Microsoft Access database. These tables contained patients’demographic data, visit data, diagnostic data, drugrelated data, and laboratory data. SQL was used to querythe tables and obtain desired information. The query results were returned to Java programs that implementedthe data mining algorithm. Fuzzy sets and fuzzy ruleswere coded using Fuzzyjess [52], a Java based fuzzy inference engine.

5.2 Experiment Results Table 3 gives the causal leverage, reverse causal leverage andexclusive causal_leverage between drug enalapril and eachof seven ICD 9 codes: 276.7 (hyperkalemia), 366.02 (postsubcaps polar cataract), 428.0 (congestive heart failure),327.27 (central sleep apnea in conditions classified elsewhere), 305.60 (cocaine abuse unspecified), 275.42(Hypercalcemia), and 525.9 (unspecified disorder of theteeth and supporting structures). The code 276.7represents a well known ADR caused by drug. The tableshows that its causal leverage is much larger than its reverse

TABLE 3COMPARING THE DEGREE OF ASSOCIATION BETWEEN ENALAPRIL

AND EACH OF SEVEN ICD-9 CODES UNDER DIFFERENT MEAS-URES (HAZARD PERIOD T = 90 DAYS)

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 10: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

causal leverage, and thus its exclusive causal_leverage is closer to its causal leverage. The code 366.02 was considered byour physicians as neither an ADR nor a medical conditiondue to drug enalapril. One can see that its causal leverageand reverse causal leverage are very close even though either one is larger than the causal leverage of 276.7. Thecode 428.0 denotes a medical condition for taking enalapril. The table indicates that its causal leverage is smallerthan its reverse causal leverage, which results in a negativeexclusive causal_leverage value. The table also includes fourother disease conditions that are not related to drug enalapril. That is, they are neither ADRs caused by enalaprilnor medical conditions that cause the prescription of enalapril. The code 327.27 has the smallest positive exclusivecausal_leverage value among all those unrelated codes.Based on the codes 276.7 and 327.27, one can see that thepositive range of exclusive causal_leverage values is 2.94E 7~ 1.36 E 4 with respect to drug enalapril. The code 305.60has the largest positive exclusive causal_leverage valueamong those unrelated codes. Its reverse causal leverage is0, which means that no cases exhibit reasonable temporalpatterns for the reverse pair. According to Table 4, therank of this code is as high as 6. This indicates that someunrelated codes could still be ranked high due to thecomplexity of the data. The codes 275.42 and 525.9 aresimply two randomly selected symptoms that are notrelated to drug enalapril. They exhibit similar propertiesas codes 366.02 and 327.27. That is, their causal leverageand reverse causal leverage values are relatively large, buttheir absolute values of the exclusive causal_leverage measure are small. Ranking these seven ICD 9 codes usingtheir causal leverage values, the code 428.0 will demonstrate the strongest causal association with drug enalapril.But if exclusive causal_leverage is used, the code 276.7 (i.e.the known ADR) will have the strongest association with

the drug. Therefore, the table indicates that our algorithmcan indeed effectively remove some undesirable effectscaused by frequent events.

For each of the three selected drugs, we computed theexclusive causal leverage values for all the pairs betweenthe drug and each of the 3,954 ICD 9 codes. We rankedthe codes in a decreasing order according to each pair’sexclusive causal leverage value. The top 10 codes were evaluated by our physicians to determine whether they werelikely real ADRs associated with the drug. The physiciansmade decisions based on the Physician’s Desk Reference[53], literature reporting, and their clinical experience.We also ranked the association between the drug andeach ICD 9 code using three other measures: causalleverage, traditional leverage without considering causality, and risk ratio. Risk ratio is defined as the probabilitythat a case coexists with the ICD 9 code relative to theprobability that a non case coexists with the same ICD 9code. Risk ratio and leverage represent two typical frequency based interestingness measures. The hazard period was set as T=90 days when causal leverage and exclusive causal leverage were calculated. The value of this parameter needs to be chosen carefully since if it is too large,some unrelated symptoms could form false temporal relationships with the drug. If it is too small, real signals maybe excluded because some ADRs only happen after adrug has been taken for as long as tens of days. After balancing these two situations, our physicians think 90would be an appropriate value for this parameter.

Table 4 lists the top 10 ICD 9 codes identified by theexclusive causal leverage measure for drug enalapril. Eightof them were thought as real ADRs associated with enalapril with high likelihood. Their ranks in the three other

TABLE 5COMPARING RANKS OF ICD-9 CODES GIVEN DRUG PRAVASTA-

TIN UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES (NOS - NOT OTHERWISE SPECIFIED; NEC – NOT ELSEWHERE

CLASSIFIED)

TABLE 4COMPARING RANKS OF ICD-9 CODES GIVEN DRUG ENALAPRIL

UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 11: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 11

measures are also included in the table. One can see thatneither risk ratio nor leverage ranked the eight ICD 9codes highly. The table also indicates that the results generated by these two measures (i.e. risk ratio and leverage)are much more similar than those generated by the othertwo measures. By looking at the ranks of the eight ICD 9codes, the causal leverage measure produced much betterperformance as compared with the risk ratio and leveragemeasures since it could capture the causal relationshipbetween a pair. Even though the tradition leverage measure obtained comparable or even better ranks for two ofthe eight ICD 9 codes (i.e. 276.7 and 233.7), it ranked theremaining six ICD 9 codes as low as around 1000. Withcapability of reducing the undesirable effects of frequentICD 9 codes, the exclusive causal leverage measureachieved the best performance.

The experiment results for drug pravastatin are givenin Table 5. Seven out of the top 10 ICD 9 codes were identified as likely ADRs by our physicians. The causalleverage measure gives higher ranks for all these sevenICD 9 codes than both risk ratio and traditional leveragemeasures. Another interesting observation is that thecausal leverage measure classified three out of the sevenICD 9 codes into its top 10 list. This implies that thismeasure does have the potential to discover some potential ADRs because of its capability in capturing the causalrelationship between a drug and a symptom. Consistently, the exclusive causal leverage measure again achieved thebest performance.

Table 6 gives the experiment results for drug rosuvastatin. Six out of the top 10 ICD 9 codes were found to belikely ADRs of this drug. In terms of the relative abilitiesof the four interestingness measures in discovering potential ADRs, these results are consistent with the resultsdemonstrated by the other two drugs.

Since the length of hazard period T affects the performance of our exclusive causal leverage measure, we testedthe rankings of three randomly selected known ADRs ofdrug enalapril for different T values (Table 7). Either 30 or

150 does not seem to be appropriate values for T sincesome known ADRs were ranked very low. Similar rankings were achieved when Twas set to 60, 90 or 120 days.

6 DISCUSSION

The above experiment results showed that our exclusivecausal leverage measure outperformed traditional interestingness measures like risk ratio and leverage because ofits ability to capture suspect causal relationships and exclude undesirable effects caused by frequent events.However, due to the complexity, incompleteness, andpotential bias (e.g. tolerable ADRs sometimes may not berecorded) of the data, some ICD 9 codes may still be falsely ranked high based on the exclusive causal leverage. Thatis, false associations cannot be completely avoided usingour method. Please note that, in post marketing surveillance, data mining is primarily used to shorten the longlist of potential drug ADR pairs. Like the other data mining methods, the signal pairs found by our approach willbe subject to further analysis (e.g., epidemiology study),case review, and interpretation by drug safety professionals experienced in the nuances of pharmacoepidemiologyand clinical medicine [54].

When compared to the ADRs from clinical trials, theADRs identified by the exclusive causal leverage measurefor the statins (pravastatin and rosuvastain) are uniqueand different. Clinical trials revealed some similar ADRsbetween the two statins including nausea, headaches,myalgia, and dizziness. However, the ADRs in thisstudy, as listed in Tables 5 and 6, revealed no overlap ofADRs between the two statins. One reason for this disparity could be that we only evaluated the top 10 symptomsranked by our measure. The difference may also be accounted by the differences in setting and sampling of cases. Clinical trial data come from double blinded placebostudies, but the hospital data represent a naturalisticstudy. In clinical trials, the participants are individualswith hyperlipidemia as their only diagnosis or the primary diagnosis, and they are not allowed to have any complicated medication regimen or unstable medical conditions. The data set in our study derived from hospitalized patients, who are primarily male (96.3%) with meanage of 68, who normally have multiple diagnoses in various severities and taking variety of medications. As aresult, the ADRs derived from the exclusive causal leveragemeasure represent more naturalistic and real patients inclinical care. The method allows us to examine and evaluate complicated samples of patients to draw usefulcausal information.

TABLE 6COMPARING RANKS OF ICD-9 CODES GIVEN DRUG ROSUVAS-TATIN UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES

TABLE 7RANKINGS OF THREE KNOWN ADRS OF DRUG ENALAPRIL UN-

DER DIFFERENT LENGTHS OF HAZARD PERIOD T

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 12: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

12 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

The focus of this paper is on how to design a novel interestingness measure and apply the result to the important adverse drug reaction problem (i.e., finding interesting ADR signal pairs). Algorithm design and efficiencyanalysis become more important when one studies howto efficiently mine all possible rare eventsets and association rules based on minimal support. Another differenceis that according to the literature, existing rare associationrule mining research assumes that all the data are storedin a single table. In contrast, our work must use data froma relational (medical) database composed of several related tables. Although designing efficient SQL statementsand data models to handle large volumes of data in a relational database is beyond the scope of this paper, wewill consider developing an efficient algorithm suitablefor relational databases and analyze its complexity andefficiency in the future.

7 CONCLUSION

Mining the causal association between two events is veryimportant and useful in many real applications. It canhelp people discover the causality of a type of events andavoid its potential adverse effects. However, mining theseassociations is very difficult especially when events ofinterest occur infrequently. We have developed a newinterestingness measure, exclusive causal leverage, based onan experience based fuzzy RPD model. This measure canbe used to quantify the degree of association of a CAR.Moreover, the measure was designed to mask the undesirable effects caused by high frequency events. We haveapplied this measure to detect the causal associations between each of the three drugs (i.e., enalapril, pravastatin,and rosuvastatin) and the ICD 9 codes. A data miningalgorithm was developed to search a real electronic patient database for potential ADR signals. Experimentalresults showed that our algorithm could effectively makeknown ADRs rank high among all the symptoms in thedatabase.

ACKNOWLEDGMENTWe would like to thank Ellison Floyd and Richard E. Miller for retrieving the patient information from the VAMCDetroit’s databases. This work was supported in part byNational Institutes of Health under grant 1 R21GM082821 01A1.

REFERENCES[1] J. Talbot and P. Waller, Stephens Detection of New Adverse Drug

Reactions, 5 ed.: John Wiley & Sons, 2004.[2] L. T. Kohn, J. M. Corrigan, and M. S. Donaldson, To error is

human: building a safer health system. Washington, DC: NationalAcademy Press, 2000.

[3] J. Lazarou, B. H. Pomeranz, and P. N. Corey, Incidence ofadverse drug reactions in hospitalized patients: a meta analysisof prospective studies, JAMA, vol. 279, pp. 1200 5, Apr 15 1998.

[4] K. Hartmann, A. K. Doser, and M. Kuhn, Postmarketing safetyinformation: how useful are spontaneous reports?,Pharmacoepidemiology & Drug Safety, vol. 8, pp. 65 71, 1999.

[5] L. Hazell and S. A. W. Shakir, Under Reporting of AdverseDrug Reactions A Systematic Review, Drug Safety, vol. 29, pp.385 396, 2006.

[6] K. E. Lasser, P. D. Allen, S. J. Woolhandler, D. U. Himmelstein,S. M. Wolfe, and D. H. Bor, Timing of new black box warningsand withdrawals for prescription medications, Journal of theAmerican Medical Association, vol. 287, pp. 2215 20, 2002.

[7] A. Szarfman, J. M. Tonning, and P. M. Doraiswamy,Pharmacovigilance in the 21st century: new systematic toolsfor an old problem., Pharmacotherapy, vol. 24, pp. 1099 104,2004.

[8] M. Lindquist, I. R. Edwards, A. Bate, H. Fucik, A. M. Nunes,and M. Stahl, From association to alert a revised approach tointernational signal analysis, Pharmacoepidemiology & DrugSafety, vol. 1, pp. 15 25, 1999.

[9] S. J. Evans, P. C. Waller, and S. Davis, Use of proportionalreporting ratios (PRRs) for signal generation from spontaneousadverse drug reaction reports, Pharmacoepidemiology & DrugSafety, vol. 10, pp. 483 6, 2001.

[10] W. DuMouchel, Bayesian Data Mining in Large FrequencyTables, With an Application to the FDA Spontaneous ReportingSystem, Am Stat pp. 177 190, 1999.

[11] P. Purcell and S. Barty, Statistical techniques for signalgeneration: the Australian experience, Drug Safety, vol. 25, pp.415 21, 2002.

[12] M. Hauben, Early postmarketing drug safety surveillance: datamining points to consider, Annals of Pharmacotherapy, vol. 38,pp. 1625 30, Oct 2004.

[13] Y. Ji, H. Ying, M. S. Farber, J. Yen, P. Dews, R. E. Miller, and R.M. Massanari, A Distributed, Collaborative Intelligent AgentSystem Approach for Proactive Postmarketing Drug SafetySurveillance, IEEE Transactions on Information Technology inBiomedicine, vol. 14, pp. 826 837, 2010.

[14] H. Jin, J. Chen, H. He, G. Williams, C. Kelman, and C. O Keefe,Mining unexpected temporal associations: applications indetecting adverse drug reactions, IEEE Transactions onInformation Technology in Biomedicine, vol. 12, pp. 488 500, 2008.

[15] G. N. Norén, J. Hopstadius, A. Bate, K. Star, and I. R. Edwards,Temporal pattern discovery in longitudinal electronic patientrecords Data Mining and Knowledge Discovery, vol. 20, pp. 361387, 2010.

[16] Y. Ji, H. Ying, P. Dews, M. S. Farber, A. Mansour, J. Tran, R. E.Miller, and R. M. Massanari, A Fuzzy Recognition PrimedDecision Model Based Causal Association Mining Algorithmfor Detecting Adverse Drug Reactions in PostmarketingSurveillance, presented at the Proceedings of The IEEEInternational Conference on Fuzzy Systems, IEEE WorldCongress on Computational Intelligence, Barcelon, Spain, 2010.

[17] Y. Ji, H. Ying, P. Dews, A. Mansour, J. Tran, R. E. Miller, and R.M. Massanari, A Potential Causal Association MiningAlgorithm for Screening Adverse Drug Reactions inPostmarketing Surveillance, Information Technology inBiomedicine, IEEE Transactions on, vol. 15, pp. 428 437, 2011.

[18] Y. Ji, R. M. Massanari, J. Ager, J. Yen, R. E. Miller, and H. Ying,A Fuzzy Logic Based Computational Recognition PrimedDecision Model, Information Science, vol. 177, pp. 4338 4353,2007.

[19] P. N. Tan, M. Steinbach, and V. Kumar. (2005). Introduction toData Mining.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 13: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 13

[20] L. Geng and H. J. Hamilton, Interestingness Measures for DataMining: A Survey, ACM Computing Surverys, vol. 38, 2006.

[21] A. K. H. Tung, H. Lu, J. Han, and L. Feng, Efficient mining ofintertransaction association rules IEEE Transactions onKnowledge and Data Engineering, vol. 15, pp. 43 56, 2003.

[22] C. Marinica and F. Guillet, Knowledge Based InteractivePostmining of Association Rules Using Ontologies, IEEETransactions on Knowledge and Data Engineering, vol. 22, pp. 784797, 2010.

[23] R. AGRAWAL and R. SRIKANT, Fast algorithms for miningassociation rules, presented at the Proceedings of the 20thInternational Conference on Very Large Databases, Santiago,Chile, 1994.

[24] P. N. Tan and V. Kumar, Interestingness Measures forAssociation Patterns : A Perspective, Department of ComputerScience, University of Minnesota, 2000.

[25] W. Klosgen, Explora: a multipattern and multistrategydiscovery assistant, in Advances in knowledge discovery and datamining, U. M. Fayyad, et al., Eds., 1st ed Cambridge, MA: MITPress, 1996, pp. 249 271.

[26] N. Lavrac, P. Flach, and B. Zupan, Rule Evaluation Measures:A Unifying View, presented at the Proceedings of the 9thInternational Workshop on Inductive Logic Programming,Bled, Slovenia, 1999.

[27] G. M. Weiss, Mining with rarity: a unifying framework,SIGKDD Explor. Newsl., vol. 6, pp. 7 19, 2004.

[28] B. Liu, W. Hsu, and Y. Ma, Mining association rules withmultiple minimum supports, presented at the Proceedings ofthe fifth ACM SIGKDD international conference on Knowledgediscovery and data mining, San Diego, California, UnitedStates, 1999.

[29] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, Mining associationrules on significant rare data using relative support, J. Syst.Softw., vol. 67, pp. 181 191, 2003.

[30] Y. H. Hu and Y. L. Chen, Mining association rules withmultiple minimum supports: a new mining algorithm and asupport tuning mechanism, Decis. Support Syst., vol. 42, pp. 124, 2006.

[31] L. Szathmary, P. Valtchev, and A. Napoli, Finding minimalrare itemsets and rare association rules, presented at theProceedings of the 4th international conference on Knowledgescience, engineering and management, Belfast, NorthernIreland, UK, 2010.

[32] L. Szathmary, A. Napoli, and P. Valtchev, Towards RareItemset Mining, presented at the Proceedings of the 19th IEEEInternational Conference on Tools with Artificial IntelligenceVolume 01, 2007.

[33] L. Troiano, G. Scibelli, and C. Birtolo, A Fast Algorithm forMining Rare Itemsets, presented at the Proceedings of the 2009Ninth International Conference on Intelligent Systems Designand Applications, 2009.

[34] Y. S. Koh and N. Rountree, Finding sporadic rules usingApriori Inverse, vol. 3518 LNAI, ed, 2005, pp. 97 106.

[35] M. Adda, L. Wu, and Y. Feng, Rare itemset mining, 2007, pp.73 80.

[36] C. Zhang and S. Zhang, Association Rule Mining: Models andAlgorithms. New York: Springer Verlag, 2002.

[37] I. Guyon, D. Janzing, and B. Schölkopf, Causality: Objectivesand Assessment, JMLR Workshop and Conference Proceedings,vol. 6, pp. 1 38, 2010.

[38] J. Pearl, Causality: Models, Reasoning and Inference, 2nd ed.:Cambridge University Press, 2009.

[39] D. Koller and N. Friedman, Probabilistic Graphical Models:Principles and Techniques, 1st ed.: The MIT Press, 2009.

[40] M. Hassoun, Fundamentals of Artificial Neural Networks: The MITPress, 2003.

[41] G. A. Fink, Markov Models for Pattern Recognition: From Theory toApplications: Springer, 2010.

[42] D. Heckerman, Bayesian Networks for Data Mining, DataMin. Knowl. Discov., vol. 1, pp. 79 119, 1997.

[43] G. F. Cooper, A Simple Constraint Based Algorithm forEfficiently Mining Observational Databases for CausalRelationships, Data Min. Knowl. Discov., vol. 1, pp. 203 224,1997.

[44] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler,Hidden Markov models in computational biology.Applications to protein modeling, J Mol Biol, vol. 235, pp. 150131, 1994.

[45] S. R. Eddy, Profile hidden Markov models, Bioinformatics, vol.14, pp. 755 63, 1998.

[46] G. A. Klein, A recognition primed decision making model ofrapid decision making, Decision Making In Action: Models andMethods, pp. 138 147, 1993.

[47] L. A. Zadeh, Fuzzy sets, Information and Control, vol. 8, pp.338 353, 1965.

[48] H. Ying, Fuzzy Control and Modeling: Analytical Foundations andApplications: IEEE Press, 2000.

[49] I. R. Edwards and J. K. Aronson, Adverse drug reactions:definitions, diagnosis, and management., Lancet, vol. 356, pp.1255 9, 2000.

[50] Y. Ji, H. Ying, J. Yen, and R. M. Massanari, A Fuzzy LogicBased Computational Recognition Primed Decision Model,Information Science, vol. 77, pp. 4338 4353, 2007.

[51] M. Cayrol, H. Farreny, and H. Prade, Fuzzy PatternMatching, Kybernetes, vol. 11, pp. 103 116, 1982.

[52] R. Orchard, Fuzzy Reasoning in Jess: The FuzzyJ Toolkit andFuzzyJess, in Third International Conference on EnterpriseInformation Systems, Setubal, Portugal, 2001, pp. 533 542.

[53] P. Staff, Ed., Physicians Desk Reference 2011. PDR Network,2010.

[54] M. Hauben and X. Zhou, Quantitative methods inpharmacovigilance: focus on signal detection, Drug Safety, vol.26, pp. 159 86, 2003.

Yanqing Ji received the B.Eng. degree in industrial automation from Qingdao University, Qingdao, China, in 1997, the M.Sc. degree in physical electronics from the University of Science and Technology of China, Hefei, Chi-na, in 2001, and the Ph.D. degree in computer engineering from Wayne State University, Detroit, MI, in 2007. He is currently an Assis-tant Professor with the Department of Electric-al and Computer Engineering, Gonzaga Uni-versity, Spokane, WA. His research interests include multiagent systems, parallel and distri-

buted systems, data mining, and their biomedical applications. Dr. Ji is a member of the Program Committees of various international conferences and workshops including the 2009 International Work-shop on Graphics Processing Unit Technologies and Applications, and reviewer for various international journals.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 14: A Method for Mining Infrequent Causal Associationsand Its Application in Finding Adverse Drug Reaction Signal Pair

14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Hao Ying (S’88–M’90–SM’97-F’2012) re-ceived the B.S. degree in electrical engineer-ing, the M.S. degree in computer engineering from Donghua University, Shanghai, China, in 1982 and 1984, respectively, and the Ph.D. degree in biomedical engineering from The University of Alabama, Birmingham, in 1990. He is currently a Professor with the Depart-ment of Electrical and Computer Engineering, Wayne State University, Detroit, MI. He has authored a book entitled Fuzzy Control and

Modeling: Analytical Foundations and Applications (IEEE Press, 2000), over 90 peer-reviewed journal papers, and more than 130 conference papers. Prof. Ying is an Associate Editor or a member of Editorial Board for 10 international journals and is a member of the Fuzzy Systems Technical Committee of the IEEE Computational Intelligence Society. He was Program Chair for The 2005 North American Fuzzy Information Processing Society (NAFIPS) Confe-rence and Co-Program Chair for the 2010 NAFIPS Conference as well as for The International Joint Conference of NAFIPS Confe-rence, the Industrial Fuzzy Control and Intelligent System Confe-rence, and the NASA Joint Technology Workshop on Neural Net-works and Fuzzy Logic held in 1994.

John Tran received his BS in Chemistry from Santa Clara Universi-ty, California in 1990 and MD from School of Medicine, St. Louis University in 1995. John completed his Residency Training in Gen-eral Psychiatry from Washington University in St. Louis, Missouri in 1999 and his Fellowship Training in Geriatric Psychiatry from Uni-versity of Washington (UW), Seattle, Washington in 2000. He com-pleted his Post-Doctoral Research Fellowship in Geriatric Psychiatry from UW in 2002. He is currently Board Certified in General and Geriatric Psychiatry. John currently serves as the Medical Director and Chief of Geriatric Psychiatry for Frontier Behavioral Health. He is responsible for directing and providing psychiatric services to au-dult and elderly patients in outpatient psychiatric clinic as well as in long term care settings, including skilled nursing facilities, assistant livings, and adult family homes. He currently supervised 11 psy-chiatrists and 10 Advanced Registered Practice Nurses. John has great interests in clinical research and teaching. He is the Director and Primary Instigator for Frontier Institute where he conduct NIH and Industry sponsored clinical trials for new medication develop-ments for psychiatric disorders including depression, anxiety, schi-zophrenia, dementias, etc. He also serves as Associate Clinical Professor in School of Medicine, University of Washington, as well as Adjunct Associate Clinical Professor in School of Pharmacy and School of Nursing, Washington State University, in teaching and training of medical students, psychiatry residents, pharmacy and nursing students.

Peter Dews received his medical degree from Wayne State Univer-sity in 1986. He also holds a bachelor's degree in microbiology and a master's degree in clinical research design and statistical analysis from the University of Michigan. He is currently the Chief Medical Officer and Vice President of Medical Aff airs for St. Mary Mercy Hospital, Trinity Health. He was the medical director of ambulatory services for the Wayne State University Physician Group's Depart-ment of Internal Medicine and had practiced general medicine for nearly 15 years before he joined Trinity Health. Dr. Dews is a mem-ber of the Society of General Internal Medicine, the American Col-lege of Physicians, the Association of Health Services Research, the American Medical Informatics Society and the American College of Physician Executivesphotograph and biography not available.

Ayman Mansour received his B.Sc in Electrical and Electronics Engineering from University of Sharjah, Sharjah, U.A.E, in 2004 and his M.Sc. degree in Electrical Engineering from the Univertsity of Jordan, Amman, Jordan, in 2006. He joined the Department of Elec-trical and Computer Engineering at Wayne State University, Detroit, Michigan in 2008 as a PhD student. His research interests include multiagent systems, intelligent systems and data mining.

R. Michael Massanari is a graduate of the Feinberg School of Med-icine, Northwestern University, Chicago, IL, in 1967. He received post-graduate training at the University of Pennsylvania in Philadel-

phia and Northwestern University in Chicago. He also received a M.S. degree in Preventive Medicine and Epidemiology in 1985 from the University of Iowa in Iowa City, Iowa. He is currently Executive Director of the Critical Junctures Institute in Bellingham, WA. He served as Director of the Center for Healthcare Effectiveness Re-search (CHER) and Professor of Medicine and Community Medicine at Wayne State University until his retirement in 2009. Dr. Massanari is a Fellow of the American College of Physicians and a member of the Academy of Health and American Public Health Association.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.