
Knowledge Discovery from Massive Healthcare Claims Data

Varun Chandola, Oak Ridge National

[email protected]

Sreenivas R. Sukumar, Oak Ridge National

[email protected]

Jack Schryver, Oak Ridge National

[email protected]

ABSTRACT

The role of big data in addressing the needs of the present healthcare system in the US and the rest of the world has been echoed by the government, private, and academic sectors. There has been a growing emphasis on exploring the promise of big data analytics in tapping the potential of the massive healthcare data emanating from private and government health insurance providers. While the domain implications of such collaboration are well known, this type of data has been explored only to a limited extent in the data mining community. The objective of this paper is twofold: first, we introduce the emerging domain of “big” healthcare claims data to the KDD community, and second, we describe the successes and challenges that we encountered in analyzing this data using state-of-the-art analytics for massive data. Specifically, we translate the problem of analyzing healthcare data into some of the most well-known analysis problems in the data mining community (social network analysis, text mining, temporal analysis, and higher-order feature construction) and describe how advances within each of these areas can be leveraged to understand the domain of healthcare. Each case study illustrates a unique intersection of data mining and healthcare with a common objective of improving the cost-care ratio by mining for opportunities to improve healthcare operations and to reduce what seems to fall under fraud, waste, and abuse.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

Keywords

Healthcare Analytics; Fraud Detection

(c) 2013 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
KDD’13, August 11–14, 2013, Chicago, Illinois, USA.
Copyright 2013 ACM 978-1-4503-2174-7/13/08 ...$15.00.

1. INTRODUCTION

Healthcare spending in the United States is one of the key issues targeted by policy makers, owing to the fact that it is a major contributor to the high national debt levels that are projected for the next two decades. In 2008, total healthcare spending in the US was 15.2% of its GDP (the highest in the world) and is expected to reach as much as 19.5% by 2017 [2]. But while healthcare costs have risen (by as much as 131% in the past decade), the quality of healthcare in the US has not seen comparable improvements (see Figure 1) [24].

Figure 1: Life expectancy compared to healthcare spending from 1970 to 2008, in the US and the next 19 most wealthy countries by total GDP [14].

Experts agree that inefficiencies in the current healthcare system, resulting in unprecedented amounts of waste, are the primary driver of the discrepancy between spending and returns in the healthcare domain [11]. Recent studies estimate that close to 30% (∼$765 billion in 2009) of total healthcare spending in the United States is wasted, owing to factors such as unnecessary services, fraud, excessive administrative costs, and inefficiencies in healthcare delivery.

In recent years, several experts as well as the federal government¹ have stressed the role of big data analytics in addressing the issues with healthcare. The 2011 report by the McKinsey Global Institute [19] estimates that the potential value that can be extracted from data in the US healthcare sector could be more than $300 billion per year. The same report lists several areas within the healthcare sector that can benefit from big data analytics. These include segmentation of patients based on their health profiles to identify target groups for proactive care or lifestyle changes, development of fraud-resistant payment models, creation of information transparency and accessibility around healthcare data, and comparative effectiveness research across providers, patients, and geographies.

¹ http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf

Healthcare insurance claims have the potential to answer many of the questions currently faced by the healthcare sector. In fact, until shareable electronic health records become a reality, healthcare claims are the most reliable resource for understanding the current healthcare landscape from the perspectives of conditions, care, and cost. This is especially true of claims from organizations with large spatial and demographic coverage, as is the case with many of the government-run health insurance programs in the country. But the transactional format of claims data is not amenable to the advanced analytics that state-of-the-art KDD methodologies have to offer. In this paper we explore transformations of healthcare claims data that bridge this gap between the healthcare domain and modern data analytics.

We present a study of big data analytics on health insurance data collected by a large national social health insurance program. While previous efforts have used data mining methods to analyze healthcare claims data within small organizations, we believe that this is one of the first instances where advanced data analytics and healthcare have interacted at a national scale. In all, we analyzed approximately 2 billion insurance claims for approximately 45 million beneficiaries and over 3 million healthcare providers.

1.1 Our Contributions

In this paper, we make the following contributions:

1. We introduce the emerging domain of healthcare claims data and identify multiple research problems that can be solved using existing big data analytics solutions.

2. We propose three transformations of the transactional claims data that enable the application of state-of-the-art KDD methodologies in this domain.

3. We present several approaches to identify and understand fraud, waste, and abuse in the healthcare system. The potential of each proposed approach is demonstrated on real claims data and validated using a true set of fraudulent providers.

4. We highlight the unique nature of healthcare data when analyzed using methods such as social network analysis and text analysis.

2. RELATED WORK

The role of big data in healthcare has been well acknowledged across government and industrial sectors²,³. But only a few published studies have analyzed such data [12]. The primary reason is data availability, given that healthcare claims data carries strong proprietary and privacy requirements⁴. Moreover, existing studies have considered claims data from a payment system perspective and have analyzed the data for payment errors [12]. A few efforts have analyzed healthcare claims data to understand the inefficiencies in the healthcare system [6], but from an analytics perspective they are limited to simple summary statistics such as population means for various demographics. In this paper we explore the application of three advanced KDD technologies, viz., text mining [18], social network analysis [29], and time series analysis [20], all of which have been successful in a variety of applications but have not been applied at a large scale to healthcare claims data.

² www.eweek.com/database/emc-says-big-data-is-essential-to-improving-health-outcomes/
³ www.intel.com/content/dam/www/public/us/en/documents/white-papers/healthcare-leveraging-big-data-paper.pdf
⁴ http://www.hhs.gov/ocr/privacy/

Figure 2: Different Types of Healthcare Data

Domains such as credit card and property insurance have long studied the issue of fraud identification [4, 10]. But healthcare fraud detection has unique characteristics, given that the actual beneficiary is typically not the fraud perpetrator, unlike in other domains. Existing fraud detection methods therefore cannot be directly applied to the healthcare domain. Most of the existing fraud detection solutions in the healthcare domain are not public, primarily because the data is highly sensitive and is usually not made available for research and publishing.

3. BACKGROUND

Healthcare data can be broadly categorized into four groups (see Figure 2). Clinical data (patient health records, medical images, lab and surgery reports, etc.) and patient behavior data (collected through monitors and wearable devices) provide an accurate and detailed view of the health of the population. But such data, which is increasingly being stored electronically, can be leveraged in a big data setting only when the owners (doctors, hospitals, and individuals) share it; owing to privacy concerns, its analysis is still limited to within an organization such as a hospital or a network of hospitals. Pharmaceutical research data (clinical trial reports, high-throughput screening results) is often subject to privacy concerns owing to business practices.

In this paper, we focus on health insurance data, which has been collected and stored for several years by various health insurance agencies. While the primary justification for this data is to track payments and address fraud, such data also has great potential to address some of the other aforementioned issues of the healthcare system. For the United States, such data is extremely valuable, given that around 85% of Americans use some form of insurance (private or government). Moreover, insurance data is the only source of the cost associated with healthcare, which is vital for addressing the economic challenges associated with the modern healthcare system. The main challenge presented by insurance data, on the other hand, is that it is not readily in a form that supports strong analytic insights into healthcare beyond the payment model. A key contribution of this paper is the transformation of the insurance data into formats that allow the application of existing analytic tools for knowledge discovery.

3.1 Health Insurance Data

The typical health insurance payment model is a fee-for-service (FFS) model in which the providers (doctors, hospitals, etc.) render services to the patients and are paid for each service by the payor or the insurance agency. The providers record the details of each service, including the cost and justification, and submit the record to the payor. The payor decides to either pay or reject the claim based on the patient’s eligibility for the particular service, which is determined by the policy guidelines.

The insurance agency typically maintains three types of data for its operations:

1. Claim information, which captures information about the service transaction, including the nature of the service and the cost.

2. Patient enrollment and eligibility data, which captures demographic information about the patients (or beneficiaries of the system) and their eligibility for different services.

3. Provider enrollment data, which captures information about the physicians, hospitals, and other healthcare-providing organizations.

3.2 Health Insurance in the United States

In the US, approximately 85% of the population has some form of health insurance. The majority of these individuals (≈60%) obtain insurance through their employer or the employer of a parent or spouse. Almost 28% of the population (83 million individuals) is covered under government health insurance programs such as Medicare, Medicaid, and Veterans Health Services.

The data managed by each of these programs is massive in scale. Medicare alone provides health insurance to 48 million Americans and covers hospitalization, outpatient care, medical equipment, and drugs. A few million providers are enrolled in the Medicare Provider Enrollment, Chain, and Ownership System (PECOS). In 2011, Medicare received close to 1.2 billion claims (4.8 million claims per day) for its fee-for-service programs. An almost equivalent number of claims was received in the prescription drug program (also known as Part D). Under Medicaid, more than 60 million individuals (one in every five) received benefits in 2009. In the state of Texas alone, there are more than 0.5 million providers within Medicaid.

4. CHALLENGES AND OPPORTUNITIES

As mentioned in Section 1, fraud, waste, and abuse account for a significant share of healthcare spending. Fraud ranges from single providers billing the system for services that were not provided to large-scale fraud carried out by organized criminals [27]. One form of waste arises from improper payments: organizations are mandated to process payments within a short period of time, resulting in authorization of double payments for duplicate claims, payments using an outdated fee schedule, etc. A major cause of abuse is that programs such as Medicare follow a prospective payment system for hospital care, which means that providers are paid for services at predetermined rates. Thus, if the actual service costs more than the allowed cost, the provider has to cover its losses. If the actual service costs less than the allowed cost, the provider keeps the remainder. This drives providers to bill for unnecessary or more expensive services (also known as upcoding) by assigning more severe diagnoses, both to safeguard against losses and to make a profit. Some of the examples listed above, e.g., double payment for duplicate claims, can be identified by applying business rules to the data. For others, such as upcoding or miscoding providers, there is a need for advanced algorithms that can analyze the vast amounts of claims data.

[Figure 3 diagram labels: Providers, Beneficiaries, Drugs, Procedures, Diagnoses, Speciality (~600); set sizes shown: ~50 K, ~17 K, ~8 K, ~4 M, ~50 M]

Figure 3: Entities and relationships in healthcare claims data along with the approximate number of entries in each entity set.

In the subsequent sections we describe three case studies that were conducted on healthcare claims and associated data. The common thread among the studies is the analysis of the behavior of, and interactions between, healthcare providers, which is highly important given that they are the primary drivers of wasteful spending in the system.

5. DATA

The data used for the subsequent case studies captures three different aspects of healthcare. The first is the claims data for close to 48 million beneficiaries across the entire US. The second is the provider enrollment data, which can be obtained from several private organizations. The third is a list of fraudulent providers that have been sanctioned for fraudulent behavior in the state of Texas. This list was obtained from the Office of Inspector General’s exclusion database⁵. Note that in this paper we treat the rest of the providers as non-fraudulent for evaluation, even though it is evident that a significant number of fraudulent actors have not been identified.

The claims and provider enrollment data come from transactional data warehouses. Each claim consists of several data elements with information about the beneficiary, the provider, the health condition (or diagnosis), the service provided (procedure or drug), and the associated costs. Figure 3 shows the different entities and their relationships that are present in the healthcare claims data. Note that providers are typically affiliated with each other through organizations such as hospitals. This information, along with additional data about the providers, is present in the provider data.

⁵ https://oig.hhs.gov/exclusions/exclusions_list.asp

6. LARGE SCALE TEXT ANALYTICS

Simply stated, the two ultimate goals necessary to address rising healthcare costs are: 1) a healthy population, and 2) optimal healthcare in terms of cost and quality. To reach these goals, the first vital step is to understand the current landscape in terms of prevalent diseases and the resulting treatments and costs. Identifying the key disease profiles for patients will allow segmentation of the population into groups that can then be targeted for proactive care or lifestyle changes. For providers, the typical treatment profiles used by doctors and hospitals will be instrumental in identifying the costly areas that need to be addressed through policy changes or medical research. Moreover, such profiles can also be used to compare providers across the country, and potentially across organizations, to identify fraudulent (upcoding) or wasteful providers.

In the first case study, we show how such profiles can be generated from the claims data (see Figure 3) using advanced text analytic solutions [18]. Text data, especially from the web domain, has been the foremost target of the big data paradigm, and a host of open-source solutions (the Apache Hadoop-based Mahout library [8], MADlib for parallel databases [9]) exist for deploying text mining algorithms on massive text data sets. The interaction between text mining and healthcare has, for obvious reasons, focused on analyzing the text available within patient health records (clinical data) [25]. But these solutions have never been applied in the context of healthcare claims data.

6.1 Representing Entities as Documents

To capture the behavior profiles of the providers and beneficiaries, we construct several sparse matrices from a temporal aggregate of claims data, as follows:

Let the set of providers be denoted as P, the set of beneficiaries as B, the set of procedures as C, the set of diagnoses as G, and the set of drugs as D. Let the symbol XY denote a matrix with X ∈ {B, P} representing rows and Y ∈ {G, C, D} representing columns. For example, the matrix PG (providers vs. diagnoses) captures the nature of the diagnoses that a doctor assigns to patients. Each cell PC_ij in the matrix PC denotes the number of times the provider P_i ∈ P uses the procedure C_j ∈ C in the given time frame.

Each of the matrices thus created can be viewed as a document-term matrix with providers/beneficiaries as documents and drugs/procedures/diagnoses as terms. Such a representation opens up the claims data to a wide spectrum of existing text analysis methods [18]. Given that multiple document-term matrices can be generated for the same entity (e.g., providers), it also allows the application of methods that learn from multiple views [16].
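As a minimal sketch of this construction, the following Python snippet aggregates claim records into a sparse PG-style count matrix, with one "document" per provider. The `(provider_id, diagnosis_code)` tuple layout, the sample codes, and the function name `build_count_matrix` are illustrative assumptions and do not reflect the format of the actual claims warehouse.

```python
from collections import Counter, defaultdict


def build_count_matrix(claims):
    """Aggregate (row_entity, column_code) pairs into a sparse count matrix.

    Returns a dict mapping each row entity (e.g., a provider) to a Counter
    of column codes (e.g., diagnoses); absent cells are implicitly zero.
    """
    matrix = defaultdict(Counter)
    for entity, code in claims:
        matrix[entity][code] += 1
    return dict(matrix)


# Hypothetical claims: (provider_id, diagnosis_code) pairs.
claims = [
    ("P1", "250.00"), ("P1", "250.00"), ("P1", "692.9"),
    ("P2", "366.9"), ("P2", "250.00"),
]
PG = build_count_matrix(claims)
print(PG["P1"]["250.00"])  # number of times P1 used diagnosis 250.00
```

The same function could be reused for the other matrices (PC, PD, BG, and so on) by changing which two claim fields are paired.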

6.2 Profiling Providers

In this study we used topic modeling with Latent Dirichlet Allocation (LDA) [5] as our text analysis tool. LDA is a probabilistic topic model that is widely used for determining hidden topics in a set of documents. In the LDA model, each document (provider) is represented as a mixture of a fixed number of hidden topics, and each topic is a probability distribution over a vocabulary of words (diagnosis codes).

Figure 4: Relative proportion of fraudulent and non-fraudulent providers assigned to each topic of diagnosis codes.

We applied the Mahout implementation of the collapsed variational Bayesian inference (cvb) algorithm [28] for LDA on the provider-diagnoses (PG) matrix constructed from a year of claims data (361,117 providers, 9,378 diagnosis codes, 43,331,004 tuples). The matrix was normalized using tf-idf normalization. Using LDA we identified 20 hidden topics from the PG matrix.
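The tf-idf step can be illustrated with a short Python sketch. The text does not state which tf-idf variant was used, so the weighting below (raw count times ln(N/df)) is an assumption, as are the function name `tfidf` and the sample data.

```python
import math
from collections import Counter


def tfidf(counts):
    """Apply tf-idf weighting to a sparse count matrix.

    counts: dict mapping document (provider) -> Counter of term (code) counts.
    Weight = raw term count * ln(N / df); this variant is an assumption.
    """
    n_docs = len(counts)
    doc_freq = Counter()
    for row in counts.values():
        doc_freq.update(row.keys())  # each document counts once per term
    return {
        doc: {term: tf * math.log(n_docs / doc_freq[term])
              for term, tf in row.items()}
        for doc, row in counts.items()
    }


# Hypothetical provider-diagnosis counts.
counts = {
    "P1": Counter({"250.00": 2, "692.9": 1}),
    "P2": Counter({"250.00": 1, "366.9": 3}),
}
weights = tfidf(counts)
```

Under this weighting, a code used by every provider receives weight zero, so the topic model is driven by codes that differentiate providers.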

From a healthcare perspective, each topic can be thought of as a category of providers. While one would expect the topic-driven categorization to closely match the actual specializations of the providers, we found that this was not always the case. While some topics were dominated by diagnoses that belonged to the same area of medicine (e.g., oncology, ophthalmology), other topics were made up of diagnosis codes that are seemingly unrelated to each other. For example, one topic consisted of diagnosis codes associated with diabetes as well as dermatoses (skin conditions). While this might appear surprising, further research revealed a medical connection between the two [17].

Another possible application of topic modeling is to use the topic distributions as features or profiles for the providers. To validate the discriminatory potential of topic distributions, we conducted the following analysis. For each provider, we chose the topic with the highest probability assigned by LDA. We then compared the proportions of fraudulent and non-fraudulent providers (using the list of fraudulent providers described in Section 5) that fall under each topic. The relative proportions are shown in Figure 4.
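The comparison described above can be sketched in a few lines of Python. The function name `dominant_topic_shares`, the three-topic distributions, and the fraud labels are hypothetical; in the study the topic distributions come from LDA and the labels from the exclusion list.

```python
from collections import Counter


def dominant_topic_shares(topic_dist, fraud_ids, n_topics):
    """Return per-topic shares of fraudulent and non-fraudulent providers.

    topic_dist: dict provider -> list of topic probabilities.
    fraud_ids: set of providers labeled fraudulent.
    """
    fraud, normal = Counter(), Counter()
    for provider, probs in topic_dist.items():
        # Assign the provider to its highest-probability topic.
        top = max(range(len(probs)), key=probs.__getitem__)
        (fraud if provider in fraud_ids else normal)[top] += 1
    n_fraud = sum(fraud.values()) or 1    # avoid division by zero
    n_normal = sum(normal.values()) or 1
    return (
        [fraud[k] / n_fraud for k in range(n_topics)],
        [normal[k] / n_normal for k in range(n_topics)],
    )


# Hypothetical topic distributions over 3 topics.
topic_dist = {
    "P1": [0.7, 0.2, 0.1],
    "P2": [0.1, 0.8, 0.1],
    "P3": [0.2, 0.7, 0.1],
    "P4": [0.1, 0.1, 0.8],
}
fraud_share, normal_share = dominant_topic_shares(topic_dist, {"P2", "P3"}, 3)
```

A topic whose fraudulent share is much larger than its non-fraudulent share is the kind of signal the analysis looks for.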

The non-fraudulent providers are evenly distributed across most topics, except for topic 19. This particular topic is composed of “generic” diagnoses such as follow-up surgery and benign tumor and hence is expected to represent a large number of providers. For the fraudulent providers, the distribution follows that of the non-fraudulent providers, except for topic 5, which represents close to 25% of all fraudulent providers as opposed to only 7% of non-fraudulent providers. Topic 5 is dominated by very distinct diagnosis codes.

From the domain perspective, this discovery is valuable since it identifies diagnosis codes that are used by the same providers and have historically been targets of fraud. Given that the codes in topic 5 are medically distinct from each other, such a discovery can only be made by methods that allow all diagnosis codes to be related to each other, i.e., topic models.

The promising findings in Figure 4 show that analyzing claims data using text mining methods can reveal very interesting interactions and patterns. Similar analysis can be conducted for the other matrices for additional insights.

7. SOCIAL NETWORK ANALYSIS

As health insurance companies shift focus from fraud detection to fraud prevention, building a predictive model to estimate the risk of a provider before the provider makes any claims has been a challenging problem. Furthermore, a substantial amount of healthcare fraud is expected to be hidden in the relationships among providers and between providers and the beneficiaries making insurance claims. In this case study, we present results of applying social network analysis methods [21] to understand the relationships of providers in the healthcare system and to visualize features and patterns of fraudulent behavior in such a network. We describe the construction of social-network features and the predictive model built on those features as a solution for assessing healthcare fraud risk at the time of enrollment.

7.1 Constructing a Provider Social Network

Figure 5: A Sample Provider Network (node types shown: individual providers and organizations)

Providers in the US healthcare system are typically associated with multiple hospitals and health organizations. Information about the providers can be obtained from multiple sources. Some of these data sources are public⁶, while others may be purchased⁷. We use data from such sources to construct a social network in which providers (both individuals and organizations) are the nodes. The edges connect individual nodes to organization nodes (see Figure 5). A graph constructed for all providers in the United States is expected to have nearly 35 million nodes and more than 100 million edges. In this study we construct a graph for providers in the state of Texas with almost 1 million nodes and close to 3 million edges.
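A minimal sketch of this construction, assuming the affiliation data is available as `(individual, organization)` pairs, is given below. The record layout, the identifiers, and the function name `build_provider_network` are hypothetical; the actual enrollment sources have richer schemas.

```python
from collections import defaultdict


def build_provider_network(affiliations):
    """Build an undirected graph as an adjacency map from affiliation pairs.

    Each pair links an individual provider node to an organization node.
    """
    adjacency = defaultdict(set)
    for individual, organization in affiliations:
        adjacency[individual].add(organization)
        adjacency[organization].add(individual)
    return dict(adjacency)


# Hypothetical (individual provider, organization) affiliation records.
affiliations = [
    ("dr_a", "hospital_1"), ("dr_b", "hospital_1"),
    ("dr_b", "clinic_2"), ("dr_c", "clinic_2"),
]
network = build_provider_network(affiliations)

# Node degree, i.e., the number of affiliations of each node.
degree = {node: len(neighbors) for node, neighbors in network.items()}
```

An adjacency map of sets keeps the sketch dependency-free; at the scale of millions of nodes a dedicated graph library or a distributed graph store would be the more realistic choice.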

7.2 Properties of the Provider Network

A snapshot of the provider network for the state of Texas is shown in Figure 6. This provider network differs from a typical “social network” in the following ways: (i) the network consists of both organizations and individuals, which introduces a latent hierarchy because individual physicians work for organizations and can also own group practices; (ii) the network is a collection of disconnected graphs, the largest being a network of a few hundred thousand providers and the smallest having as few as 3; and (iii) the network is constructed from self-reported data and from data inferred using subject matter expertise, and may be subject to omissions, errors, and quality issues.

⁶ https://nppes.cms.hhs.gov/NPPES/
⁷ http://www.healthmarketscience.com/

Figure 6: Snapshot of the provider network for Texas. The width of the circle at each node denotes the number of affiliations. The large circles indicate organizations, such as hospitals. Nodes in red are fraudulent providers.

7.3 Extracting Features from Provider Network

For this particular study, we focus on analyzing the network properties with respect to the fraudulent providers described in Section 5. The objective is to extract multiple features for every node in the network and use them to discriminate between fraudulent and non-fraudulent providers. We investigated several network-based features [21, 23], as listed in Table 1.

Category            Features
Centrality          Degree, Closeness, Betweenness, Load, Current Flow Closeness, Communicability, Current Flow Betweenness, Eigenvector
Assortativity       Average Neighbor Degree, Average Degree Connectivity
Clustering          Triangles, Average Clustering, Square Clustering
Communities         K-Clique
Components          Connectivity
Cores               Core number, k-Core
Distance Measures   Center, Diameter, Eccentricity, Periphery, Radius
Flows               Network Simplex
Link Analysis       PageRank, Hits
Rich Club           Rich Club Coefficient
Shortest Paths      Shortest Path
Vitality            Closeness Vitality

Table 1: Network features studied for the provider network [21].

7.4 Relevance for Identifying Fraud

For each feature we estimated its capability to distinguish between fraudulent and non-fraudulent nodes using the Information Complexity (ICOMP) measure [15], which compares the distributions of the feature for the fraudulent and non-fraudulent populations. The five network-based features that we found to be most distinguishing were: node degree, number of fraudulent providers in the 2-hop network, PageRank, eigenvector centrality, and current-flow closeness centrality. Figure 7 shows the distribution of each feature with respect to the fraudulent and non-fraudulent populations.

For instance, the red line in Figure 7(a) indicates the node degree distribution for providers previously identified as fraudulent. The blue lines are for a random sample of non-fraudulent providers. We observe that an increase in the degree of a provider correlates with a higher risk of fraud. Similar conclusions can be drawn from analyzing the 2-hop network (see Figure 7(b)). In fact, the chance of finding other fraudulent providers within the 2-hop network of a fraudulent provider is ∼40%, compared to a ∼2% chance of finding a fraudulent provider within the 2-hop network of a random provider.
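The 2-hop feature can be illustrated with the following Python sketch. The text does not define the feature precisely, so the sketch assumes it counts known fraudulent providers at graph distance one or two from a node, excluding the node itself; the adjacency map and labels are hypothetical.

```python
def fraud_in_two_hops(adjacency, node, fraud_ids):
    """Count fraudulent providers within two hops of `node`, excluding itself."""
    reachable = set()
    for neighbor in adjacency.get(node, ()):
        reachable.add(neighbor)                         # distance 1
        reachable.update(adjacency.get(neighbor, ()))   # distance 2
    reachable.discard(node)
    return len(reachable & fraud_ids)


# Hypothetical provider network: individuals linked to organizations.
adjacency = {
    "dr_a": {"hospital_1"},
    "dr_b": {"hospital_1", "clinic_2"},
    "dr_c": {"clinic_2"},
    "hospital_1": {"dr_a", "dr_b"},
    "clinic_2": {"dr_b", "dr_c"},
}
fraud_ids = {"dr_b"}

two_hop_counts = {
    node: fraud_in_two_hops(adjacency, node, fraud_ids) for node in adjacency
}
```

Because edges only connect individuals to organizations, two hops from an individual reach exactly the colleagues who share one of its organizations.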

Given the ability of the above features to distinguish between fraudulent and non-fraudulent providers, we plan to utilize them either within an unsupervised multivariate anomaly detection algorithm [7] for automatic detection of such providers or in a binary classification algorithm that learns from the available labeled data.

8. TEMPORAL ANALYSIS OF CLAIMS SEQUENCES

In the context of identifying healthcare fraud perpetrated by providers, two generic response mechanisms are possible: 1) identify and prosecute providers after claims are submitted (pay and chase), and 2) deny payment for a submitted claim in a timely manner based on the associated risk. Whereas legal prosecution is expensive, time-consuming, and difficult to wage, a policy of selective denial of payment based on statistical risk factors is much easier to implement once the risk estimation algorithms have been developed.

In this case study we use temporal analytics to address the following two questions: i) How can we identify the transition of a good provider into a bad actor in an online fashion using the temporal sequence of claims? ii) How can the temporal sequence be used to discriminate fraudulent providers from others?

We pose the first question as a change-point detection problem and employ a statistical process control methodology to identify the transition. The strength of this method is that it can be implemented online to examine each claim as it enters a processing queue for payment. For the second question, we compare the temporal claim submittal patterns of every provider to estimated population norms for similar providers (e.g., by speciality and geographic location) and define features from these comparisons. Classifiers can subsequently be trained to learn the differences between known fraudsters and presumed normal providers.

8.1 Change-point Detection with Statistical Process Control Techniques

The statistical process control (SPC) literature [20] has evolved into a fairly mature technology for implementing a temporal approach to processing data sequences.

A number of methods have been proposed in SPC theory to monitor processes for exceedance of control limits. One popular metric is the cumulative sum (CUSUM) statistic [22]. Here we illustrate an application of CUSUM to identify changes in patient enrollment. A useful assumption in this approach is that fraudulent providers often start “taking” more patients than usual [27].

Suppose we have a time-ordered sequence of claims $X = x_1, x_2, \ldots, x_n$. This sequence could represent all insurance claims submitted by a single provider over a fixed time interval (e.g., one year). One of the simplest SPC statistics is the Bernoulli CUSUM [26], where $X$ is simply a vector of zeros and ones. For example, we can define $x_i$ according to the following:

$$
x_i = \begin{cases} 1 & \text{if the $i$th claim has a new beneficiary number} \\ 0 & \text{otherwise} \end{cases} \tag{1}
$$

This vector tracks the introduction of new beneficiaries in the stream of claims submitted by a specific claimant, and provides a basis for estimating whether a large number of new beneficiaries were seen by a provider during a particular time interval. The Bernoulli CUSUM statistics used to analyze this vector are:

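$$
S_t = \max(0, S_{t-1} + L_t), \quad t = 1, 2, \ldots, \tag{2}
$$

where $S_0 = 0$ and the chart signals if $S_t > h$ for a chosen threshold $h$. The values of the log-likelihood scores are

$$
L_t = \begin{cases} \ln\left(\dfrac{1-p_1}{1-p_0}\right) & \text{if } X_t = 0 \\[1ex] \ln\left(\dfrac{p_1}{p_0}\right) & \text{if } X_t = 1 \end{cases} \tag{3}
$$

where $X_t$ is the $t$th element of the sequence, $p_0$ is the in-control probability of observing a one, and $p_1$ is the out-of-control probability to be detected.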
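Equations (2) and (3) translate directly into a few lines of Python. The sketch below is illustrative only: the function name `bernoulli_cusum`, the indicator sequence, and the values chosen for $p_0$, $p_1$, and $h$ are assumptions, not parameters from the study.

```python
import math


def bernoulli_cusum(x, p0, p1, h):
    """Run a Bernoulli CUSUM over a 0/1 sequence.

    Returns the list of S_t values and the first index t (1-based) at which
    S_t > h, or None if the chart never signals.
    """
    score_one = math.log(p1 / p0)               # L_t when X_t = 1
    score_zero = math.log((1 - p1) / (1 - p0))  # L_t when X_t = 0
    s, path, signal = 0.0, [], None
    for t, value in enumerate(x, start=1):
        s = max(0.0, s + (score_one if value == 1 else score_zero))
        path.append(s)
        if signal is None and s > h:
            signal = t
    return path, signal


# Hypothetical new-beneficiary indicators: the rate jumps in the second half.
x = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
path, signal = bernoulli_cusum(x, p0=0.2, p1=0.6, h=3.0)
```

With these inputs the statistic resets to zero during the early, mostly-zero stretch and first exceeds $h$ at the ninth claim, after three consecutive new beneficiaries. Because each update uses only the previous value $S_{t-1}$, the same loop can run online as claims arrive.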

A more common form of fraud occurs when providers start taking patients with conditions different from their past profile. Given that the condition codes can have multiple categories, the above method needs to be generalized to the multinomial case. The multinomial CUSUM statistic [13] can be applied here as follows:

$$
L_t = \ln\left(\frac{p_{i1}}{p_{i0}}\right) \quad \text{when } X_t = i \tag{4}
$$

where $p_{i1}$ is the probability of the $i$th category under the alternative hypothesis and $p_{i0}$ is its probability under the null hypothesis. A typical multinomial/categorical CUSUM for a “presumed normal” provider is shown in Figure 8(a). The CUSUM statistic spikes when the provider uses condition codes that differ from the typical ones, but falls back to 0 since the atypical behavior is sporadic. On the other hand, Figure 8(b) shows the CUSUM statistic for an unusual (potentially fraudulent) physician who uses many condition codes that are not typical for his speciality; the CUSUM statistic captures this unusual behavior.

A complete set of out-of-control probabilities is selected for the multinomial CUSUM. In the absence of a specific alternative hypothesis, a simple method of formulating the out-of-control probabilities is to assume that every probability moves to become less extreme, i.e., the probabilities migrate in the direction of the grand mean. We specify a proportional change constant to compute the exact probabilities. The alternative hypothesis $p_1$ is then $p_1 = p_0 + c\,(m - p_0)$, where $c$ is the proportional constant and the mean $m$ is simply the reciprocal of the number of categories.
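The multinomial scores of Equation (4) and the proportional shift toward the grand mean can be sketched together as follows. The category labels, the in-control profile, and the constant `c = 0.5` are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import math


def out_of_control_probs(p0, c):
    """Shift each in-control probability toward the grand mean m = 1/k."""
    m = 1.0 / len(p0)
    return {cat: p + c * (m - p) for cat, p in p0.items()}


def multinomial_cusum(x, p0, p1):
    """Return the path S_t = max(0, S_{t-1} + ln(p1[X_t] / p0[X_t]))."""
    s, path = 0.0, []
    for category in x:
        s = max(0.0, s + math.log(p1[category] / p0[category]))
        path.append(s)
    return path


# Hypothetical in-control profile over three condition-code categories.
p0 = {"A": 0.7, "B": 0.2, "C": 0.1}
p1 = out_of_control_probs(p0, c=0.5)

# Typical codes ("A") keep the statistic at zero; atypical codes raise it.
path = multinomial_cusum(["A", "A", "C", "C", "B"], p0, p1)
```

Since the shifts sum to zero across categories, the out-of-control probabilities still sum to one. The sketch assumes every category has a nonzero in-control probability; a code never seen in the reference profile would need smoothing before the log ratio is defined.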

8.2 Anomaly Detection using CUSUM Statis-tic

One promising metric for screening anomalies is the max-imum value of a CUSUM statistic over a fixed time interval.However, this statistic is biased by the number of claimssubmitted by a provider during that interval. Some out-lier providers exhibit continuously anomalous behavior, even

1317

(a) Node Degree (b) Fraud in 2-hop Network (c) Page Rank

(d) Eigenvector Centrality (e) Current-flow Closeness Centrality

Figure 7: Distribution of top distinguishing features for fraudulent (red) vs. non-fraudulent providers (blue).

Figure 8: Multinomial CUSUM chart to track time ordered condition codes from insurance claims. Green squares indicate the typical codes and blue squares indicate the atypical codes for the given speciality. Panels: (a) Normal Provider, (b) Potentially Fraudulent Provider. Each panel plots the diagnosis code and the multinomial CUSUM against time; the legend entries are Diag-Lo, Diag-Hi, and multiCUSUM.

over a large time interval, and their CUSUM statistics often resemble a linear function. This pattern suggests another metric that is not similarly biased by the length of the claims sequence: the average CUSUM rate. Figure 9 shows a scatterplot using these metrics for a typical provider population color-coded by speciality. The scatterplot displays a cornucopia-shaped pattern with the tail originating at the lowest CUSUM values, and the mouth arcing upward toward the highest CUSUM rates. Anomalies are separated from the main cluster near the top of the scatterplot. Further analysis is required to determine whether these anomalies are normal in a statistical sense, or whether they are more likely members of an anomalous cluster. A horizontal line boundary is drawn at a CUSUM rate of about 0.35 to suggest a possible division of outliers from normals.
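The two screening metrics can be sketched as follows. The paper does not give a formula for the average CUSUM rate, so this sketch assumes it is the final CUSUM value divided by the number of claims (the slope of a roughly linear path that starts at zero). The provider identifiers and paths are hypothetical, and the 0.35 default simply reuses the boundary suggested for Figure 9.

```python
def cusum_metrics(path):
    """Return (maximum CUSUM, average CUSUM rate) for one provider's path."""
    if not path:
        raise ValueError("empty CUSUM path")
    max_cusum = max(path)
    # Assumed definition: net growth per claim, i.e. the slope of a
    # straight line from zero to the final CUSUM value.
    avg_rate = path[-1] / len(path)
    return max_cusum, avg_rate

def flag_outliers(paths, rate_threshold=0.35):
    """Return sorted provider ids whose average rate exceeds the threshold."""
    return sorted(
        pid for pid, path in paths.items()
        if cusum_metrics(path)[1] > rate_threshold
    )

# Hypothetical providers: many claims with slow drift vs. few with fast drift.
paths = {
    "long_slow": [0.05 * (t + 1) for t in range(1000)],
    "short_fast": [0.5 * (t + 1) for t in range(40)],
}
flagged = flag_outliers(paths)
```

Here the long, slowly drifting sequence has the larger maximum CUSUM purely because it contains more claims, while only the short, fast-drifting sequence exceeds the rate boundary. That is the length bias the rate metric is meant to avoid.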

8.3 Temporal Feature Construction

The previous section explored the possibility that a provider presumed to be normal up to some position within a temporal claims sequence is diverted either temporarily or permanently to anomalous behavior. Additionally, we expect that some providers are either fraudsters from the moment they

Figure 9: Distribution of CUSUM metrics for a provider population.


enroll in an insurance program, or that they turn to fraudulent activity permanently at some time before the beginning of a limited claims sequence. In this case, the availability of provider ground truth affords the analyst an opportunity to go beyond anomaly detection as a method of identifying potential bad actors. In particular, if every provider can be labeled as either “bad actor” or “presumed normal”, we can discriminate between temporally stable characteristics of normal and bad actors. Here we assume that providers naturally cluster into a main normal group and a main bad actor group. This assumption may only be approximately correct, especially for the class of bad actors if multiple paths exist to fraudulent behaviors.

In this section we extract 10 temporal features for each provider based on their submitted claims. Because some of these features might “help” real providers adapt and avoid future identification, we are not disclosing the actual features in this paper. In general, we consider a set of temporally stable features that are defined over observed fields in a claims sequence. Features may either be direct functions of elements of the claims sequence or goodness-of-fit statistics that compare empirical distributions to normative distributions. While some features are conditioned on the provider speciality (denoted as Spec features), others are independent of it (NonSpec features).

To assess the value of such features, we train two weighted binary logistic regression classifiers [3] for the Spec and NonSpec feature sets, respectively. We use a labeled training set with 8557 instances constructed using several million Medicaid claims. Less than 1% of the labeled set consisted of known fraudulent providers, identified using the fraud data discussed in Section 5. Since the bad actor group was comparatively small, and because we had greater confidence in their labels, the logistic regression was weighted toward the bad actor group by a 10:1 ratio.
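One way to read the 10:1 weighting is as a per-class weight on the log-likelihood, so that each bad actor counts as ten presumed-normal providers. The sketch below fits such a model by plain gradient descent using only the standard library. It is an illustration under that assumption, not the fitting procedure used in the paper, and the single feature and its values are made up.

```python
import math

def predict_proba(w, x):
    """Probability of the positive (bad actor) class; w[0] is the bias."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(X, y, pos_weight=10.0, lr=0.1, epochs=2000):
    """Minimize class-weighted log-loss by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    sample_w = [pos_weight if label == 1 else 1.0 for label in y]
    total = sum(sample_w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi, swi in zip(X, y, sample_w):
            err = swi * (predict_proba(w, xi) - yi)  # weighted residual
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / total for wj, g in zip(w, grad)]
    return w

# Hypothetical one-feature data: nine presumed-normal providers, one bad actor.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9], [0.6]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

w_plain = fit_weighted_logistic(X, y, pos_weight=1.0)
w_weighted = fit_weighted_logistic(X, y, pos_weight=10.0)
p_plain = predict_proba(w_plain, [0.6])
p_weighted = predict_proba(w_weighted, [0.6])
```

With the weighting in place, the fitted model assigns the lone bad actor a higher probability than the unweighted fit does, which is the intended effect of up-weighting a small but trusted class.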

We calculated the sensitivity and specificity of both classifiers along with the area under the curve (AUC). Figure 10 shows that including speciality in the model significantly improved performance (DeLong’s test; Z=7.22, p<.0001) of the Spec model over the NonSpec model, yielding an AUC of 0.814 compared with 0.716.
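For readers who want to reproduce these quantities on their own scores, the following sketch computes sensitivity and specificity at a threshold, and the AUC through its rank interpretation (the probability that a randomly chosen bad actor scores higher than a randomly chosen presumed-normal provider, with ties counted as one half). The scores and labels are invented for illustration, and DeLong's test is not implemented here.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg

# Hypothetical classifier scores: label 1 = bad actor, 0 = presumed normal.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
```

The pairwise loop is quadratic in the worst case, which is acceptable for a labeled set of a few thousand providers with a small positive class.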

Figure 10: ROC curves for Spec and NonSpec classifiers using weighted logistic regression. The curves plot sensitivity against specificity. With specialty: AUC = 0.814; without specialty: AUC = 0.716.

Thus we show that supervised learning methods can be effective in discriminating between bad actors and those presumed normal if the features are well-designed and normative data is available. An immediate next step is to try semi-supervised methods on provider feature vectors so that providers that are not labeled “fraudulent” can be treated as unlabeled.

9. CONCLUSIONS AND FUTURE DIRECTIONS

This paper showcases the relevance of advanced big data analytics in the emerging domain of healthcare analytics using claims data. Our main contribution is the translation of some of the key challenges faced by the healthcare industry into knowledge discovery tasks. The three case studies presented in this paper attack the problem of identifying fraudulent healthcare providers in three independent ways, using state-of-the-art KDD methodologies that have not previously been used in this context. Our results on real fraud data highlight the promise that advanced data analytics hold in this important domain. The analysis for these case studies was conducted using the Hadoop/Hive data platform and open source software such as Mahout, R, and Python networkx⁸, and hence is repeatable in other contexts.

In each case study we identified the potential and challenges associated with the existing analytic solutions. Treating providers and beneficiaries as text documents opens the possibility of using the vast text mining literature and the highly sophisticated text mining tools that have been developed specifically for big text data, and can lead to valuable discoveries as shown in Section 6. Studying affiliations between providers as social networks is valuable, given that organized fraud is rampant in the healthcare system and can be identified by analyzing the relationships between the providers using network science methods. However, we identified certain differences between the provider network and a traditional “social network” which researchers should bear in mind before applying these methods. Temporal analysis methods are also useful because they do not require trained classifiers to identify anomalies, and are sensitive enough to be employed as timely online techniques for detection of transient billing practices that are anomalous. In the future, we intend to combine the features generated from each of the case studies in a multi-view learning framework to better identify fraudulent providers.

An important conclusion from these analyses is that while insurance claims data are typically considered as payment records, they contain valuable information that can be used to answer many other healthcare related questions. For instance, studying topics of diagnosis or drug codes (see Section 6) can be done in the context of beneficiaries to understand the major behavior modes of the population in terms of health indicators. Networks that capture interaction between beneficiaries and providers can be constructed from claims data and can be used in conjunction with the provider network to better understand the healthcare system.

10. ACKNOWLEDGEMENTS

Prepared by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract no. DE-AC05-00OR22725.

⁸ http://networkx.github.com/


11. REFERENCES

[1] National health expenditure projections 2009–2019. Center for Medicare and Medicaid Services, 2010.

[2] World health statistics. WHO Library Cataloguing-in-Publication Data, 2011.

[3] A. Agresti. Categorical Data Analysis, chapter 5. Wiley-Interscience, Hoboken, New Jersey, second edition, 2002.

[4] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In Proceedings of IEEE Computational Intelligence for Financial Engineering, pages 220–226, 1997.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.

[6] Y. M. Chae, et al. Data mining approach to policy analysis in a health insurance domain. International Journal of Medical Informatics, 62(2-3):103–111, 2001.

[7] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection – a survey. ACM Computing Surveys, 41(3), July 2009.

[8] C. T. Chu, et al. Map-Reduce for Machine Learning on Multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS. MIT Press, 2006.

[9] J. Cohen, et al. MAD skills: new analysis practices for big data. Proceedings of the VLDB Endowment, 2(2):1481–1492, Aug. 2009.

[10] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[11] A. M. Garber and J. Skinner. Is American health care uniquely inefficient? Working Paper 14257, National Bureau of Economic Research, August 2008.

[12] R. Ghani and M. Kumar. Interactive learning for efficiently detecting errors in insurance claims. In Proceedings of the 17th ACM SIGKDD, pages 325–333. ACM, 2011.

[13] M. Höhle. Online change-point detection in categorical time series. In T. Kneib and G. Tutz, editors, Statistical Modelling and Regression Structures, pages 377–397. Physica-Verlag HD, 2010.

[14] L. Kenworthy. America's inefficient health-care system: another look. http://lanekenworthy.net/2011/07/10/americas-inefficient-health-care-system-another-look/, 2011.

[15] S. Konishi and G. Kitagawa. Information Criteria and Statistical Modeling (Springer Series in Statistics). Springer, 2007.

[16] B. Long, P. S. Yu, and Z. Zhang. A General Model for Multiple View Unsupervised Learning. In Proceedings of the SDM, 2008.

[17] C. B. Lynde and M. D. Pratt. Acquired perforating dermatosis: association with diabetes and renal failure. Canadian Medical Association Journal, 181(9), 2009.

[18] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[19] J. Manyika, et al. Big data: The next frontier for innovation, competition, and productivity, May 2011.

[20] D. C. Montgomery. Introduction to Statistical Quality Control. John Wiley and Sons, 4th edition, 2001.

[21] M. Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010.

[22] E. S. Page. On problems in which a change can occur at an unknown time. Biometrika, 44(1-2):248–252, 1957.

[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.

[24] S. H. Preston and J. Ho. Low life expectancy in the United States: Is the health care system at fault? In National Research Council (US) Panel on Understanding Divergent Trends in Longevity in High-Income Countries. 2010.

[25] U. Raja, T. Mitchell, T. Day, and J. M. Hardin. Text mining in healthcare. Applications and opportunities. Journal of Healthcare Information Management: JHIM, 22(3):52–56, 2008.

[26] M. Reynolds and Z. Stoumbos. A CUSUM chart for monitoring a proportion when inspecting continuously. Journal of Quality Technology, 1999.

[27] M. Sparrow. License to steal: how fraud bleeds America’s health care system. Westview Press, 2000.

[28] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, 2006.

[29] S. Wasserman and K. Faust. Social Network Analysis. Methods and Applications. Cambridge University Press, 1994.
