
Advanced Predictive Modeling Fall 2015 Term Paper

Classifying Readmissions based on parameters of a Diabetic patient encounter

by Jace Barton, Zahref Beyabani, Anagh Pal, Dhwani Parekh, and Mayur Srinivasan

Abstract

Readmission rates in hospitals are a key indicator of the quality of patient care and a clear indication of the total cost and inconvenience associated with treatment. Patients with serious medical conditions such as diabetes mellitus are key drivers of readmission rates owing to the complexity of their illness. Therefore, being able to predict from certain features whether or not a patient will need readmission can help doctors and hospitals provide better care initially and avoid being financially penalized under Obamacare's readmission policy. Our project explores various classification techniques to predict whether or not a patient will be readmitted. The results obtained are not very promising, however: throughout our work, the evidence suggests that the available predictors do not capture the true causes of readmission.

Introduction

The National Diabetes Statistics Report [1], published in 2014 by the Centers for Disease Control and Prevention, mentions that 29.1 million people, or 9.3% of the U.S. population, have diabetes, of which 8.1 million are undiagnosed. A separate study [2] cited by the WHO states that these numbers are only going to increase, and that a systemic but controllable disease like diabetes is going to be the 7th leading cause of death by 2030. Estimates show that $245 billion is spent in the US on diabetes-related ailments in direct and indirect costs (including disability, work loss, and premature death). A hospital's readmission rate is a significant and direct contributor to total medical expenditure and an apparent indicator of the quality of care provided. Chronic, debilitating, and severe medical conditions such as diabetes are connected with higher risks of hospital readmission.

Our project's aim was to utilize the diabetes dataset available in the UCI machine learning repository and apply concepts learnt in class to determine whether a prediction of whether or not a patient would be readmitted could be made from data collected on in-patient diabetic encounters. From a treatment perspective, being able to tell from the start whether or not a patient will be readmitted could mean a better focus on that patient to prevent the readmission, consequently resulting in a better health outcome. From a fiscal standpoint, it is cheaper for both the patient and the hospital if the readmission rate is low, as the hospital does not need to expend already scarce capacity multiple times if it can provide appropriate treatment the first time. Finally, from an Obamacare compliance outlook, it is better for the hospital to have low readmission rates, as penalties exist for hospitals with high readmission rates. We performed some basic data treatment and split the data into a training and a test set before proceeding with the application of predictive modeling.

Data and Pre-Processing

The original data comes from the Health Facts database (Cerner Corporation, Kansas City, MO), a national data warehouse that collects comprehensive clinical records across hospitals throughout the United States. It contains data systematically collected from participating institutions' electronic medical records and includes encounter data (emergency, outpatient, and inpatient), provider specialty, demographics (age, sex, and race), diagnoses and in-hospital procedures documented by ICD-9-CM codes, laboratory data, pharmacy data, in-hospital mortality, and hospital characteristics. All data were de-identified in compliance with the Health Insurance Portability and Accountability Act of 1996. The data spanned 10 years (1999-2008) of clinical care at 130 hospitals and integrated delivery networks throughout the United States: Midwest (18 hospitals), Northeast (58), South (28), and West (16).

The data used describes unique individual patient encounters in which a primary, secondary, or additional secondary diagnosis was diabetes-related. Information on patient demographics such as age, race, and gender was available, in addition to test results, medications, and, finally, what we picked as the response variable: whether or not the particular patient was readmitted. The dataset subscribes to the following assumptions:

- It was an inpatient (hospital admission) encounter and not just a physician or clinic visit
- The encounter was strictly an in-patient encounter lasting between one and fourteen days
- The encounter resulted in an ICD-9 code pertaining to diabetes being entered as a primary, secondary, or additional secondary diagnosis
- One or more laboratory tests were performed during the patient's admission and/or stay
- Medications were administered or prescribed during the encounter

[1] http://www.cdc.gov/diabetes/pubs/statsreport14/national-diabetes-report-web.pdf
[2] http://www.who.int/mediacentre/factsheets/fs312/en/

Some data preprocessing was required to align the data with our aims for this project, the main

one being coding the response variable to match the stipulation of the Obamacare readmission

penalty policy. The policy defines a readmission as one where the patient was readmitted to the

hospital within 30 days of the first admission. The dataset consisted of more granular data

describing readmissions; hence we simplified it to indicate whether or not a readmission

occurred within 30 days. Other aspects of preprocessing were scaling numeric variables to zero mean and unit variance, and imputing patient race from previous encounter data when it was missing. Additional pre-processing or re-encoding of certain variables was required for particular models; the measures undertaken are described in the sections of this paper that pertain to those models.
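As a rough illustration of these steps, the sketch below (in R) assumes the UCI extract's column names (readmitted, race, patient_nbr, etc.) and an 80:20 split; it is a sketch of the preprocessing described above, not a record of the exact code used.

    # Sketch of the pre-processing steps; column names follow the UCI
    # "Diabetes 130-US hospitals" extract, where "?" marks missing values.
    library(dplyr)

    diab <- read.csv("diabetic_data.csv", stringsAsFactors = FALSE, na.strings = "?")

    # Recode the response to match the 30-day readmission definition
    diab$readmit30 <- ifelse(diab$readmitted == "<30", 1, 0)

    # Scale numeric variables to zero mean and unit variance
    num_cols <- c("time_in_hospital", "num_lab_procedures", "num_medications")
    diab[num_cols] <- scale(diab[num_cols])

    # Impute missing race from another encounter of the same patient, when one exists
    diab <- diab %>%
      group_by(patient_nbr) %>%
      mutate(race = if (all(is.na(race))) race
                    else ifelse(is.na(race), race[!is.na(race)][1], race)) %>%
      ungroup()

    # Stratified 80:20 train/test split that preserves the class balance
    set.seed(1)
    idx <- unlist(lapply(split(seq_len(nrow(diab)), diab$readmit30),
                         function(i) sample(i, round(0.8 * length(i)))))
    train <- diab[idx, ]
    test  <- diab[-idx, ]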

Exploratory Data Analysis

Prior to performing any statistical modeling or training classifiers, it is a good idea to become familiar with the underlying data. Included below is some of our exploratory data analysis, detailing what kind of data exists along with some preliminary insights.

Fig 1. Data Description

Fig 2. Response variable coding

Fig 3. Encounters by age

Fig 4. Patient Demographics

Fig 5. Number of days spent as in-patient by encounters

Association Analysis of Diagnosis Feature for Level Reduction

One of the challenges that we faced quite early into our model development was the

number of levels for certain features. Features with a very high number of levels led to a

significant increase in computation time as well as a huge loss in interpretability of the model.

This resulted in important features being sidelined and biased, while the model results were far

from explicable. We also noticed that many of the models developed were susceptible to the

number of levels in the feature space, and hence reducing the number of levels for certain

features became an operational requirement for us. Major features that exhibited this problem were 'medical specialty' (with over 80 levels) and the 'diagnosis' variables (with over 800 levels overall).

The levels of the 'medical specialty' feature were reduced manually, using our best judgement of which medical specialties are similar or highly under-represented. For example, the following specialties were grouped under a single 'Pediatrics' specialty: Pediatrics, Pediatrics-Allergy and Immunology, Pediatrics-Critical Care, Pediatrics-Emergency Medicine, Pediatrics-Endocrinology, Pediatrics-Hematology-Oncology, Pediatrics-Infectious Diseases, Pediatrics-Neurology, and Pediatrics-Pulmonology. Similarly, we found quite a few groups within 'Surgery' and 'Obstetrics'. Hence, we were able to reduce the number of levels for 'medical specialty' to around 27.
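A hedged sketch of how that manual grouping can be expressed in R follows; the regular expressions and the rarity cutoff are illustrative rather than the exact rules we applied.

    # Illustrative consolidation of 'medical_specialty' (continuing the earlier sketch)
    spec <- as.character(diab$medical_specialty)

    # Collapse all pediatric sub-specialties into a single 'Pediatrics' level
    spec[grepl("^Pediatrics", spec)] <- "Pediatrics"

    # Similar groupings for surgical and obstetric/gynecologic sub-specialties
    spec[grepl("^Surgery|Surgeon", spec)]       <- "Surgery"
    spec[grepl("^Obstetrics|Gynecology", spec)] <- "ObstetricsGynecology"

    # Lump very rare specialties into 'Other' (cutoff chosen for illustration)
    counts <- table(spec)
    spec[spec %in% names(counts[counts < 50])] <- "Other"

    diab$medical_specialty <- factor(spec)
    nlevels(diab$medical_specialty)   # target: roughly 27 levels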


We were faced with a much bigger challenge with respect to the ‘diagnosis’ feature as explained

below.

Diagnosis Variables

The diagnosis variables in our data were spread across three features:

The Primary diagnosis, the Secondary diagnosis, and the Additional Secondary diagnosis (each coded as the first three digits of an ICD-9 code).

Even though every record here is a diabetes-related encounter, diabetes may not be among the primary, secondary, or additional secondary diagnoses in most encounters. These are the cases where diabetes was one of many diagnoses but does not feature in the 'Top 3'.

The challenge with using the diagnosis variables as-is is that, across the primary, secondary, and additional secondary diagnoses, we have close to 800 unique levels. This leads to a breakdown of the models and a major loss of interpretability, as discussed earlier.

Options

With the given scenario, we approached the problem of high ‘vertical’ dimensionality in 2

ways:

1. Convert diagnosis to a binary feature

We could convert the diagnosis variables such that encounters where diabetes (ICD-9 code = 250) is either the primary, secondary, or additional secondary diagnosis form a 'positive class' of 'Diabetes Important Diagnosis'. The negative class comprises encounters where diabetes does not feature in any of the three important diagnosis variables.
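A minimal sketch of this binary re-coding, assuming the three diagnosis columns are named diag_1, diag_2, and diag_3 as in the UCI extract:

    # Option 1: flag encounters where diabetes (ICD-9 250.xx) appears in any of the
    # three diagnosis fields
    is_diab <- function(code) grepl("^250", as.character(code))

    diab$diabetes_dx <- with(diab,
      as.integer(is_diab(diag_1) | is_diab(diag_2) | is_diab(diag_3)))

    table(diab$diabetes_dx)   # positive class = 'Diabetes Important Diagnosis'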

One of the downfalls of this approach is the loss of valuable information in the form of diagnoses

other than diabetes. We have observed empirically that certain lifestyle diseases occur in groups,

and hence we should ideally move towards an approach that takes this inherent association into

consideration.

2. Convert diagnosis into a metric measuring importance of diabetes (including association)

As discussed before, the association between diagnosis is an important factor that we should

ideally be incorporating as we resolve the high dimensionality issue with the diagnosis variables.



We now model the diagnoses through Association Analysis. We use the 'arules' and 'arulesViz' packages in R to set up the model, and subset only the primary, secondary, and additional secondary diagnoses for this exercise.

Given the sparsity of the data based on the high number of unique levels of the variables, we

start with a low support value to generate length-3 rules.
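A sketch of this setup with the 'arules' package is shown below; the support and confidence values are illustrative, since the text only specifies a low support and length-3 rules.

    library(arules)
    library(arulesViz)

    # Treat each encounter's three diagnoses as one "transaction"
    dx <- data.frame(diag_1 = factor(diab$diag_1),
                     diag_2 = factor(diab$diag_2),
                     diag_3 = factor(diab$diag_3))
    trans <- as(dx, "transactions")    # items look like "diag_1=250", "diag_2=428", ...

    rules <- apriori(trans, parameter = list(supp = 0.0005, conf = 0.01,
                                             minlen = 3, maxlen = 3))

    # Keep only rules whose antecedent involves a diabetes diagnosis (ICD-9 250)
    diab_rules <- subset(rules, lhs %pin% "=250")
    inspect(head(sort(diab_rules, by = "lift"), 5))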

We are primarily interested in the rules with antecedents involving diabetes, as we want to generate a metric measuring the occurrence and importance of diabetes. We then divide the rules by their antecedents, as there exists an inherent hierarchy in our features: Primary Diagnosis > Secondary Diagnosis > Additional Secondary Diagnosis. This hierarchy is also helpful in determining the final metric, as a rule with an antecedent involving a Primary Diagnosis should ideally be weighted higher than a rule with an antecedent involving a Secondary Diagnosis.


Table: sample encounters (diag_1, diag_2, diag_3) showing, for each antecedent class (Primary, weighted x5; Secondary, x3; Additional Secondary, x1), the lift of the matching rule, the resulting maximum weighted lift, and the corrected metric when diabetes (250) appears among the top-three diagnoses.

For each of the antecedent classes, we then sort the rules by their lifts, as lift determines the strength of the association. We also use the lift to determine our final metric, as a stronger association between antecedent and consequent should be weighted higher.

Top 5 rules for each of the antecedent classes (Primary Diagnosis is Diabetes, Secondary Diagnosis is Diabetes, Additional Secondary Diagnosis is Diabetes); rule listings omitted.

We then apply the appropriate weights, in a 5:3:1 proportion for the three antecedent classes, to the lifts that each of the rules generates. For every 'transaction' (encounter, in this case), we then have the maximum lift that can be generated by the various rules that apply to it. This maximum lift value is the final metric signifying the importance of diabetes in the overall diagnosis of the patient, and it also incorporates the association between diagnoses. We also account for the cases where diabetes does exist as a primary, secondary, or additional secondary diagnosis, where we introduce a metric floor and ceiling at [15, 75].
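One possible reading of this construction, sketched in R, is below; the lift_* columns are hypothetical names for the best-lift rule found for each antecedent class, and the clamping to [15, 75] is our interpretation of the floor/ceiling correction.

    # Weighted max-lift metric (5:3:1 weights by antecedent class)
    w_primary <- 5; w_secondary <- 3; w_additional <- 1

    diab$diag_metric <- pmax(w_primary    * diab$lift_primary,      # hypothetical columns
                             w_secondary  * diab$lift_secondary,
                             w_additional * diab$lift_additional,
                             na.rm = TRUE)

    # Floor/ceiling correction when diabetes already appears in the top-three diagnoses
    has_diab <- diab$diabetes_dx == 1
    diab$diag_metric[has_diab] <- pmin(pmax(diab$diag_metric[has_diab], 15), 75)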


Cost Penalty Matrix for Readmission from Diabetic Encounters

For a binary classifier like most of the models we’ll develop, the error or

misclassification can be very costly. The cost of misclassification is not incorporated in the

model development in its general form, but the impact of cost can be big enough to influence the

parameter thresholds significantly. As such, the incorporation of misclassification costs changes

the classification process into an optimization problem, wherein the cost minimization is the

objective, and model accuracy is a consequence of this optimization.

For our problem, the cost of misclassification can be illustrated by the following cost penalty matrix (rows are the actual class, columns the predicted class):

- Actual readmission, predicted readmission: 1. Benefit of predicting a right readmission
- Actual readmission, predicted no readmission: 2. Cost incurred by the hospital when the model predicts 'no readmission' but the patient is readmitted
- Actual no readmission, predicted readmission: 3. Cost incurred when a patient isn't readmitted even though the model predicts s/he will be
- Actual no readmission, predicted no readmission: 4. Benefit of predicting a right 'no readmission'

Let’s explore each of the costs one-by-one:

1. Benefit of predicting a right readmission

Most U.S. hospitals will get less money from Medicare in fiscal 2016 because too many patients

return within 30 days of discharge. The readmissions program, created under the Affordable

Care Act, initially evaluated how often patients treated for heart attack, heart failure and

pneumonia had to return to the hospital within 30 days of discharge. Facilities with too high a

readmission rate saw their Medicare payments docked up to 1% in fiscal 2013. The financial

stakes increased to a 2% reduction in fiscal 2014 [3].

Hence, assuming that a hospital has the resources to reduce or control readmission based on an

accurate prediction of readmissions of its patients, we can say that hospitals can save an amount

equivalent to the Medicare penalty discussed above. To do this, we find the value of the penalty

through the average Medicare coverage per capita below:

Value of Penalty Saved = [Total Medicare Spending ($) / Medicare Coverage (persons)] * Penalty (%)

From [1], Total Medicare Spending = $505 billion, and from [2], total Medicare coverage = 52.3 million beneficiaries. We also saw above that the penalty imposed is up to 1% of the Medicare dollar value.

Therefore,


Value of Penalty Saved = -$96.56 (negative because this is a benefit, and not a cost)
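As a quick arithmetic check of that figure (using the spending and enrollment values quoted above):

    total_medicare_spend <- 505e9     # $505 billion
    medicare_coverage    <- 52.3e6    # 52.3 million beneficiaries
    penalty_rate         <- 0.01      # up to 1% of Medicare payments

    round(total_medicare_spend / medicare_coverage * penalty_rate, 2)   # 96.56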

2. Cost incurred by hospital when the model predicts 'no readmission' but a patient is

readmitted

The crux of this entire exercise is to reduce this particular cost: the case where the model predicts that the patient is very unlikely to be readmitted, but the patient does have a readmission encounter. This incurs a huge cost for the hospital, as a) the readmission is unplanned and expensive, and b) it denies medical care to other unplanned patients whose admissions there was no way to predict.

Readmission costs are well documented, as readmission is one of the largest sources of avoidable risk and cost for hospitals. For our problem, we can restrict ourselves to the readmission costs of diabetic encounters, which constitute about 4% of the readmissions [4] that make up the average readmission cost in the report. We can define the False Negative misclassification cost as follows:

Cost of predicting "no readmission" when "readmission" = Total Cost of Diabetes Readmissions / Total Diabetes Readmissions

From [4], we can plug in the values to find the following misclassification cost:

Cost of predicting "no readmission" when "readmission" = $10,590.72

3. Cost incurred when a patient isn't readmitted even though the model predicts s/he will

It is in the best interest of a hospital to know exactly whether a patient will be readmitted. The cost of controlling this risk is very high, and the value of perfect information about a patient's readmission tells us a lot about the cost of preparing for a readmission that goes to waste when the prediction is wrong. A False Positive classification can hence lead to costs of missed opportunity and excess capacity. We can gauge this cost by exploring the average

medical costs for diabetes diagnosis.

From [5], we can observe that the total cost of a diabetes diagnosis is well documented. Our focus is on the tangible medical costs, which are the biggest part of the total. We also observe that the medical costs include costs borne by the patient; for our problem, however, we are primarily concerned with the costs incurred by hospitals in preparing for a diabetic encounter/diagnosis. This is illustrated by the 60% share of hospital costs in the total medical cost. As we are comparing costs at a per capita level, we normalize this value by the total number of diabetes patients in a year, from [6]. The cost of a False Positive misclassification can then be given by:

Cost of predicting "readmission" when "no readmission" = (Total Medical Costs of Diabetes * % hospital costs) / (# of diabetes patients)

Therefore,

Cost of predicting "readmission" when "no readmission" = $3,017.14

4. Benefit of predicting a right 'no readmission'

The benefit of predicting a right 'no readmission' is almost negligible: most cases are 'no readmission', and the hospital does not gain significant value from this information. More importantly, this prediction is simply the majority class and has no direct link to the major Medicare policy decisions that might impact hospitals substantially.

To summarise, the cost penalty matrix captures the factors that influence the various classification and misclassification costs. It allows us to recast the problem as a cost-minimization exercise, which is especially useful in a highly imbalanced data environment, and it helps us identify probability thresholds that are more sensible and appropriate for evaluating our models.

Logistic Regression

Readmission for diabetic patient encounters can be modeled as a probability based on the

given inputs. As such, instead of a classification model, we can aim to predict the probability of

readmission using a logistic regression model. A probability model will help us tweak the final

thresholds at various levels. This is desirable because we have already established that the data is

highly imbalanced, and hence would require further inspection in the classification results.

A logistic regression model is defined as:

ln( F(x) / (1 - F(x)) ) = β0 + β1·x



where F(x) is interpreted as the probability of the dependent variable equaling a "success" or

"case" rather than a failure or non-case. ‘x’ represents the input, independent variable space.

Data Preparation

For our problem, the dependent variable is the “success” or “failure” of readmission

based on the variable space that has been described before. Logistic regression has the ability to

parse both continuous and categorical variables, the latter incorporated into the model through dummy encoding. As most of the variables in the input space are categorical and dummy encoding is easy to implement in R, we use R to build and test the logistic regression model, using the standard glm() function and exploring regularization through the 'glmnet' package.

The input data was processed to account for levels, and appropriate features were converted into factors. We replace the diagnosis variables with the Diabetes Diagnosis metric developed earlier. We also eliminate variables that do not show any variability, as they would adversely affect the performance of the logistic regression model.

The processed data is first divided into training and test sets, maintaining the positive:negative class balance of the original data. This is done to ensure that the model is evaluated on a robust test set, controlling for all factors other than the viability of the model itself.

Model Setup and Results

We first use the glm() function in R to fit the model on the training data. This requires a transformation of the input space into a data matrix. We also have to eliminate any features that do not show any variability: in our case, categorical variables with only one level have to be removed before running the model, as they have no predictive power and lead to matrix-inversion errors during coefficient estimation.
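A minimal sketch of this baseline fit, continuing the earlier preprocessing sketch (the identifier columns dropped here are our assumption about what should be excluded):

    # Drop identifiers, the original multi-level response, and zero-variance columns
    drop_cols <- c("encounter_id", "patient_nbr", "readmitted")
    train_glm <- train[, setdiff(names(train), drop_cols)]
    train_glm <- train_glm[, sapply(train_glm, function(x) length(unique(x)) > 1)]

    fit_glm <- glm(readmit30 ~ ., data = train_glm, family = binomial)
    summary(fit_glm)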

One learning outcome from the results above is that the model contains many features that are insignificant and/or of little consequence (minuscule coefficients). Also, the nature of the feature vectors tends to inflate certain coefficients. This leads us to explore regularization methods that achieve feature selection and/or feature control (through control of inflated coefficient values).

Regularization

We now set up a new model with Lasso regularization using the glmnet package in R. Lasso regularization is run in glmnet() by setting the alpha parameter to 1. The default setting of 10-fold cross-validation is used in this case.
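A sketch of that setup follows (type.measure = "mse" mirrors the MSE criterion discussed here; other settings are glmnet defaults):

    library(glmnet)

    x_train <- model.matrix(readmit30 ~ . - 1, data = train_glm)
    y_train <- train_glm$readmit30

    cv_lasso <- cv.glmnet(x_train, y_train, family = "binomial",
                          alpha = 1, nfolds = 10, type.measure = "mse")
    plot(cv_lasso)         # cross-validated error vs. log(lambda)
    cv_lasso$lambda.min    # penalty value with the lowest cross-validated error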

The results from Lasso are shown below. We first determine the optimum regularization

parameter (lambda) that minimizes the Mean Square Error across the cross validation through

the graph below.

We observe that the Lasso regularization penalty factor for the lowest MSE shrinks the coefficients of almost all of the features towards zero. This implies that the predictive power of most features is very low. The best model fit (lowest MSE) is obtained when only 2 features are retained.

At this stage, we could look at a penalty factor in the vicinity of the above factor, or a penalty factor that achieves a low (but not the lowest) MSE while still retaining a reasonable number of features to work with. We could also look at another kind of regularization, Ridge, to abandon feature selection and focus just on controlling the large coefficients from the original model.


We now set up a new model with Ridge regularization using the glmnet package in R, with the alpha parameter set to 0. The default setting of 10-fold cross-validation is used in this case.

The results from Ridge are shown below. We first determine the optimum regularization parameter (lambda) that minimizes the Mean Square Error across the cross-validation folds, through the graph below.
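The corresponding Ridge fit, and the coefficient extraction behind the list of important variables that follows, might look like this (again a sketch continuing the objects defined above):

    cv_ridge <- cv.glmnet(x_train, y_train, family = "binomial",
                          alpha = 0, nfolds = 10, type.measure = "mse")

    coefs <- as.matrix(coef(cv_ridge, s = "lambda.min"))
    coefs <- coefs[order(abs(coefs[, 1]), decreasing = TRUE), , drop = FALSE]
    head(coefs, 10)   # largest-magnitude coefficients: candidate "important" levels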

Based on the values of the coefficients from Ridge regularization, the important variables/levels in the model are as below. The chance of readmission is higher when:

- visit_count: the more encounters recorded in the data for the patient
- age: 80-90 and, surprisingly, 20-30 as well
- admission type: 'transfer from a hospital' or 'emergency room'
- admission source: 'transfer from inpatient' or 'court/law enforcement'
- discharge disposition: 'expected to return' or 'discharged within institution'
- medical specialty: 'hematology' and 'oncology'
- HbA1c: no A1c test was done
- medications: repaglinide and insulin dosage was increased


Test Performance

Before we can test the model we have just developed, it is important to address the

imbalance in classes in the data. This affects the model performance considerably, as the

threshold for class prediction can no longer remain at the default value of 50%.

We have explored two ways in which we determine an appropriate threshold for this exercise:

a) Manually inspect various values of threshold for a good mix of accuracy and recall

b) Use the cost penalty matrix to determine a good threshold

We now test the Ridge regularized model on the test data that was prepared earlier. The data

preparation for train and test was done together to ensure that there are no dimensional errors

during model testing.

For a classification model such as the logistic regression model just developed, one of the best

ways to gauge the model performance is to observe the confusion matrix and ROC curve. A

good model fit is one where the area under the curve is high, which generally corresponds to higher accuracy. The ROC curve generated for a threshold value of 0.3 is shown below. As indicated in the diagram, we are primarily interested in the low 'false positive rate' region of the graph, due to the extreme class imbalance in the data.
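A sketch of this test-set evaluation is below (the pROC package is used here for the ROC curve as one option; the test matrix is assumed to be built exactly like the training matrix, as the text notes the two were prepared together):

    library(pROC)

    test_glm <- test[, names(train_glm)]                    # same columns as training
    x_test   <- model.matrix(readmit30 ~ . - 1, data = test_glm)

    p_test <- predict(cv_ridge, newx = x_test, s = "lambda.min", type = "response")

    roc_obj <- roc(test_glm$readmit30, as.numeric(p_test))
    plot(roc_obj); auc(roc_obj)

    threshold <- 0.3
    table(Actual = test_glm$readmit30,
          Predicted = as.integer(p_test >= threshold))      # confusion matrix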

We briefly shift our focus to assigning a new threshold based on the cost penalty matrix. Our hypothesis is that we should aim to bring the expected cost of predicting 'readmission' and the expected cost of predicting 'no readmission' as close together as possible, without affecting accuracy considerably. We revisit the cost matrix (values in dollars):

Cost Matrix ($)        Predicted: Readmit    Predicted: No Readmit
Actual: Readmit        -96.56                10,590.72
Actual: No Readmit     3,017.14              0.00


Suppose that the probability threshold (the probability of an actual readmission) is p. Setting the expected cost of predicting 'readmission' equal to the expected cost of predicting 'no readmission' gives:

-96.56p + 3,017.14(1 - p) = 10,590.72p

Hence, p = 0.22.
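The same break-even calculation, written out:

    # Expected cost of predicting 'readmission' equals that of predicting 'no readmission':
    #   -96.56*p + 3017.14*(1 - p) = 10590.72*p
    p <- 3017.14 / (10590.72 + 96.56 + 3017.14)
    round(p, 2)   # 0.22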

We recalibrate our model to reflect this new threshold and obtain the following results.

We see that our new model (threshold) obtains a healthy accuracy of 88.88% while also optimising the cost function based on the previously defined cost penalty matrix. This motivates us to go ahead with this model as the final model from the Ridge regularization of logistic regression.

Artificial Neural Network - Multilayer Perceptron

A multilayer perceptron (MLP) is used to try to improve on the classification accuracy of logistic regression, and also because such networks work particularly well for:

- Capturing associations or discovering regularities within a set of patterns
- Cases where the volume, number of variables, or diversity of the data is very great


- Cases where the relationships between variables are only vaguely understood
- Relationships that are difficult to describe adequately with conventional approaches

Because we lack domain expertise in healthcare, a multilayer perceptron is a useful tool here: the data contains very specific details on medications provided, lab tests performed, and so on, which are difficult to use well without such expertise.

A model on the pre-processed data, without any further processing, was attempted first; it failed because the close to 800 levels in the three diagnosis variables resulted in more than 2,400 dummy variables, which would be practical only on a server. To reduce the number of levels in the categorical variables, the data was converted in the following two ways:

- Physician specialty was consolidated into a lower number of levels (because there were variations in denoting the same specialty)
- The diagnoses were reduced to a binary variable denoting the absence/presence of a diabetes diagnosis

Representing the diagnoses as a binary variable obviously leads to a loss of information, but this is needed to reduce the vertical dimensionality that was impeding the model-building exercise. A workaround, using the association analysis of the diagnoses to retain this information in lower dimensions, is explained in a separate section. The model was trained on a down-sampled version of the training set to ensure better results, because the data is imbalanced (with only 11% of readmissions).
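The paper does not name the MLP implementation used; purely as an illustration, the sketch below uses the RSNNS package in R, whose mlp() exposes the same knobs (hidden units via size, epochs via maxit), together with the down-sampling just described.

    library(RSNNS)

    # Down-sample the majority class to balance the training data (~11% event rate)
    pos <- which(train_glm$readmit30 == 1)
    neg <- sample(which(train_glm$readmit30 == 0), length(pos))
    bal <- train_glm[c(pos, neg), ]

    x_bal <- model.matrix(readmit30 ~ . - 1, data = bal)
    y_bal <- bal$readmit30

    fit_mlp <- mlp(x_bal, y_bal, size = 40, maxit = 500)   # 40 hidden units, 500 epochs
    p_mlp   <- predict(fit_mlp, x_test)                    # x_test from the earlier sketch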

An MLP (Multilayer Perceptron) has a few knobs that can be tweaked to obtain the best performing model: the number of iterations (epochs), the number of nodes, and the number of layers. The model was first tested on an artificially balanced test set to understand its performance. The results are shown below:

The low accuracy rate (60.4%) of the model, even on a balanced data set, can signify one of two things:

- The model is not accurate enough
- There are missing factors/variables in the data (based on prior studies, socio-economic status and factors such as availability of a car are important predictors)

The next step is to find the parameters for which predictions are best on an imbalanced test set, which simulates the real-life scenario. Based on the random training set that was chosen, some of the variables, such as examide, citoglipton, acetohexamide, troglitazone, glimepiride.pioglitazone, metformin.rosiglitazone, and metformin.pioglitazone, were removed because they do not show any variation in the sample.

N_Iter   Hidden units   Accuracy   AUC
500      25             57.16%     0.601
500      30             56.80%     0.586
500      40             70.66%     0.604
500      50             70.65%     0.604
500      70             34.60%     0.429
750      25             56.63%     0.599
1000     20             58.27%     0.603
1000     25             55.70%     0.597
1500     50             30.13%     0.438
400      40             58.85%     0.630
300      40             33.90%     0.432
600      40             37.34%     0.432
1000     40             56.67%     0.592

As seen from the table, none of the models perform particularly well even after tweaking the parameters. For 500 iterations/epochs, AUC and accuracy increase as hidden units are added; this can be attributed to a reduction in bias, but increasing the number of hidden nodes beyond that increases the variance, overfitting the data and performing poorly on the test/holdout set. Similarly, keeping the number of hidden units constant at 40, accuracy and AUC are highest for n_iter = 500; increasing the number of epochs beyond 500 increases the variance, resulting in overfitting.


One of the reasons the models are not robust enough might be that the number of levels for the diagnosis variables was reduced drastically, from roughly 800 to 2 each. To represent the information captured in these variables more faithfully, the binary variables were replaced by the association score (explained in the association analysis section). However, neither the inclusion of the association variable nor adding another layer of hidden units improves the predictive performance. These are strong pointers to the fact that the data is missing variables that are crucial to understanding patterns in patient readmissions.

Performance parameters after inclusion of the association variables for the diagnoses:

N_Iter   Hidden units   Accuracy   AUC
200      40             60.28%     0.6542
300      40             68.29%     0.6797
350      40             67.55%     0.6721
400      40             61.59%     0.6544
500      40             41.54%     0.4515
700      40             41.17%     0.4502
300      50             66.59%     0.6361
400      50             60.35%     0.6739
500      50             34.54%     0.3921


Support Vector Machine Approach to Classification

Our next thought was to use a support vector machine (SVM) to classify patients. We thought this approach would work well because support vector machines are known to work well with high-dimensional data, and they can also account for imbalance in the classification variable of interest (through class weighting). We also knew that SVMs have performed well relative to other models on many classification tasks.

We ran into several difficulties when attempting to apply SVMs to our data, however. The first issue was transforming all of the categorical variables into dummy-variable columns so they could be used in the analysis. While not technically complex to accomplish, given how many rows there were, the resulting explosion in columns led to a matrix too large for the scikit-learn SVC implementation in Python to handle.

We solved this problem in three ways. First, as discussed earlier, we changed our approach to the diagnosis variables, resulting in a binary flag rather than a feature with hundreds of levels. Second, we switched to the LinearSVC estimator instead of SVC. While this limited our choice of kernel, it allowed us to run our classifier on all of the data we had without restricting ourselves to sub-samples. Third, we stored our data in a sparse matrix structure, as many columns consisted mostly of zeroes.

We still had to choose the slack penalty parameter. We used the grid search utilities in scikit-learn to search for the optimal value of the slack penalty parameter using 5-fold cross-validation. However, we still needed to account for the class imbalance, which was accomplished using the class_weight parameter of LinearSVC. Below is a table summarizing our classification accuracy for varying levels of the slack parameter and at three different class weights for the readmission variable. Class weights were centred around 10, as the readmission rate is about 10%.

Classification Accuracies for Support Vector Machines
(rows: slack parameter value; columns: readmit class weight)

Slack    Weight = 9   Weight = 10   Weight = 11
0.01     0.65379      0.62453       0.45176
0.05     0.65283      0.60785       0.46516
0.1      0.65283      0.59751       0.46605
0.2      0.65292      0.59          0.45414
0.3      0.65298      0.57811       0.4667
0.4      0.65304      0.57488       0.47868
0.5      0.65307      0.58717       0.43954
0.6      0.65303      0.58385       0.46832
0.7      0.6466       0.58226       0.42342
0.8      0.64574      0.58729       0.56438
0.9      0.65366      0.60095       0.52788
1        0.6458       0.50602       0.53479
1.25     0.59463      0.57669       0.57291
1.5      0.61636      0.60159       0.5741
1.75     0.61069      0.48473       0.61699
2        0.63502      0.60399       0.68049

For class weights of 9 and 10, the optimal slack penalty is 0.01, while for a class weight of 11,

the optimal slack penalty is 2. We select these models to investigate further on our holdout test

set. The confusion matrices which result from the test set follow.

Confusion Matrix for Slack Penalty = 0.01 and Class Weight = 9

                          Predicted: Won't Readmit   Predicted: Will Readmit
Actual: Didn't Readmit    10320                      7283
Actual: Readmitted        1034                       1152

Accuracy = 57.97%
True Positive Rate = 52.7%
False Positive Rate = 41.37%
Cost (from Cost Penalty Matrix) = $32,813,397.98

Confusion Matrix for Slack Penalty = 0.01 and Class Weight = 10

                          Predicted: Won't Readmit   Predicted: Will Readmit
Actual: Didn't Readmit    10295                      7308
Actual: Readmitted        1026                       1160

Accuracy = 57.89%
True Positive Rate = 53.07%
False Positive Rate = 41.51%
Cost (from Cost Penalty Matrix) = $32,803,328.24

Confusion Matrix for Slack Penalty = 2 and Class Weight = 11

                          Predicted: Won't Readmit   Predicted: Will Readmit
Actual: Didn't Readmit    11449                      6154
Actual: Readmitted        1146                       1040

Accuracy = 63.11%
True Positive Rate = 47.58%
False Positive Rate = 34.96%
Cost (from Cost Penalty Matrix) = $30,604,022.28

If we care most about the true positive rate, then we shouldn’t go with the third model. But by

every other metric, the third model outperforms the first two. Yet, when looking at the ROC

curves of the three models, it’s hard to distinguish between them.


The third model has a smoother curve and, at roughly a 0.4 false positive rate and a 0.55 true positive rate, comes closest to the top-left corner. It does appear that the third model is best among the

SVMs. However, we still achieve higher accuracy rates with other models, notably logistic

regression.

Ensemble Methods

We decided to explore ensemble-based classifiers, assuming that these groupings of weak learners would help improve on the accuracy metrics we had obtained so far.

Ensembles create a set of classifiers and then classify a test data point by taking a weighted

average of the individual “weak” models’ predictions. Specifically we explored random forest

and gradient boosted decision trees.

Random Forest

Random forest is an ensemble of unpruned classification or regression trees, induced

from bootstrap samples of the training data, using random feature selection in the tree induction

process. Predictions are made by aggregating (majority vote for classification or averaging for

regression) the individual predictions of the members of the ensemble. Random forest generally

exhibits a substantial performance improvement over single-tree classifiers such as CART and

C4.5. However, similar to most classifiers, RF can also suffer from the curse of learning from an

extremely imbalanced training data set. As it is constructed to minimize the overall error rate, it

will tend to focus more on the prediction accuracy of the majority class, which often results in

poor accuracy for the minority class. To alleviate the problem, we worked with random forest on

a balanced data set.

Since the event rate in our data was only ~11%, the majority class was down-sampled to attain a 50:50 balance with the minority class. As our base iteration, the model was then run on all independent variables except medical specialty, with the number of trees set to 500 and the number of variables tried at each split set to 4. Interestingly, the random forest implementation in R cannot handle categorical variables with more than 32 levels, so we had to transform variables exceeding that limit.
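A sketch of this base iteration is below (down-sampled 50:50 training data, 500 trees, 4 variables tried at each split); dropping the raw ~700-level diagnosis codes stands in for the level manipulation described above, and continues the earlier sketches.

    library(randomForest)

    rf_data <- train_glm[, setdiff(names(train_glm), c("diag_1", "diag_2", "diag_3"))]
    rf_data[] <- lapply(rf_data, function(x) if (is.character(x)) factor(x) else x)
    rf_data$readmit30 <- factor(rf_data$readmit30)

    # Down-sample the majority class to a 50:50 balance
    pos <- which(rf_data$readmit30 == "1")
    neg <- sample(which(rf_data$readmit30 == "0"), length(pos))
    rf_bal <- rf_data[c(pos, neg), ]

    fit_rf <- randomForest(readmit30 ~ . - medical_specialty, data = rf_bal,
                           ntree = 500, mtry = 4, na.action = na.omit)
    fit_rf$confusion     # out-of-bag confusion matrix
    varImpPlot(fit_rf)   # variable importance, as referenced in Iteration 2 below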

We had a hypothesis that the first, second, and third diagnoses would be important indicators of readmission. Unfortunately, these variables had over 700 levels. We converted them into two-level variables by simply indicating whether or not diabetes was diagnosed. Levels 204 and 205 are indicators of diabetes via diagnostic tests.

Results from the random forest classifier are given below:

Iteration 1: Trained Random Forest on the imbalanced dataset

Confusion Matrix and Statistics:

                 Reference: 0   Reference: 1
Prediction: 0    17606          2262
Prediction: 1    0              1

• Accuracy : 0.8862
• Sensitivity : 1.0000000
• Specificity : 0.0004419


Iteration 2: Trained Random Forest on Balanced dataset with diagnosis variable coded as Yes or

No, as explained earlier

The following plot shows the error rate

The most important variables:


Confusion Matrix and Statistics:

                 Reference: 0   Reference: 1
Prediction: 0    1678           1166
Prediction: 1    608            1074

• Accuracy : 0.608
• Sensitivity : 0.7340
• Specificity : 0.4795

Predicting with the model trained on the sampled balanced dataset on the entire data:

Confusion Matrix and Statistics:

                 Reference: 0   Reference: 1
Prediction: 0    67231          1720
Prediction: 1    20798          9594

• Accuracy : 0.7733
• Sensitivity : 0.7637
• Specificity : 0.8480

Iteration 3: Used association analysis variable for diagnosis instead of diag1, diag2, and diag3 –

slight accuracy improvement

Testing the model on sampled balanced test set

Confusion Matrix and Statistics:

                 Reference: 0   Reference: 1
Prediction: 0    1672           1150
Prediction: 1    614            1090

• Accuracy : 0.6103
• Sensitivity : 0.7314
• Specificity : 0.4866

Predicting with the model trained on the sampled balanced dataset on the entire data:

Confusion Matrix and Statistics:

                 Reference: 0   Reference: 1
Prediction: 0    66986          1608
Prediction: 1    21043          9706

• Accuracy : 0.772
• Sensitivity : 0.7610
• Specificity : 0.8579

All models were trained on the balanced dataset (except iteration 1) and tested on a balanced test

set. In addition, they were also tested on the entire dataset. The train to test ratio used is 80:20.

Generalized Boosting Models – Gradient Boosted Decision Tree

Another form of ensemble uses gradient boosting to improve upon the accuracy of other predictive models, generally (and for the purpose of this project) decision trees. Gradient boosting builds the model in a stage-wise fashion, combining multiple weak learners into a strong learner. As with many predictive models, parameters can be changed and tested to ensure that the model is being trained appropriately and is not overfitting the available data. The specific parameter to be tuned is the number of trees; via cross-validation we determined that the best number of trees is 500, which results in the lowest Bernoulli deviance (it is also the maximum number of trees tried with this model). Another parameter chosen was the interaction depth, which is the level of interaction allowed between predictors. When set too low, we are essentially fitting a collection of decision stumps; therefore, to allow versatility and a better fit to the data while still dodging the problem of overfitting, an interaction depth of three was picked.
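A sketch of this setup with the gbm package follows (the shrinkage value and 5 CV folds are assumptions; the balanced training frame continues the random forest sketch above):

    library(gbm)

    gbm_train <- rf_bal
    gbm_train[] <- lapply(gbm_train, function(x) if (is.character(x)) factor(x) else x)
    gbm_train$readmit30 <- as.integer(as.character(gbm_train$readmit30))   # 0/1 numeric response

    fit_gbm <- gbm(readmit30 ~ ., data = gbm_train, distribution = "bernoulli",
                   n.trees = 500, interaction.depth = 3,
                   shrinkage = 0.01, cv.folds = 5)

    best_trees <- gbm.perf(fit_gbm, method = "cv")   # tree count with lowest CV Bernoulli deviance
    summary(fit_gbm, n.trees = best_trees)           # relative influence of the features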

One clear advantage of using gradient boosting was that it was not thrown off by our heavily diverse factor variables for the primary, secondary, and additional secondary diagnoses. Where other models failed (for example, Random Forest), the only bottleneck for the GBM was compute time.

Dealing with class imbalance is also a challenge with the gradient boosted model, hence

undersampling of the negative class was performed to ensure that the training set available to the

model was balanced. Accuracy measures were as follows:

• Test set accuracy of the gradient boosted model on the imbalanced dataset (11% incident rate): 80%
  o Unfortunately this is poor performance, as the baseline accuracy we would expect is 89%
  o Throughout our experience with this dataset, however, such results have kept occurring, potentially indicating a lack of predictive power in the available variables.
• Test set accuracy of the gradient boosted model on the balanced dataset (50% incident rate): 55%
  o This is only marginally better than randomly assigning outcomes, although the model clearly does better when trained with the class imbalance problem addressed.
  o Again, the limited relevance of the available features is probably the insight being conveyed by this relatively poor accuracy measure.

Using the R package for gradient boosting, relative importance / influence of the available

features was extracted. Shown below is the output.


Clearly, we can see that the number of in-patient visits is the most relevant to readmission

followed by the primary, additional secondary, and secondary diagnoses. Surprisingly, none of

the medical data related variables are good indicators of readmission according to the GBM. As

discussed earlier, we can chalk this up to our dataset not having the most relevant variables that

predict readmission.

Conclusion

In conclusion, after a lot of data preprocessing, cleaning, and predictive modeling, no model's accuracy score has been better than the baseline accuracy. Logistic regression does the best job among our models, while models that usually handle high-dimensional data well, such as SVMs and the ensembles, do not stand out. The poor performance of these classifiers can be attributed to the absence of the underlying predictors of readmission. As Professor Ghosh (the instructor of this class) mentioned, the best predictors of a hospital readmission are not necessarily the medical and biological data; they usually turn out to be socio-economic indicators at the patient level. The dataset did not have this information, and we can stipulate that our models would have performed much better had it been available.


References:

https://onlinecourses.science.psu.edu/stat504/node/149

http://www.ats.ucla.edu/stat/stata/dae/logit.htm

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

http://www.r-bloggers.com/examples-and-resources-on-association-rule-mining-with-r/

http://cran.csiro.au/web/packages/arules/vignettes/arules.pdf

http://www.rdatamining.com/examples/association-rules

https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf

http://chandlerzuo.github.io/blog/2015/03/weightedglm/

http://www.jmlr.org/papers/volume8/owen07a/owen07a.pdf