Analysis of Drug-Related Deaths in the State of Connecticut

IS6030: Data Management-Individual Project

Topic: Drug-related deaths in the state of Connecticut

A. Data Description:

This dataset lists each accidental death associated with a drug overdose in Connecticut from 2012 to June 2016. The columns from ‘Heroin’ to ‘AnyOpioid’ hold either Y or null, indicating whether that particular drug was a cause of death. A single death can be caused by one or more drugs. The data was derived from investigations by the Office of the Chief Medical Examiner, which include the toxicity report, the death certificate, and a scene investigation. I obtained the data from the catalog.data.gov website at the following link:

https://catalog.data.gov/dataset/accidental-drug-related-deaths-january-2012-sept-2015
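The Y-or-null convention of the drug columns can be illustrated with a short sketch. The records below are made up, and `DRUG_COLUMNS` is only an illustrative subset of the column names, not the full list from the dataset.

```python
# Illustrative subset of the drug flag columns (assumption: not the full list).
DRUG_COLUMNS = ["Heroin", "Cocaine", "Fentanyl"]

def drugs_involved(record):
    """Return the names of the drug columns flagged 'Y' for one death record."""
    return [drug for drug in DRUG_COLUMNS if record.get(drug) == "Y"]

# Made-up records: None stands for a null cell in the source data.
records = [
    {"Heroin": "Y", "Cocaine": None, "Fentanyl": "Y"},
    {"Heroin": None, "Cocaine": "Y", "Fentanyl": None},
]

# Count the deaths attributed to more than one drug.
multi_drug_deaths = sum(1 for r in records if len(drugs_involved(r)) > 1)
print(multi_drug_deaths)
```

Because one death can carry several Y flags, per-drug counts summed across columns can exceed the number of deaths.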

The following table describes the data type, precision, and length of each column:

Table 1


After importing the dataset into SQL Server, I verified that all data types were appropriate and that the data had been imported correctly (the code for this is included in the code file). Next, I ran some basic checks on the important columns, such as listing distinct values, counting null records, and computing the maximum, minimum, and average values of the numerical variables:

Sex:

Race:

Death cause:

Death locations:

This data can be normalized using ‘Case Number’ as the primary key (this column was removed from the dataset because it was not needed for the analysis). Other columns such as age, race, and ‘ImmediateCauseA’ can be moved into separate tables that the main table references through foreign keys.
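A minimal sketch of such a normalized layout is shown below. It uses Python's built-in `sqlite3` as a stand-in for SQL Server, and the table names, column names, and sample row are all hypothetical. Only race is moved into a lookup table here; other columns could follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: a lookup table for race and a main table keyed by case number.
conn.executescript("""
CREATE TABLE race (
    race_id   INTEGER PRIMARY KEY,
    race_name TEXT NOT NULL UNIQUE
);
CREATE TABLE death (
    case_number TEXT PRIMARY KEY,
    age         INTEGER,
    race_id     INTEGER REFERENCES race(race_id)
);
""")

# Made-up sample data, for illustration only.
conn.execute("INSERT INTO race (race_id, race_name) VALUES (1, 'White')")
conn.execute("INSERT INTO death (case_number, age, race_id) VALUES ('CASE-001', 45, 1)")

# Joining the two tables recovers the original flat row.
rows = conn.execute("""
    SELECT d.case_number, d.age, r.race_name
    FROM death AS d
    JOIN race AS r ON r.race_id = d.race_id
""").fetchall()
print(rows)
```

Storing each race name once in the lookup table avoids repeating the same text in every death record.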

B. Data Issues:

Several data issues had to be resolved before the analysis could begin:

1. Null Values: Some columns of the dataset contained null values. As the number was small (at most 7), these records were removed from the dataset. This was done in Excel.

2. Date Format: While importing the dataset into Tableau, I found that the date format was not consistent (I did not face this issue while importing the data into SQL). To solve this, I created two additional columns for year and month. Before this fix, some year values were missing from the visualization because of the improper format.

3. Data Structure: With the original data structure it was not possible to build the required visualizations in Tableau, so the data was restructured in Excel.

4. Inconsistency in Time Frame: Only six months of data are available for 2016, so the average death count per month was used to compare the data across years.

Most of these operations were done in Excel, using functions such as ‘SUMIFS’, ‘CONCAT’, ‘RIGHT’, ‘MID’, ‘YEAR()’, and ‘MONTH()’.
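The year and month split and the per-month averaging described above were done in Excel. The same steps can be sketched in Python, assuming the dates arrive as MM/DD/YYYY strings (the actual export format may differ). The sample dates are made up.

```python
from collections import Counter
from datetime import datetime

# Months of data per year, as described in the report: 2016 covers January to June only.
MONTHS_AVAILABLE = {2012: 12, 2013: 12, 2014: 12, 2015: 12, 2016: 6}

def year_and_month(text):
    """Split a date string into (year, month). Assumes MM/DD/YYYY input."""
    parsed = datetime.strptime(text, "%m/%d/%Y")
    return parsed.year, parsed.month

# Made-up dates, for illustration only.
raw_dates = ["01/15/2015", "03/02/2015", "02/20/2016", "06/30/2016"]

deaths_per_year = Counter(year_and_month(d)[0] for d in raw_dates)

# Dividing by the months available keeps the partial year 2016 comparable to full years.
avg_per_month = {
    year: count / MONTHS_AVAILABLE[year] for year, count in deaths_per_year.items()
}
print(avg_per_month)
```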

C. Data Analysis in SQL:

Total Number of rows:

Total number of columns:

Number of deaths by year (the count for 2016 is lower because only six months of data are available):

Number of deaths by Sex:

Number of deaths by age bracket:

Max, min and average values for age:

Number of deaths by Race:
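The queries behind these summaries are in the separate code file and are not reproduced here. As a rough sketch of the kind of aggregation involved, the example below runs similar queries through Python's `sqlite3`, which stands in for SQL Server. The table name, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deaths (death_year INTEGER, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO deaths VALUES (?, ?, ?)",
    [(2015, "Male", 34), (2015, "Female", 52), (2016, "Male", 47), (2016, "Male", 29)],
)  # made-up rows

# Number of deaths by year.
by_year = conn.execute(
    "SELECT death_year, COUNT(*) FROM deaths GROUP BY death_year ORDER BY death_year"
).fetchall()

# Number of deaths by sex.
by_sex = conn.execute(
    "SELECT sex, COUNT(*) FROM deaths GROUP BY sex ORDER BY sex"
).fetchall()

# Number of deaths by ten-year age bracket (integer division gives the bracket start).
by_bracket = conn.execute(
    "SELECT (age / 10) * 10 AS bracket_start, COUNT(*) "
    "FROM deaths GROUP BY bracket_start ORDER BY bracket_start"
).fetchall()

# Minimum, maximum, and average age.
age_stats = conn.execute("SELECT MIN(age), MAX(age), AVG(age) FROM deaths").fetchone()

print(by_year, by_sex, by_bracket, age_stats)
```

The bracket query relies on SQLite's integer division. In SQL Server the same expression also works when `age` is an integer column.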

D. Primary Data Analysis using Tableau:

The average death count per month has increased at an almost constant rate over the past 5 years:

Fig. 1

From Figures 2, 3 and 4, we can see that although the average number of deaths per month is highest for White people, the areas with the highest number of deaths (count of all deaths from 2012-2016) are mainly concentrated near locations with dense Black, Hispanic and Latino populations:

Fig. 2

Fig. 3

Fig. 4

For all races except Black, Heroin was the leading cause of death; for Black people, Cocaine was the leading cause:

Fig. 5

The average number of deaths per month is highest for the 40-49 age group, and across all age groups people aged 20-60 are the primary victims:

Fig. 6

Heroin is the leading cause of death, followed by Cocaine:

Fig. 7

Compared with all other drugs, Fentanyl shows the largest increase in deaths over the years. As we can see from the figure below, the death count for all other drugs increases steadily, but there is a jump in the number of deaths due to Fentanyl (especially in 2016):

Fig. 8

From the following plot, we can clearly see that the areas with the highest number of deaths are concentrated near locations where per capita income is quite low:

Fig. 9

The following graph shows age against the total number of deaths from 2012 to 2016. From this graph we can see a strong positive correlation between age and number of deaths at the lower end of the age range and a strong negative correlation at the higher end.

Fig. 10

E. Correlation and Regression Analysis using R-studio:

Let’s check the correlation between age and death count and run a regression analysis on it:

R-studio was used to run the statistical analysis on the data.

a. Correlation Analysis:

1. Correlation between age (lower age group, 15-25) and the average number of deaths per year (i.e. total number of deaths / 4.5, as the data covers 4.5 years): 0.9812866

2. Correlation between age (middle age group, 26-44) and the average number of deaths per year: 0.1022106

3. Correlation between age (higher age group, 45-80) and the average number of deaths per year: -0.955015
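The report computed these values in R-studio. For reference, the Pearson correlation coefficient can be sketched as below. The ages and death totals are made up and do not reproduce the report's figures; only the division by 4.5 years follows the report.

```python
from math import sqrt

YEARS_OF_DATA = 4.5  # 2012 through June 2016, as stated in the report

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Made-up data: total deaths at each age over the whole period.
ages = [18, 19, 20, 21, 22]
total_deaths = [10, 16, 29, 35, 46]

# Average number of deaths per year, using the report's divisor.
avg_deaths_per_year = [total / YEARS_OF_DATA for total in total_deaths]

r = pearson(ages, avg_deaths_per_year)
print(round(r, 4))
```

Dividing every total by the same constant does not change the correlation, so the coefficient is the same whether totals or yearly averages are used.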

b. Regression Analysis:

As we can see from the above values, there is a high correlation between age and the average number of deaths per year in the lower and higher age ranges. We will now run the regression analysis (using R-studio) on these two age groups:

1. Regression analysis on Lower Age group (15-25):

The following plot shows the lower age group against the average number of deaths per year:

Fig. 11

Let’s run the regression model on the data:

From the above output we can see that the p-values for both age and the intercept are less than 0.05. This means that the ‘Beta’ coefficient for age is significantly different from 0 and that age is a significant factor in the regression model. As this is a simple linear regression model, the t-test for age and the F-test give the same p-value.

Also, the values of R-squared and adjusted R-squared are quite high: 0.9629 and 0.9583 respectively.

So, the final model that we obtain from the above analysis is:

Average number of deaths per year = 1.7576*(Age) - 28.1160
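As a worked example, the fitted equation can be evaluated at a few ages. The coefficients are the ones reported above; the function name is hypothetical, and the model is only meant for the 15-25 age range it was fitted on.

```python
# Coefficients of the lower age group model, as reported above.
SLOPE = 1.7576
INTERCEPT = -28.1160

def predicted_avg_deaths_per_year(age):
    """Evaluate the fitted line; valid only within the fitted range (15-25)."""
    return SLOPE * age + INTERCEPT

for age in (18, 20, 25):
    print(age, round(predicted_avg_deaths_per_year(age), 3))
```

At age 20, for example, the model gives 1.7576*20 - 28.1160 = 7.036 deaths per year on average.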

Let us take a look at the plot of residuals vs fitted values:

Fig. 12

As we can see from the above plot, there is no specific pattern in the residuals; they are randomly scattered. This suggests that the deterministic part of our model has captured most of the signal in the data and that what remains is just random noise.
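The report's own R code and output are not shown here. As a minimal sketch of what a residuals-vs-fitted plot is built from, the example below fits a least-squares line and computes fitted values and residuals. The data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data, for illustration only.
ages = [16, 17, 18, 19, 20]
avg_deaths = [1.0, 2.5, 3.0, 5.5, 7.0]

slope, intercept = fit_line(ages, avg_deaths)
fitted = [slope * x + intercept for x in ages]

# Residual = observed value minus fitted value.
residuals = [y - f for y, f in zip(avg_deaths, fitted)]
print(list(zip(fitted, residuals)))
```

Plotting `residuals` against `fitted` gives the kind of diagnostic plot discussed above, where a visible curve or funnel shape would point to structure the line has missed.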

Now, let’s check the normality of the residuals using a q-q plot. Normality of the errors is an assumption of the model, and we need to validate it:

Fig. 13

We can clearly see that the above q-q plot is very close to a straight line passing through 0, which supports our assumption that the errors are normally distributed with mean 0 (as the line passes through 0).

2. Regression analysis on Higher Age group (45-80):

The following plot shows the higher age group against the average number of deaths per year:

Fig. 14

Now, let’s run the regression model on the data:

From the above output we can see that the p-values for both age and the intercept are less than 0.05 for the higher age group as well. This means that the ‘Beta’ coefficient for age is significantly different from 0 and that age is a significant factor in the regression model. As this is a simple linear regression model, the t-test for age and the F-test give the same p-value.

Also, the values of R-squared and adjusted R-squared are quite high: 0.9121 and 0.9089 respectively.

So, the final model that we obtain from the above analysis is:

Average number of deaths per year = (-0.91072)*(Age) + 65.06579
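In a simple linear regression, R-squared equals the square of the Pearson correlation, so the reported correlations and R-squared values can be cross-checked. The sketch below does that and also shows the standard adjusted R-squared formula. The report does not state the number of observations, so the sample size of 10 in the last line is purely a hypothetical input.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors=1):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (
        n_observations - n_predictors - 1
    )

# Correlations reported in the correlation analysis.
r_lower = 0.9812866   # lower age group (15-25)
r_higher = -0.955015  # higher age group (45-80)

# Squaring the correlation should reproduce the reported R-squared values.
r2_lower = round(r_lower ** 2, 4)
r2_higher = round(r_higher ** 2, 4)
print(r2_lower, r2_higher)

# Hypothetical sample size; the report does not give the number of observations.
print(round(adjusted_r_squared(0.9629, n_observations=10), 4))
```

The squared correlations round to 0.9629 and 0.9121, which agrees with the R-squared values reported for the two models.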

Let us take a look at the plot of residuals vs fitted values:

Fig. 15

As we can see from the above plot, the residuals form a straight line in the lower region of the fitted values, but overall they look quite scattered. This suggests that the deterministic part of our model has captured most of the signal in the data (specifically at the higher end of the fitted values) and that what remains is just random noise.

Now, let’s check the normality of the residuals using a q-q plot. Normality of the errors is an assumption of the model, and we need to validate it:

Fig. 16

We can see from the above plot that, apart from the curvature around the (-1) quantile, the points mostly follow a straight line.

F. Key Findings and Insights:

1. The areas with the highest number of deaths are concentrated near locations where per capita income is quite low.

2. The areas with the highest number of deaths are mainly concentrated near locations with dense Black, Hispanic and Latino populations, even though the number of drug-related deaths is highest for White people.

3. For all races except Black, Heroin was the leading cause of death; for Black people it was Cocaine.

4. Though Heroin is the main cause, Fentanyl has the highest rate of increase in death count over the years.

5. The average number of deaths per month is highest for the 40-49 age group.

6. The death count peaks around age 30 and age 50, with a dip around age 40.

G. Suggestions:

1. As we can clearly see, the 20-60 age group, which is the backbone generation of any nation, is the primary victim of drugs. This is mainly due to low income, which in turn I think is due to a lack of education (which could provide them with decent jobs). This is a big concern because the number is increasing every year. The government needs to address this issue and plan to provide basic education that can make these people employable.

2. As Fentanyl shows the highest growth in death count, it is not enough to curb the supply of just Heroin or Cocaine.

H. Challenges:

1. Many data issues had to be resolved while plotting the data in Tableau. I learned several Excel functions to overcome them.

2. As there were too many variables in the data, it was difficult to carry out a structured exploratory data analysis and gain meaningful insights. For example, variables such as age, race, and type of drug form a large number of combinations over which the trend in death count could be analyzed.
