Data Analysis Exercise

Suicide Rates in Georgia Analysis

Author

Vijay Panthayi

About the Data Set

(The data set was obtained from the data.CDC.gov)

This dataset contains the age-adjusted death rates for the 10 leading causes of death in the United States beginning in 1999. The age-adjusted rates (per 100,000 population) were based on the 2000 standard population. While it contains an incredibly detailed report about all top 10 causes of death in each of the 50 states (and the District of Columbia) from 1999 to 2017, this data analysis will focus on the temporal trends of suicide rates in Georgia.

Loading the dataset

First, we are going to load the necessary packages for this exercise.

#loading the required packages
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at C:/Users/vjpan/Desktop/MADA/MADA2023/vijaypanthayi-MADA-portfolio
library(dplyr)

Now we can read the original data set into R Studio.

#loading the dataset into R
toptendeath <- read_csv(here("NCHS_-_Leading_Causes_of_Death__United_States.csv"))
Rows: 10868 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): 113 Cause Name, Cause Name, State
dbl (1): Year
num (2): Deaths, Age-adjusted Death Rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(toptendeath)
Rows: 10,868
Columns: 6
$ Year                      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
$ `113 Cause Name`          <chr> "Accidents (unintentional injuries) (V01-X59…
$ `Cause Name`              <chr> "Unintentional injuries", "Unintentional inj…
$ State                     <chr> "United States", "Alabama", "Alaska", "Arizo…
$ Deaths                    <dbl> 169936, 2703, 436, 4184, 1625, 13840, 3037, …
$ `Age-adjusted Death Rate` <dbl> 49.4, 53.8, 63.7, 56.2, 51.8, 33.2, 53.6, 53…

Cleaning the dataset

The following steps will be to clean the data so that it only contains the factors we will be analyzing and that the data for those variables are clean/usable.

First we are going to reclassify the categorical variables.

toptendeath <- toptendeath %>%
    mutate(`113 Cause Name` = as.factor(`113 Cause Name`)) %>%
    mutate(`Cause Name` = as.factor(`Cause Name`)) %>%
    mutate(State = as.factor(State))

glimpse(toptendeath)
Rows: 10,868
Columns: 6
$ Year                      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
$ `113 Cause Name`          <fct> "Accidents (unintentional injuries) (V01-X59…
$ `Cause Name`              <fct> Unintentional injuries, Unintentional injuri…
$ State                     <fct> United States, Alabama, Alaska, Arizona, Ark…
$ Deaths                    <dbl> 169936, 2703, 436, 4184, 1625, 13840, 3037, …
$ `Age-adjusted Death Rate` <dbl> 49.4, 53.8, 63.7, 56.2, 51.8, 33.2, 53.6, 53…

Next, we are going to remove all data points that are not from Georgia. In addition, we are going to remove the 113 Cause Name, as that is not the cause of death category we are using.

clean_toptendeath <- toptendeath %>% select(-(`113 Cause Name`))
glimpse(clean_toptendeath)
Rows: 10,868
Columns: 5
$ Year                      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
$ `Cause Name`              <fct> Unintentional injuries, Unintentional injuri…
$ State                     <fct> United States, Alabama, Alaska, Arizona, Ark…
$ Deaths                    <dbl> 169936, 2703, 436, 4184, 1625, 13840, 3037, …
$ `Age-adjusted Death Rate` <dbl> 49.4, 53.8, 63.7, 56.2, 51.8, 33.2, 53.6, 53…
ga_clean_toptendeath <- subset(clean_toptendeath, clean_toptendeath$State == "Georgia")
glimpse(ga_clean_toptendeath)
Rows: 209
Columns: 5
$ Year                      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20…
$ `Cause Name`              <fct> Unintentional injuries, All causes, Alzheime…
$ State                     <fct> Georgia, Georgia, Georgia, Georgia, Georgia,…
$ Deaths                    <dbl> 4712, 83098, 4290, 4399, 4866, 2348, 18389, …
$ `Age-adjusted Death Rate` <dbl> 45.2, 793.7, 46.0, 43.5, 46.2, 21.5, 175.8, …
ga_suicide <- subset(ga_clean_toptendeath, ga_clean_toptendeath$`Cause Name` == "Suicide")
glimpse(ga_suicide)
Rows: 19
Columns: 5
$ Year                      <dbl> 2017, 2016, 2015, 2014, 2013, 2012, 2011, 20…
$ `Cause Name`              <fct> Suicide, Suicide, Suicide, Suicide, Suicide,…
$ State                     <fct> Georgia, Georgia, Georgia, Georgia, Georgia,…
$ Deaths                    <dbl> 1451, 1409, 1317, 1295, 1212, 1168, 1157, 11…
$ `Age-adjusted Death Rate` <dbl> 13.6, 13.3, 12.7, 12.6, 12.0, 11.7, 11.7, 11…

This step created three different data set variables: clean_toptendeath is the original data set without the “113 Cause Name” column ga_clean_toptendeath is the clean_toptendeath but only with the observations regarding Georgia ga_suicide is the ga_clean_toptendeath but only with the observations regarding suicide

Now that the data is cleaned and only contains the yearly suicide deaths in Georgia from 1999 - 2017, we can save this new data set.

ga_suicide %>% saveRDS(here::here("ga_suicide.rds"))

Exploring the dataset

The following added by SETH LATTNER

First, I want to get an idea of the yearly trends in suicide deaths in GA, based on the clean RDS file created by Vijay. To do this, I will plot both suicide deaths and age-adjusted suicide death rate in GA versus year.

library(ggplot2)

ga_suicide_data <- readRDS(here::here("ga_suicide.rds"))

ggplot(ga_suicide_data, aes(Year, Deaths))+
  geom_line(size = 1.5, color="darkcyan")+
  xlab("Year")+
  ylab("Deaths")+
  labs(title = "Suicide Deaths in GA, 1999-2017")+
  theme_bw()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

ggplot(ga_suicide_data, aes(Year, `Age-adjusted Death Rate`))+
  geom_line(size = 1.5, color="plum")+
  xlab("Year")+
  ylab("Age-Adjusted Death Rate")+
  labs(title = "Age-Adjusted Suicide Death Rate in GA, 1999-2017")+
  theme_bw()

There appears to be an increase in both suicide deaths and age-adjusted suicide death rates in GA during the study period. I will do some simple modeling to determine if there is any statistical correlation.

#linear model of suicide deaths vs year
suicide_model <- lm(Deaths~Year, data = ga_suicide_data)
summary(suicide_model)

Call:
lm(formula = Deaths ~ Year, data = ga_suicide_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-103.74  -36.24   12.94   33.04   83.25 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -62058.06    4880.06  -12.72 4.12e-10 ***
Year            31.45       2.43   12.94 3.15e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 58.02 on 17 degrees of freedom
Multiple R-squared:  0.9078,    Adjusted R-squared:  0.9024 
F-statistic: 167.4 on 1 and 17 DF,  p-value: 3.152e-10
#linear model of age-adjusted suicide death rate vs year
death_rate_model <- lm(`Age-adjusted Death Rate`~Year, data = ga_suicide_data)
summary(death_rate_model)

Call:
lm(formula = `Age-adjusted Death Rate` ~ Year, data = ga_suicide_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2158 -0.3530  0.1593  0.4218  0.8600 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -239.2081    51.8124  -4.617 0.000246 ***
Year           0.1249     0.0258   4.841 0.000153 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.616 on 17 degrees of freedom
Multiple R-squared:  0.5796,    Adjusted R-squared:  0.5548 
F-statistic: 23.44 on 1 and 17 DF,  p-value: 0.000153

Now I want to plot these models against the data.

ggplot(ga_suicide_data, aes(Year, Deaths))+
  geom_point(cex=3)+
  geom_smooth(method = "lm", color = "darkcyan")+
  xlab("Year")+
  ylab("Deaths")+
  labs(title = "Suicide Deaths in GA, 1999-2017")+
  theme_bw()

ggplot(ga_suicide_data, aes(Year, `Age-adjusted Death Rate`))+
  geom_point(cex=3)+
  geom_smooth(method = "lm", color = "plum")+
  xlab("Year")+
  ylab("Age-Adjusted Death Rate")+
  labs(title = "Age-Adjusted Suicide Death Rate in GA, 1999-2017")+
  theme_bw()

While this analysis was rather rudimentary based on the dataset at hand, the inclusion of additional covariates from related datasets could yield more informative results. Overall however, there was a statistically significant increase in the number of suicide deaths and the age-adjusted suicide death rate in GA between the years 1999 and 2017.