employee-attrition-analysis-via-survival-modelling-in-r-1

Estimating a time to en event

Starting with the most basic distinguishing feature that explains the need to use survival modelling… When estimating the average time to an event, a common mistake is to calculate using only the cases where the event has occurred. However, analysing only the data of occurred events leads to biased results. Survival analysis overcomes this by incorporating information from both events that have occurred (censored) and events that have not occurred (uncensored), resulting in a more precise and reliable analysis.

Some other applications of survival analysis include customer loss (aka churn) analysis, product lifespan analysis, insurance risk assessment, and evaluating medical treatment outcomes.

The application in this post will cover a use case from HR domain using the fictitious IBM HR Analytics Attrition Dataset created by IBM data scientists.

Loading the required packages and data

rm(list = ls()) # clear the objects from work environment
library(tidyverse)
library(ggplot2)
library(knitr)
library(skimr)
library(survival)
library(survminer)
library(data.table)
library(htmltools)
df <- fread("WA_Fn-UseC_-HR-Employee-Attrition.csv")
source("custom_themes.R", encoding = "UTF-8") # import pre-built custom themes
kable(head(df)) # the kable() function prints the head of the data in a clear format

Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2
32	No	Travel_Frequently	1005	Research & Development	2	2	Life Sciences	1	8	4	Male	79	3	1	Laboratory Technician	4	Single	3068	11864	0	Y	No	13	3	3	80	0	8	2	2	7	7	3	6

Data quality checks

There are 1470 observations and 35 variables in the dataset, and none of the variables has a missing value.

Data summary
Name	df
Number of rows	1470
Number of columns	35
Key	NULL
_______________________
Column type frequency:
character	9
numeric	26
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Attrition	1	2	3	2
BusinessTravel	1	10	17	3
Department	1	5	22	3
EducationField	1	5	16	6
Gender	1	4	6	2
JobRole	1	7	25	9
MaritalStatus	1	6	8	3
Over18	1	1	1	1
OverTime	1	2	3	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	1	36.92	9.14	18	30.00	36.0	43.00	60	▂▇▇▃▂
DailyRate	1	802.49	403.51	102	465.00	802.0	1157.00	1499	▇▇▇▇▇
DistanceFromHome	1	9.19	8.11	1	2.00	7.0	14.00	29	▇▅▂▂▂
Education	1	2.91	1.02	1	2.00	3.0	4.00	5	▂▃▇▆▁
EmployeeCount	1	1.00	0.00	1	1.00	1.0	1.00	1	▁▁▇▁▁
EmployeeNumber	1	1024.87	602.02	1	491.25	1020.5	1555.75	2068	▇▇▇▇▇
EnvironmentSatisfaction	1	2.72	1.09	1	2.00	3.0	4.00	4	▅▅▁▇▇
HourlyRate	1	65.89	20.33	30	48.00	66.0	83.75	100	▇▇▇▇▇
JobInvolvement	1	2.73	0.71	1	2.00	3.0	3.00	4	▁▃▁▇▁
JobLevel	1	2.06	1.11	1	1.00	2.0	3.00	5	▇▇▃▂▁
JobSatisfaction	1	2.73	1.10	1	2.00	3.0	4.00	4	▅▅▁▇▇
MonthlyIncome	1	6502.93	4707.96	1009	2911.00	4919.0	8379.00	19999	▇▅▂▁▂
MonthlyRate	1	14313.10	7117.79	2094	8047.00	14235.5	20461.50	26999	▇▇▇▇▇
NumCompaniesWorked	1	2.69	2.50	0	1.00	2.0	4.00	9	▇▃▂▂▁
PercentSalaryHike	1	15.21	3.66	11	12.00	14.0	18.00	25	▇▅▃▂▁
PerformanceRating	1	3.15	0.36	3	3.00	3.0	3.00	4	▇▁▁▁▂
RelationshipSatisfaction	1	2.71	1.08	1	2.00	3.0	4.00	4	▅▅▁▇▇
StandardHours	1	80.00	0.00	80	80.00	80.0	80.00	80	▁▁▇▁▁
StockOptionLevel	1	0.79	0.85	0	0.00	1.0	1.00	3	▇▇▁▂▁
TotalWorkingYears	1	11.28	7.78	0	6.00	10.0	15.00	40	▇▇▂▁▁
TrainingTimesLastYear	1	2.80	1.29	0	2.00	3.0	3.00	6	▂▇▇▂▃
WorkLifeBalance	1	2.76	0.71	1	2.00	3.0	3.00	4	▁▃▁▇▂
YearsAtCompany	1	7.01	6.13	0	3.00	5.0	9.00	40	▇▂▁▁▁
YearsInCurrentRole	1	4.23	3.62	0	2.00	3.0	7.00	18	▇▃▂▁▁
YearsSinceLastPromotion	1	2.19	3.22	0	0.00	1.0	3.00	15	▇▁▁▁▁
YearsWithCurrManager	1	4.12	3.57	0	2.00	3.0	7.00	17	▇▂▅▁▁

Survival object creation

To start the analysis, a survival object needs to be created first. This is done via the Surv function.

df[, Attrition := ifelse(Attrition == "Yes", 1, 0)]
s_obj <- Surv(time = df$YearsAtCompany, event = df$Attrition)
head(s_obj)

[1]  6  10+  0   8+  2+  7+

The survival object includes time and censoring information. Numbers represent event duration, and the + symbol indicates the event did not occur during the observation time.

Fitting the K-M model

The Kaplan-Meier method is a non-parametric descriptive model, meaning that it does not rely on any assumptions about the distribution of the variables. It effectively handles censored data.

km_fit <- survfit(s_obj ~ 1) # The 1 represents no grouping variable, so it's a single overall survival curve.
# summary(km_fit)

Building a table including only the elements required for the rest of the application.

# names((summary(km_fit)))
km_fit_summary <- as.data.table(summary(km_fit)[c("time", "n.risk", "n.event", "n.censor", "surv", "std.err")])
kable(km_fit_summary)

time	n.risk	n.event	n.censor	surv	std.err
0	1470	16	28	0.9891156	0.0027062
1	1426	59	112	0.9481915	0.0058260
2	1255	27	100	0.9277922	0.0068977
3	1128	20	108	0.9113420	0.0076939
4	1000	19	91	0.8940265	0.0085117
5	890	21	175	0.8729314	0.0094742
6	694	9	67	0.8616110	0.0100748
7	618	11	79	0.8462749	0.0109051
8	528	9	71	0.8318498	0.0117315
9	448	8	74	0.8169953	0.0126430
10	366	18	102	0.7768152	0.0151588
11	246	2	30	0.7704996	0.0156796
13	200	2	36	0.7627946	0.0164422
14	176	2	16	0.7541265	0.0173602
15	158	1	19	0.7493536	0.0178944
16	138	1	11	0.7439235	0.0185704
17	126	1	8	0.7380193	0.0193388
18	117	1	12	0.7317114	0.0201760
19	104	1	10	0.7246758	0.0211733
20	93	1	26	0.7168835	0.0223335
21	66	1	13	0.7060217	0.0244944
22	52	1	14	0.6924443	0.0275304
23	37	1	1	0.6737296	0.0325312
24	35	1	5	0.6544802	0.0368595
31	16	1	15	0.6135752	0.0525618
32	13	1	2	0.5663771	0.0664105
33	10	1	4	0.5097394	0.0803706
40	1	1	4	0.0000000	NaN

The backbone of the survival analysis is the table above. Each row represents a year of employment, indicating the number of employees still active (n.risk), those who left the organization (n.event), and the corresponding survival probability (surv_prob). The table also includes the number of censored employees (n.censor), which refers to those who are still employed or whose employment status is unknown at that time point. Additionally, the std.error column shows the standard error of the survival probability, providing a measure of the variability and reliability of the survival estimates.

Visualizing the survival probabilities

Show the code

suv_x_lab <- "\nYears At Company"
suv_y_lab <- "Probability of Continued Employment\n"
xmax <- max(df$YearsAtCompany)

ggsurv_all <-
  ggsurvplot(fit = km_fit, 
             data = df, 
             title = "\n\nSurvival Probabilities (All Employees)", 
             xlab = suv_x_lab, 
             ylab = suv_y_lab,
             break.x.by=1,
             break.y.by=.1,
             surv.median.line = c("hv"),
             conf.int = FALSE,
             censor = FALSE,
             xlim = c(0, xmax),
             surv.scale = "percent",
             risk.table = FALSE, 
             fontsize = 4,
             risk.table.title=element_blank(),
             legend.title = element_blank(),
             risk.table.col = "#808080",
             ggtheme = theme_survival,
             legend = "none"
  ) 

ggsurv_all$plot <- ggsurv_all$plot + scale_x_continuous(breaks = seq(0, xmax, 1))
print(ggsurv_all)

The interpretation of the visual begins with understanding the axes. The X-axis is the period (in years) over which the cases in the study were observed. And, the Y-axis shows the percentage of cases that haven’t experienced the event of interest (in this case, employee loss) until the points in time on the X-axis. In other words, it represents the probability of continued employment.

The drops on the line represent the employees who left the company. Each drop lowers the survival percentage, reflecting the probability of continued employment beyond the years represented on the X-axis.

By looking at the graph or km_fit_summary output, the probability of surviving (or remaining employed) at the end of 5 years is approximately 87.3%, while at the end of 10 years, it is around 77.7%. Pretty impressive!

The median survival time is a key metric in survival analysis. It marks the point at which 50% of subjects have experienced the event of interest—in this case, employee loss. To determine this on a graph, examine the Y-axis for the survival probability and locate the 0.5 mark. Trace a horizontal line from this point to the X-axis. The intersection with the X-axis indicates the median survival time. To determine this from a table, find the first instance where the survival probability drops below 0.5 on the surv column. The corresponding time value in this row is identified as the median survival time, which is around the 33rd year in this dataset. However, something to pay attention is the number of observations left after each year. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this example, the level at which 50% of employees have experienced attrition is reached in the year 33 with only 10 observations. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this case, it is better to rely on the value from the latest year that includes 30 or more observations.

Examining the time points with the largest drops in survival percentages is another useful analysis perspective. By doing so, one can identify the years with higher probabilities of employee attrition. In this example, the largest drops are at the end of the 1st and 10th years. If we were to visualize this…

Show the code

km_fit_summary %>% 
  mutate(loss_pct = n.event/n.risk) %>% 
  select(time, loss_pct) %>% 
  filter(time <= 24) %>% 
  ggplot(aes(x = time, y = loss_pct)) +
  # geom_line() +
  geom_point(size=3) +
  geom_line(linewidth=1, color="grey") + 
  labs(title = "Loss Percentage Over Time",
       x = "Years At Company",
       y = "Loss Percentage") +
  theme_survival + 
  scale_x_continuous(breaks = seq(0, 24, 1))

To be continued with a stratification example, which helps to highlight differences in survival rates among distinct subgroups…