Estimating a time to en event

Starting with the most basic distinguishing feature that explains the need to use survival modelling… When estimating the average time to an event, a common mistake is to calculate using only the cases where the event has occurred. However, analysing only the data of occurred events leads to biased results. Survival analysis overcomes this by incorporating information from both events that have occurred (censored) and events that have not occurred (uncensored), resulting in a more precise and reliable analysis.

Some other applications of survival analysis include customer loss (aka churn) analysis, product lifespan analysis, insurance risk assessment, and evaluating medical treatment outcomes.

The application in this post will cover a use case from HR domain using the fictitious IBM HR Analytics Attrition Dataset created by IBM data scientists.


Loading the required packages and data

rm(list = ls()) # clear the objects from work environment
library(tidyverse)
library(ggplot2)
library(knitr)
library(skimr)
library(survival)
library(survminer)
library(data.table)
library(htmltools)
df <- fread("WA_Fn-UseC_-HR-Employee-Attrition.csv")
source("custom_themes.R", encoding = "UTF-8") # import pre-built custom themes
kable(head(df)) # the kable() function prints the head of the data in a clear format
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 2 Female 94 3 2 Sales Executive 4 Single 5993 19479 8 Y Yes 11 3 1 80 0 8 0 1 6 4 0 5
49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 3 Male 61 2 2 Research Scientist 2 Married 5130 24907 1 Y No 23 4 4 80 1 10 3 3 10 7 1 7
37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 4 Male 92 2 1 Laboratory Technician 3 Single 2090 2396 6 Y Yes 15 3 2 80 0 7 3 3 0 0 0 0
33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 4 Female 56 3 1 Research Scientist 3 Married 2909 23159 1 Y Yes 11 3 3 80 0 8 3 3 8 7 3 0
27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 1 Male 40 3 1 Laboratory Technician 2 Married 3468 16632 9 Y No 12 3 4 80 1 6 3 3 2 2 2 2
32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1 8 4 Male 79 3 1 Laboratory Technician 4 Single 3068 11864 0 Y No 13 3 3 80 0 8 2 2 7 7 3 6


Data quality checks

There are 1470 observations and 35 variables in the dataset, and none of the variables has a missing value.

Data summary
Name df
Number of rows 1470
Number of columns 35
Key NULL
_______________________
Column type frequency:
character 9
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Attrition 0 1 2 3 0 2 0
BusinessTravel 0 1 10 17 0 3 0
Department 0 1 5 22 0 3 0
EducationField 0 1 5 16 0 6 0
Gender 0 1 4 6 0 2 0
JobRole 0 1 7 25 0 9 0
MaritalStatus 0 1 6 8 0 3 0
Over18 0 1 1 1 0 1 0
OverTime 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.92 9.14 18 30.00 36.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 802.49 403.51 102 465.00 802.0 1157.00 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.19 8.11 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.91 1.02 1 2.00 3.0 4.00 5 ▂▃▇▆▁
EmployeeCount 0 1 1.00 0.00 1 1.00 1.0 1.00 1 ▁▁▇▁▁
EmployeeNumber 0 1 1024.87 602.02 1 491.25 1020.5 1555.75 2068 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.72 1.09 1 2.00 3.0 4.00 4 ▅▅▁▇▇
HourlyRate 0 1 65.89 20.33 30 48.00 66.0 83.75 100 ▇▇▇▇▇
JobInvolvement 0 1 2.73 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.06 1.11 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.73 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6502.93 4707.96 1009 2911.00 4919.0 8379.00 19999 ▇▅▂▁▂
MonthlyRate 0 1 14313.10 7117.79 2094 8047.00 14235.5 20461.50 26999 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.69 2.50 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.21 3.66 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.08 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StandardHours 0 1 80.00 0.00 80 80.00 80.0 80.00 80 ▁▁▇▁▁
StockOptionLevel 0 1 0.79 0.85 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.28 7.78 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.80 1.29 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.76 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 7.01 6.13 0 3.00 5.0 9.00 40 ▇▂▁▁▁
YearsInCurrentRole 0 1 4.23 3.62 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.19 3.22 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.12 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁


Survival object creation

To start the analysis, a survival object needs to be created first. This is done via the Surv function.

df[, Attrition := ifelse(Attrition == "Yes", 1, 0)]
s_obj <- Surv(time = df$YearsAtCompany, event = df$Attrition)
head(s_obj)
[1]  6  10+  0   8+  2+  7+

The survival object includes time and censoring information. Numbers represent event duration, and the + symbol indicates the event did not occur during the observation time.


Fitting the K-M model

The Kaplan-Meier method is a non-parametric descriptive model, meaning that it does not rely on any assumptions about the distribution of the variables. It effectively handles censored data.

km_fit <- survfit(s_obj ~ 1) # The 1 represents no grouping variable, so it's a single overall survival curve.
# summary(km_fit)


Building a table including only the elements required for the rest of the application.

# names((summary(km_fit)))
km_fit_summary <- as.data.table(summary(km_fit)[c("time", "n.risk", "n.event", "n.censor", "surv", "std.err")])
kable(km_fit_summary)
time n.risk n.event n.censor surv std.err
0 1470 16 28 0.9891156 0.0027062
1 1426 59 112 0.9481915 0.0058260
2 1255 27 100 0.9277922 0.0068977
3 1128 20 108 0.9113420 0.0076939
4 1000 19 91 0.8940265 0.0085117
5 890 21 175 0.8729314 0.0094742
6 694 9 67 0.8616110 0.0100748
7 618 11 79 0.8462749 0.0109051
8 528 9 71 0.8318498 0.0117315
9 448 8 74 0.8169953 0.0126430
10 366 18 102 0.7768152 0.0151588
11 246 2 30 0.7704996 0.0156796
13 200 2 36 0.7627946 0.0164422
14 176 2 16 0.7541265 0.0173602
15 158 1 19 0.7493536 0.0178944
16 138 1 11 0.7439235 0.0185704
17 126 1 8 0.7380193 0.0193388
18 117 1 12 0.7317114 0.0201760
19 104 1 10 0.7246758 0.0211733
20 93 1 26 0.7168835 0.0223335
21 66 1 13 0.7060217 0.0244944
22 52 1 14 0.6924443 0.0275304
23 37 1 1 0.6737296 0.0325312
24 35 1 5 0.6544802 0.0368595
31 16 1 15 0.6135752 0.0525618
32 13 1 2 0.5663771 0.0664105
33 10 1 4 0.5097394 0.0803706
40 1 1 4 0.0000000 NaN


The backbone of the survival analysis is the table above. Each row represents a year of employment, indicating the number of employees still active (n.risk), those who left the organization (n.event), and the corresponding survival probability (surv_prob). The table also includes the number of censored employees (n.censor), which refers to those who are still employed or whose employment status is unknown at that time point. Additionally, the std.error column shows the standard error of the survival probability, providing a measure of the variability and reliability of the survival estimates.

Visualizing the survival probabilities

Show the code
suv_x_lab <- "\nYears At Company"
suv_y_lab <- "Probability of Continued Employment\n"
xmax <- max(df$YearsAtCompany)

ggsurv_all <-
  ggsurvplot(fit = km_fit, 
             data = df, 
             title = "\n\nSurvival Probabilities (All Employees)", 
             xlab = suv_x_lab, 
             ylab = suv_y_lab,
             break.x.by=1,
             break.y.by=.1,
             surv.median.line = c("hv"),
             conf.int = FALSE,
             censor = FALSE,
             xlim = c(0, xmax),
             surv.scale = "percent",
             risk.table = FALSE, 
             fontsize = 4,
             risk.table.title=element_blank(),
             legend.title = element_blank(),
             risk.table.col = "#808080",
             ggtheme = theme_survival,
             legend = "none"
  ) 

ggsurv_all$plot <- ggsurv_all$plot + scale_x_continuous(breaks = seq(0, xmax, 1))
print(ggsurv_all)


The interpretation of the visual begins with understanding the axes. The X-axis is the period (in years) over which the cases in the study were observed. And, the Y-axis shows the percentage of cases that haven’t experienced the event of interest (in this case, employee loss) until the points in time on the X-axis. In other words, it represents the probability of continued employment.

The drops on the line represent the employees who left the company. Each drop lowers the survival percentage, reflecting the probability of continued employment beyond the years represented on the X-axis.

By looking at the graph or km_fit_summary output, the probability of surviving (or remaining employed) at the end of 5 years is approximately 87.3%, while at the end of 10 years, it is around 77.7%. Pretty impressive!


The median survival time is a key metric in survival analysis. It marks the point at which 50% of subjects have experienced the event of interest—in this case, employee loss. To determine this on a graph, examine the Y-axis for the survival probability and locate the 0.5 mark. Trace a horizontal line from this point to the X-axis. The intersection with the X-axis indicates the median survival time. To determine this from a table, find the first instance where the survival probability drops below 0.5 on the surv column. The corresponding time value in this row is identified as the median survival time, which is around the 33rd year in this dataset. However, something to pay attention is the number of observations left after each year. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this example, the level at which 50% of employees have experienced attrition is reached in the year 33 with only 10 observations. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this case, it is better to rely on the value from the latest year that includes 30 or more observations.


Examining the time points with the largest drops in survival percentages is another useful analysis perspective. By doing so, one can identify the years with higher probabilities of employee attrition. In this example, the largest drops are at the end of the 1st and 10th years. If we were to visualize this…

Show the code
km_fit_summary %>% 
  mutate(loss_pct = n.event/n.risk) %>% 
  select(time, loss_pct) %>% 
  filter(time <= 24) %>% 
  ggplot(aes(x = time, y = loss_pct)) +
  # geom_line() +
  geom_point(size=3) +
  geom_line(linewidth=1, color="grey") + 
  labs(title = "Loss Percentage Over Time",
       x = "Years At Company",
       y = "Loss Percentage") +
  theme_survival + 
  scale_x_continuous(breaks = seq(0, 24, 1))


To be continued with a stratification example, which helps to highlight differences in survival rates among distinct subgroups…