Starting with the most basic distinguishing feature that explains the need to use survival modelling… When estimating the average time to an event, a common mistake is to calculate using only the cases where the event has occurred. However, analysing only the data of occurred events leads to biased results. Survival analysis overcomes this by incorporating information from both events that have occurred (censored) and events that have not occurred (uncensored), resulting in a more precise and reliable analysis.
Some other applications of survival analysis include customer loss (aka churn) analysis, product lifespan analysis, insurance risk assessment, and evaluating medical treatment outcomes.
The application in this post will cover a use case from HR domain using the fictitious IBM HR Analytics Attrition Dataset created by IBM data scientists.
Loading the required packages and data
rm(list =ls()) # clear the objects from work environmentlibrary(tidyverse)library(ggplot2)library(knitr)library(skimr)library(survival)library(survminer)library(data.table)library(htmltools)df <-fread("WA_Fn-UseC_-HR-Employee-Attrition.csv")source("custom_themes.R", encoding ="UTF-8") # import pre-built custom themeskable(head(df)) # the kable() function prints the head of the data in a clear format
Age
Attrition
BusinessTravel
DailyRate
Department
DistanceFromHome
Education
EducationField
EmployeeCount
EmployeeNumber
EnvironmentSatisfaction
Gender
HourlyRate
JobInvolvement
JobLevel
JobRole
JobSatisfaction
MaritalStatus
MonthlyIncome
MonthlyRate
NumCompaniesWorked
Over18
OverTime
PercentSalaryHike
PerformanceRating
RelationshipSatisfaction
StandardHours
StockOptionLevel
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager
41
Yes
Travel_Rarely
1102
Sales
1
2
Life Sciences
1
1
2
Female
94
3
2
Sales Executive
4
Single
5993
19479
8
Y
Yes
11
3
1
80
0
8
0
1
6
4
0
5
49
No
Travel_Frequently
279
Research & Development
8
1
Life Sciences
1
2
3
Male
61
2
2
Research Scientist
2
Married
5130
24907
1
Y
No
23
4
4
80
1
10
3
3
10
7
1
7
37
Yes
Travel_Rarely
1373
Research & Development
2
2
Other
1
4
4
Male
92
2
1
Laboratory Technician
3
Single
2090
2396
6
Y
Yes
15
3
2
80
0
7
3
3
0
0
0
0
33
No
Travel_Frequently
1392
Research & Development
3
4
Life Sciences
1
5
4
Female
56
3
1
Research Scientist
3
Married
2909
23159
1
Y
Yes
11
3
3
80
0
8
3
3
8
7
3
0
27
No
Travel_Rarely
591
Research & Development
2
1
Medical
1
7
1
Male
40
3
1
Laboratory Technician
2
Married
3468
16632
9
Y
No
12
3
4
80
1
6
3
3
2
2
2
2
32
No
Travel_Frequently
1005
Research & Development
2
2
Life Sciences
1
8
4
Male
79
3
1
Laboratory Technician
4
Single
3068
11864
0
Y
No
13
3
3
80
0
8
2
2
7
7
3
6
Data quality checks
There are 1470 observations and 35 variables in the dataset, and none of the variables has a missing value.
Data summary
Name
df
Number of rows
1470
Number of columns
35
Key
NULL
_______________________
Column type frequency:
character
9
numeric
26
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
Attrition
0
1
2
3
0
2
0
BusinessTravel
0
1
10
17
0
3
0
Department
0
1
5
22
0
3
0
EducationField
0
1
5
16
0
6
0
Gender
0
1
4
6
0
2
0
JobRole
0
1
7
25
0
9
0
MaritalStatus
0
1
6
8
0
3
0
Over18
0
1
1
1
0
1
0
OverTime
0
1
2
3
0
2
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
Age
0
1
36.92
9.14
18
30.00
36.0
43.00
60
▂▇▇▃▂
DailyRate
0
1
802.49
403.51
102
465.00
802.0
1157.00
1499
▇▇▇▇▇
DistanceFromHome
0
1
9.19
8.11
1
2.00
7.0
14.00
29
▇▅▂▂▂
Education
0
1
2.91
1.02
1
2.00
3.0
4.00
5
▂▃▇▆▁
EmployeeCount
0
1
1.00
0.00
1
1.00
1.0
1.00
1
▁▁▇▁▁
EmployeeNumber
0
1
1024.87
602.02
1
491.25
1020.5
1555.75
2068
▇▇▇▇▇
EnvironmentSatisfaction
0
1
2.72
1.09
1
2.00
3.0
4.00
4
▅▅▁▇▇
HourlyRate
0
1
65.89
20.33
30
48.00
66.0
83.75
100
▇▇▇▇▇
JobInvolvement
0
1
2.73
0.71
1
2.00
3.0
3.00
4
▁▃▁▇▁
JobLevel
0
1
2.06
1.11
1
1.00
2.0
3.00
5
▇▇▃▂▁
JobSatisfaction
0
1
2.73
1.10
1
2.00
3.0
4.00
4
▅▅▁▇▇
MonthlyIncome
0
1
6502.93
4707.96
1009
2911.00
4919.0
8379.00
19999
▇▅▂▁▂
MonthlyRate
0
1
14313.10
7117.79
2094
8047.00
14235.5
20461.50
26999
▇▇▇▇▇
NumCompaniesWorked
0
1
2.69
2.50
0
1.00
2.0
4.00
9
▇▃▂▂▁
PercentSalaryHike
0
1
15.21
3.66
11
12.00
14.0
18.00
25
▇▅▃▂▁
PerformanceRating
0
1
3.15
0.36
3
3.00
3.0
3.00
4
▇▁▁▁▂
RelationshipSatisfaction
0
1
2.71
1.08
1
2.00
3.0
4.00
4
▅▅▁▇▇
StandardHours
0
1
80.00
0.00
80
80.00
80.0
80.00
80
▁▁▇▁▁
StockOptionLevel
0
1
0.79
0.85
0
0.00
1.0
1.00
3
▇▇▁▂▁
TotalWorkingYears
0
1
11.28
7.78
0
6.00
10.0
15.00
40
▇▇▂▁▁
TrainingTimesLastYear
0
1
2.80
1.29
0
2.00
3.0
3.00
6
▂▇▇▂▃
WorkLifeBalance
0
1
2.76
0.71
1
2.00
3.0
3.00
4
▁▃▁▇▂
YearsAtCompany
0
1
7.01
6.13
0
3.00
5.0
9.00
40
▇▂▁▁▁
YearsInCurrentRole
0
1
4.23
3.62
0
2.00
3.0
7.00
18
▇▃▂▁▁
YearsSinceLastPromotion
0
1
2.19
3.22
0
0.00
1.0
3.00
15
▇▁▁▁▁
YearsWithCurrManager
0
1
4.12
3.57
0
2.00
3.0
7.00
17
▇▂▅▁▁
Survival object creation
To start the analysis, a survival object needs to be created first. This is done via the Surv function.
The survival object includes time and censoring information. Numbers represent event duration, and the + symbol indicates the event did not occur during the observation time.
Fitting the K-M model
The Kaplan-Meier method is a non-parametric descriptive model, meaning that it does not rely on any assumptions about the distribution of the variables. It effectively handles censored data.
km_fit <-survfit(s_obj ~1) # The 1 represents no grouping variable, so it's a single overall survival curve.# summary(km_fit)
Building a table including only the elements required for the rest of the application.
The backbone of the survival analysis is the table above. Each row represents a year of employment, indicating the number of employees still active (n.risk), those who left the organization (n.event), and the corresponding survival probability (surv_prob). The table also includes the number of censored employees (n.censor), which refers to those who are still employed or whose employment status is unknown at that time point. Additionally, the std.error column shows the standard error of the survival probability, providing a measure of the variability and reliability of the survival estimates.
Visualizing the survival probabilities
Show the code
suv_x_lab <-"\nYears At Company"suv_y_lab <-"Probability of Continued Employment\n"xmax <-max(df$YearsAtCompany)ggsurv_all <-ggsurvplot(fit = km_fit, data = df, title ="\n\nSurvival Probabilities (All Employees)", xlab = suv_x_lab, ylab = suv_y_lab,break.x.by=1,break.y.by=.1,surv.median.line =c("hv"),conf.int =FALSE,censor =FALSE,xlim =c(0, xmax),surv.scale ="percent",risk.table =FALSE, fontsize =4,risk.table.title=element_blank(),legend.title =element_blank(),risk.table.col ="#808080",ggtheme = theme_survival,legend ="none" ) ggsurv_all$plot <- ggsurv_all$plot +scale_x_continuous(breaks =seq(0, xmax, 1))print(ggsurv_all)
The interpretation of the visual begins with understanding the axes. The X-axis is the period (in years) over which the cases in the study were observed. And, the Y-axis shows the percentage of cases that haven’t experienced the event of interest (in this case, employee loss) until the points in time on the X-axis. In other words, it represents the probability of continued employment.
The drops on the line represent the employees who left the company. Each drop lowers the survival percentage, reflecting the probability of continued employment beyond the years represented on the X-axis.
By looking at the graph or km_fit_summary output, the probability of surviving (or remaining employed) at the end of 5 years is approximately 87.3%, while at the end of 10 years, it is around 77.7%. Pretty impressive!
The median survival time is a key metric in survival analysis. It marks the point at which 50% of subjects have experienced the event of interest—in this case, employee loss. To determine this on a graph, examine the Y-axis for the survival probability and locate the 0.5 mark. Trace a horizontal line from this point to the X-axis. The intersection with the X-axis indicates the median survival time. To determine this from a table, find the first instance where the survival probability drops below 0.5 on the surv column. The corresponding time value in this row is identified as the median survival time, which is around the 33rd year in this dataset. However, something to pay attention is the number of observations left after each year. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this example, the level at which 50% of employees have experienced attrition is reached in the year 33 with only 10 observations. When there is a low number of observation left it is good to be careful with the conclusions including this part of the data. In this case, it is better to rely on the value from the latest year that includes 30 or more observations.
Examining the time points with the largest drops in survival percentages is another useful analysis perspective. By doing so, one can identify the years with higher probabilities of employee attrition. In this example, the largest drops are at the end of the 1st and 10th years. If we were to visualize this…
Show the code
km_fit_summary %>%mutate(loss_pct = n.event/n.risk) %>%select(time, loss_pct) %>%filter(time <=24) %>%ggplot(aes(x = time, y = loss_pct)) +# geom_line() +geom_point(size=3) +geom_line(linewidth=1, color="grey") +labs(title ="Loss Percentage Over Time",x ="Years At Company",y ="Loss Percentage") + theme_survival +scale_x_continuous(breaks =seq(0, 24, 1))
To be continued with a stratification example, which helps to highlight differences in survival rates among distinct subgroups…