1. Introduction

In the second part of our series, we will learn about arguably one of the most important parts of conducting survival analysis: characterizing how the effects of covariates on the hazard change over time. The hazard is the instantaneous rate at which the event of interest happens at a given time, among subjects that have not yet experienced it.

Whether it be patient death or equipment failure, we are interested in identifying factors that increase or decrease the probability of the event occurring, using a subset of approaches called survival regression. However, this is not simply a yes-or-no question, as the contribution of a factor can and often does change with time. For example, there is evidence that hormone receptor status increases the risk of breast cancer metastases during the early stages of the disease, but becomes protective later on. The authors of the study suggested that this reversal may explain why previous conventional survival regressions, which did not account for time-dependent effects, did not find hormone receptor status to be a significant factor in breast cancer metastases.

One of the most commonly used models for survival regression is the Cox proportional hazards model, which assumes that each factor under examination has the same multiplicative effect on the hazard at every point in time. If the data in question do not satisfy the proportional hazards assumption, which is quite likely given the complexity of real-life conditions, the results of the Cox model cannot be taken at face value. Nevertheless, determining which factors have time-varying effects can be quite useful in itself for gaining insights into the data.
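
For readers who prefer symbols, the model is usually written as follows. The notation is our own choice for illustration and does not come from the original analysis: $h_0(t)$ is an unspecified baseline hazard, $x$ is the covariate vector and $\beta$ is the vector of coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

The hazard ratio on the right does not depend on $t$, which is the proportional hazards assumption in a nutshell. A time-varying effect corresponds to a coefficient that is really a function of time, $\beta(t)$, rather than a constant.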

If you just landed here, please check out the first part of this series under the "Files" tab for a better introduction to survival analysis.

Let's get started!

2. Import and preprocess data

Here we will use the same IBM Telco customer churn dataset used in part 1 of this series. We will use label encoding to quickly convert the categorical variables to numerical codes, as required by the lifelines package.
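
As a rough idea of what label encoding does, here is a minimal standard-library sketch. It is not the code used to produce the results in this post: the helper `label_encode`, the toy rows and the column names `tenure` and `Churn` are assumptions made for illustration, and the values are made up.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order, starting at 0)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


# Hypothetical toy rows: the values are invented, not taken from the Telco data.
rows = [
    {"Gender": "Female", "InternetService": "DSL", "tenure": 1, "Churn": "No"},
    {"Gender": "Male", "InternetService": "Fiber optic", "tenure": 34, "Churn": "Yes"},
    {"Gender": "Male", "InternetService": "No", "tenure": 2, "Churn": "No"},
]

categorical_columns = ["Gender", "InternetService", "Churn"]  # assumed column names

# Work on copies so the raw rows stay untouched.
encoded_rows = [dict(row) for row in rows]
mappings = {}
for column in categorical_columns:
    codes, mappings[column] = label_encode([row[column] for row in rows])
    for row, code in zip(encoded_rows, codes):
        row[column] = code

for row in encoded_rows:
    print(row)
```

Keep in mind that label encoding imposes an arbitrary order on categories with more than two levels, so the sign and size of the resulting Cox coefficient for such a variable should be read with care.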

3. Test the proportional hazards assumption

3.1 Using Schoenfeld residuals

One of the most commonly used methods to test the proportional hazards assumption is based on scaled Schoenfeld residuals, which are independent of time if the assumption holds. Therefore, for any given covariate, a significant relationship between its scaled Schoenfeld residuals and time indicates that its effect on the hazard is time-dependent.
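
For reference, here is the usual textbook definition, written in notation we are assuming for this post. For a subject $i$ that experiences the event at time $t_i$, with $R(t_i)$ the set of subjects still at risk at that time, the Schoenfeld residual is the observed covariate minus its risk-weighted expectation under the fitted model:

```latex
r_i = x_i - \bar{x}(\hat\beta, t_i),
\qquad
\bar{x}(\hat\beta, t_i) =
\frac{\sum_{j \in R(t_i)} x_j \exp\!\left(\hat\beta^{\top} x_j\right)}
     {\sum_{j \in R(t_i)} \exp\!\left(\hat\beta^{\top} x_j\right)}
```

A commonly used approximation for the scaled version is $r_i^{*} \approx d\,\widehat{\mathrm{Var}}(\hat\beta)\,r_i$, where $d$ is the total number of events. Its appeal is that $\mathrm{E}[r_i^{*}] + \hat\beta \approx \beta(t_i)$, so a trend in the scaled residuals over time hints at a coefficient that changes over time.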

First up, we will use the Python package lifelines to calculate and plot scaled Schoenfeld residuals against various transformations of time for each input variable.
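
To make the idea concrete, here is a hand-rolled, standard-library sketch of the computation for a single covariate. It is not the lifelines implementation: the function names are hypothetical, the toy data are made up, tied event times are not handled, and a plain Pearson correlation stands in for the formal test. With a single covariate, scaling the residuals only multiplies them by a constant, so the correlation with time is the same for scaled and unscaled residuals.

```python
import math


def risk_set_moments(times, x, beta, t):
    """Risk-weighted mean and variance of x over subjects still at risk at time t."""
    at_risk = [xj for tj, xj in zip(times, x) if tj >= t]
    weights = [math.exp(beta * xj) for xj in at_risk]
    total = sum(weights)
    mean = sum(w * xj for w, xj in zip(weights, at_risk)) / total
    var = sum(w * (xj - mean) ** 2 for w, xj in zip(weights, at_risk)) / total
    return mean, var


def fit_cox_1d(times, events, x, n_iter=25):
    """Newton-Raphson on the Cox partial likelihood for one covariate (no ties)."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for t, e, xi in zip(times, events, x):
            if e:
                mean, var = risk_set_moments(times, x, beta, t)
                score += xi - mean
                info += var
        if info == 0.0:
            break
        beta += score / info
    return beta


def schoenfeld_residuals(times, events, x, beta):
    """Sorted (event time, residual) pairs: observed x minus its risk-set expectation."""
    pairs = []
    for t, e, xi in zip(times, events, x):
        if e:
            mean, _ = risk_set_moments(times, x, beta, t)
            pairs.append((t, xi - mean))
    return sorted(pairs)


def pearson(a, b):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    sd_a = math.sqrt(sum((ai - mean_a) ** 2 for ai in a))
    sd_b = math.sqrt(sum((bi - mean_b) ** 2 for bi in b))
    return cov / (sd_a * sd_b)


# Made-up toy data: durations, event indicators (1 = event, 0 = censored), one binary covariate.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

beta_hat = fit_cox_1d(times, events, x)
residuals = schoenfeld_residuals(times, events, x, beta_hat)
event_times = [t for t, _ in residuals]
resid_values = [r for _, r in residuals]

# Three transformations of time; ranks are simply 1..n because event times are sorted and untied.
transforms = {
    "identity": [float(t) for t in event_times],
    "log": [math.log(t) for t in event_times],
    "rank": [float(rank) for rank in range(1, len(event_times) + 1)],
}
correlations = {name: pearson(g, resid_values) for name, g in transforms.items()}
print(beta_hat, correlations)
```

A correlation far from zero under any of the transformations would hint at a time-varying effect. In lifelines itself, this kind of check is available through `CoxPHFitter.check_assumptions` and `lifelines.statistics.proportional_hazard_test`, which work on scaled residuals and report a formal test statistic and p-value.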

We see that many aspects of customer characteristics and purchasing behaviour have time-dependent effects on customers' tendency to leave the company. This is consistent with the non-constant rate of change in the survival curves seen in the first post of this series.

Sticking with our system of aggregating insights from multiple packages and approaches, let's see what results we get using the R package survival, which also examines the relationship between scaled Schoenfeld residuals and time for each variable.

Picking Gender and InternetService as examples of the two extremes, the former satisfying the proportional hazards assumption and the latter violating it, we get a closer look at what a random versus a significant relationship between scaled Schoenfeld residuals and time looks like.

As we have done before, we will save the results from each method for a comparison at the end of the post.

Let's compare the statistical significance, in terms of p-values, of the relationship between scaled Schoenfeld residuals and time for each variable as calculated by the two packages. Once again, we use a balloon plot to represent the p-values for each variable.
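
Before plotting, the comparison can also be boiled down to a simple agreement check. The sketch below is only an illustration: the variable names `var_a` to `var_c`, the p-values and the 0.05 threshold are placeholders we made up, not results from this analysis.

```python
ALPHA = 0.05  # assumed significance threshold

# Placeholder p-values for illustration only; these are NOT results from this post.
p_lifelines = {"var_a": 0.62, "var_b": 0.0004, "var_c": 0.03}
p_r_survival = {"var_a": 0.58, "var_b": 0.0002, "var_c": 0.08}


def flag_time_varying(p_values, alpha=ALPHA):
    """Return the variables whose residual-vs-time relationship is significant."""
    return {name for name, p in p_values.items() if p < alpha}


flagged_python = flag_time_varying(p_lifelines)
flagged_r = flag_time_varying(p_r_survival)

agree = flagged_python & flagged_r      # flagged by both packages
disagree = flagged_python ^ flagged_r   # flagged by exactly one package

print("flagged by both:", sorted(agree))
print("flagged by only one:", sorted(disagree))
```

Variables that end up in the second set are the ones worth a closer look, since the two packages may differ in the time transformation or other defaults they use.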

Unsurprisingly, the two packages identified the same set of time-varying variables (the ones at the bottom, represented by very small, red-coloured circles). I will try to update this post if I find other methods for characterizing time-dependent effects of covariates, so we can have a more meaningful comparison.

In the next few posts, we will explore methods to deal with variables that violate the proportional hazards assumption, including stratification, discretizing continuous variables and introducing time-varying covariates, in addition to the Aalen additive model, which does away with the proportional hazards assumption entirely.

As always, any questions and suggestions are very welcome!

Til next time! :)