Whew! After the last lengthy post about variable importance, I am just popping in with a little fun tidbit that I found while researching for the continuous variable discretization post (coming up soon!).
While binning continuous variables is almost always a bad idea for data modeling, it does seem to have some value in exploratory data analysis. Previously, we have used conditional probability density plots to examine how categorical variables vary in relation to a continuous variable. However, visual inspection of these plots can be inconclusive, as it is unclear which ranges of values, along a smooth continuous distribution, have significantly different probabilities for various levels of a given categorical variable. In addition, the probability density plots do not show the number of data points at various values of the continuous variable.
This is where weight of evidence (WOE)-based binning of continuous variables can offer complementary insights. It splits a continuous variable into distinct value segments, based on the variable's relationship to an outcome variable of interest, so that the segments differ from each other in some significant way. For this reason, the bins determined using this method are more meaningful than bins of arbitrarily chosen width.
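For readers who have not met WOE before, here is a minimal sketch of the arithmetic behind it, written in plain Python rather than R so that it needs no packages. It assumes one common sign convention, WOE = ln(share of all events that fall in the bin / share of all non-events that fall in the bin); some references flip the ratio. The function name `woe_table` and the counts are made up for illustration and come from neither the Telco data nor the scorecard package.

```python
import math

def woe_table(bins):
    """Compute WOE and information value (IV) contribution per bin.

    bins: list of (label, n_event, n_non_event) tuples.
    Assumes every bin has at least one event and one non-event;
    a zero count would need smoothing before taking the log.
    """
    total_event = sum(n_event for _, n_event, _ in bins)
    total_non_event = sum(n_non for _, _, n_non in bins)
    rows = []
    for label, n_event, n_non in bins:
        share_event = n_event / total_event        # share of all events in this bin
        share_non_event = n_non / total_non_event  # share of all non-events in this bin
        woe = math.log(share_event / share_non_event)
        rows.append({
            "bin": label,
            "count": n_event + n_non,
            "event_rate": n_event / (n_event + n_non),
            "woe": woe,
            "iv": (share_event - share_non_event) * woe,
        })
    return rows

# Hypothetical counts (events = churned, non-events = stayed), for illustration only.
toy_bins = [("low", 10, 90), ("mid", 30, 70), ("high", 60, 40)]
rows = woe_table(toy_bins)
total_iv = sum(row["iv"] for row in rows)

for row in rows:
    print(row["bin"], row["count"], round(row["event_rate"], 2), round(row["woe"], 3))
print("total IV:", round(total_iv, 3))
```

Under this convention, a positive WOE marks a bin in which the event (here, churn) is over-represented relative to the whole dataset, and a negative WOE marks one in which it is under-represented.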
Let's take a look!
We will grab our usual IBM Telco customer churn dataset and keep only the numeric variables, plus the outcome variable, Churn.
We will use the R package scorecard, which is designed for assessing credit risk, to perform WOE-based binning of the three continuous variables in the Telco dataset: tenure, MonthlyCharges and TotalCharges. It also has a handy function that plots the resulting bins together with the probability of a given level of the outcome variable, which is Churn in our case, for each bin, so we can identify potential subpopulations of customers with decreased or increased risk of churn. To illustrate how the WOE bin plots and conditional probability density plots can offer complementary insights, we will compare them side-by-side:
We see that both plots identified two groups of customers that are more likely to churn, one group paying \$26-56/month and another paying \$68-106/month. As we had mentioned previously, this could be of interest for the company as these may reflect uncompetitive pricing that should be adjusted. Most interestingly, what the conditional probability density plot does not show is that most customers are in the \$68-106/month tier, which poses a potentially significant problem as these customers are also much more likely to churn than the rest.
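To make the point about bin sizes concrete, here is a small sketch in plain Python (not the scorecard code behind the plots) that tallies, for a given set of break points, how many customers fall into each bin and what fraction of them churned. The break points are the bin edges quoted above; the function name `summarize_bins` and the ten customers are invented for illustration and are not Telco records.

```python
from bisect import bisect_right

def summarize_bins(values, churned, breaks):
    """Count observations and churn rate per bin.

    breaks must be sorted; bin i covers [breaks[i-1], breaks[i]),
    with open-ended bins below the first and above the last break.
    """
    n_bins = len(breaks) + 1
    counts = [0] * n_bins
    events = [0] * n_bins
    for value, did_churn in zip(values, churned):
        i = bisect_right(breaks, value)  # index of the bin this value falls into
        counts[i] += 1
        events[i] += 1 if did_churn else 0
    total = len(values)
    edges = [float("-inf")] + list(breaks) + [float("inf")]
    return [
        {
            "range": f"[{edges[i]}, {edges[i + 1]})",
            "count": counts[i],
            "share": counts[i] / total,
            # Empty bins have no defined churn rate.
            "churn_rate": events[i] / counts[i] if counts[i] else None,
        }
        for i in range(n_bins)
    ]

# Made-up monthly charges and churn flags, for illustration only.
monthly = [20, 22, 30, 45, 60, 70, 80, 95, 100, 110]
churn = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
summary = summarize_bins(monthly, churn, breaks=[26, 56, 68, 106])

for row in summary:
    print(row["range"], row["count"], row["share"], row["churn_rate"])
```

Reading the `count` (or `share`) column next to `churn_rate` is what the WOE bin plot adds over the density plot: a high churn rate matters far more when the bin also holds a large share of the customers.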
Next up, both plots for tenure show a steady decrease in the probability of churn as a customer's time with the company increases. This makes sense: customers who are more likely to leave tend to do so early on, so that over time only the loyal customers remain.
TotalCharges gives us a sense of the interaction between tenure and MonthlyCharges. We see that customers with lower total charges are more likely to leave. However, as this could be due to different combinations of monthly fee and tenure with the company, we try to dig deeper using a multivariate scatter plot (much like the one made in the automated exploratory data analysis post). As there are ~7,000 data points, we will set the transparency (alpha) very low, so that regions of the plot with many data points show up as denser colouring.
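As a rough illustration of why a low alpha reveals density, here is a short Python sketch of how opacity builds up when semi-transparent points are drawn on top of each other. It assumes standard "over" alpha compositing of identically coloured points; the alpha value of 0.1 and the function name `stacked_opacity` are hypothetical and are not the settings used for the plot in this post.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n_points identical points drawn on the same spot.

    Each point lets a fraction (1 - alpha) of what is beneath it show through,
    so n stacked points let (1 - alpha) ** n through in total.
    """
    return 1.0 - (1.0 - alpha) ** n_points

# Hypothetical alpha, for illustration only.
for n in (1, 10, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

A lone point stays faint while a pile of overlapping points approaches full colour, which is what lets crowded regions stand out in a plot with thousands of points.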
This plot gives us an idea of the interaction amongst the four variables of interest: tenure, MonthlyCharges, TotalCharges and Churn. Churned customers are represented by teal circles, and we see that many of them have high monthly fees and short tenures (lower right quadrant), which result in lower total charges (small circles). This potentially identifies a segment of high-paying customers who need some extra attention in the first few months after signing on, in the form of discounts or gifts, in order to retain them for longer.
That's it for this post! That is a whole lot of insight for such a simple analysis.
In an upcoming post, I will look at a variety of methods for discretizing continuous variables. While binning is almost always a bad idea for building data models, converting continuous variables into categorical form is useful for making them available to analysis methods that only work with categorical variables, such as multiple correspondence analysis (MCA) and association rule learning (another post for the near future).
Once again, any questions and suggestions for improvements would be greatly welcomed.
Til next time! :)