In this blog, we’ll explore discrete and categorical features in the Telco Customer Churn dataset using univariate graphical methods.
Recap
Before we begin
Univariate graphical analysis
Conclusion
In part 4 of the series, Guide to Churn Prediction, we analyzed and explored continuous data features in the Telco Customer Churn dataset using graphical methods.
This guide assumes that you are familiar with data types. If you’re unfamiliar, please read blogs on numerical and categorical data types.
Let’s go over a couple of statistical concepts
Balanced
The data is said to be balanced if the number of records in each category is equal or nearly equal.
Imbalanced
The data is said to be imbalanced if the number of records in one category is much greater than the number of records in the other categories.
Note: If the target feature has categorical data, we’ll look at how data is distributed across all of the categories and check if the feature has balanced or imbalanced data.
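To make the idea concrete, here is a minimal sketch of how the share of each category could be checked with pandas. The toy values and the 40% cutoff are assumptions made for illustration (for a target with two categories); they are not taken from the Telco dataset, and there is no single agreed threshold for calling data imbalanced.

```python
import pandas as pd

# Toy target column (hypothetical values, not the Telco data)
target = pd.Series(["No", "No", "No", "Yes", "No", "No", "Yes", "No"], name="Churn Label")

# Share of records in each category, as fractions that sum to 1
shares = target.value_counts(normalize=True)
print(shares)

# Assumed rule of thumb for a two-category target: flag the data as
# imbalanced when the smaller category holds less than 40% of the records
THRESHOLD = 0.40
is_imbalanced = shares.min() < THRESHOLD
print("Imbalanced" if is_imbalanced else "Balanced")
```

In this toy example, "No" makes up 75% of the records and "Yes" makes up 25%, so the sketch reports the data as imbalanced.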
The main purpose of univariate graphical analysis is to understand the distribution patterns of features. To visualize these distributions, we’ll utilize Python libraries like matplotlib and seaborn. These libraries contain a variety of graphical methods (such as bar plots, count plots, KDE plots, violin plots, etc.) that help us visualize distributions in different styles.
Now, let’s perform univariate graphical analysis on discrete and categorical data features.
Let’s start with importing the necessary libraries and loading the cleaned dataset. Check out the link to part 1 to see how we cleaned the dataset.
import pandas as pd
import matplotlib.pyplot as plt  # Python library to plot graphs
import seaborn as sns  # Python library to plot graphs

# Displays graphs inline in the Jupyter notebook
# (keep the comment on its own line, since the magic command treats trailing text as arguments)
%matplotlib inline

df = pd.read_csv('cleaned_dataset.csv')
df  # displays the dataset
Cleaned dataset
Discrete features are of int data type, while categorical features are of object data type.
Note: Sometimes categorical data is represented in the form of numbers. So if a feature is of int data type but holds only a small, fixed set of values that act as labels (1, 2, 3, 4, 5 or 0 and 1, etc.), then it’s a categorical feature; otherwise, it’s a discrete feature.
So let’s check the data types of features using the dtypes attribute and identify discrete and categorical features.
df.dtypes
Data types of features
“Country,” “State,” “City,” “Zip Code,” “Gender,” “Senior Citizen,” “Partner,” “Dependents,” “Phone Service,” “Multiple Lines,” “Internet Service,” “Online Security,” “Online Backup,” “Device Protection,” “Tech Support,” “Streaming TV,” “Streaming Movies,” “Contract,” “Paperless Billing,” “Payment Method,” “Churn Label,” and “Churn Reason” features are of object data type, so these are categorical features.
“Count,” “Tenure Months,” “Churn Value,” “Churn Score,” and “CLTV” features are of the int data type. So let’s look at the values in these features and decide if they’re discrete or categorical features.
Display the int data type features using the select_dtypes() method.
df.select_dtypes(int)
Features of int data type
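One way to support this decision is to count the distinct values in each int feature. The sketch below uses a small toy DataFrame with hypothetical values, and the cutoff of 5 distinct values is an assumption chosen for illustration; on the real dataset, it’s still worth looking at the values themselves before deciding.

```python
import pandas as pd

# Toy data (hypothetical values) with three integer columns
toy = pd.DataFrame({
    "Count": [1, 1, 1, 1, 1, 1],
    "Churn Value": [0, 1, 0, 0, 1, 0],
    "Tenure Months": [1, 72, 24, 5, 60, 12],
})

# Number of distinct values in each numeric column
unique_counts = toy.select_dtypes("number").nunique()
print(unique_counts)

# Assumed cutoff: treat a column with at most MAX_CATEGORIES
# distinct values as categorical, and the rest as discrete
MAX_CATEGORIES = 5
categorical_cols = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
discrete_cols = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("Categorical:", categorical_cols)
print("Discrete:", discrete_cols)
```

On the cleaned dataset, the same check could be run with df.select_dtypes(int).nunique().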
Based on the type of data, separate the features and create 2 new datasets.
Create a dataset df_disc that contains all the discrete features and display the first 5 records using the head() method.
df_disc = df[['Tenure Months', 'Churn Score', 'CLTV']]
df_disc.head()
Discrete features
Create a dataset df_cat that contains all the categorical features and display the first 5 records using the head() method.
df_cat = df[['Country', 'State', 'City', 'Zip Code', 'Count', 'Gender', 'Senior Citizen',
             'Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Internet Service',
             'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
             'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method',
             'Churn Label', 'Churn Value', 'Churn Reason']]

df_cat.head()
Categorical features
We visualize the distributions of discrete and categorical features using graphical methods like count plots, bar plots, pie charts, etc.
Count plots: These plots are graphical representations of the number of records in each category of a feature. Each bar represents a unique value or a category, and the length of each bar represents the number of records that have that value.
fig = plt.figure(figsize=(14, 8))  # sets the size of the figure with width as 14 and height as 8
for i, column in enumerate(df_disc.columns):
    ax = plt.subplot(2, 2, i + 1)  # places each plot in a grid with 2 rows and 2 columns (up to 4 subplots)
    sns.countplot(data=df_disc, x=column)  # creates a count plot for each feature in the df_disc dataset
    ax.set_xlabel(None)  # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}')  # adds a title to each subplot
plt.tight_layout(w_pad=3)  # adds padding between the subplots
plt.show()  # displays the plots
Count plots of discrete features
Let’s take a closer look at the “Tenure Months” plot.
“Tenure Months” count plot
Approximately 600 customers have been with the company for one month, and nearly 400 customers have been with the company for 72 months.
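Reading values off a bar only gives approximate numbers. To get exact counts, we can compute the same tallies that a count plot draws by using value_counts(). The sketch below runs on a few hypothetical tenure values, not on the Telco data.

```python
import pandas as pd

# Toy tenure values (hypothetical, not the Telco data)
tenure = pd.Series([1, 1, 1, 2, 72, 72, 5, 1], name="Tenure Months")

# value_counts() tallies how many records share each value,
# which is the bar height a count plot draws;
# sort_index() orders the result by tenure instead of by count
counts = tenure.value_counts().sort_index()
print(counts)

# Look up the exact count for specific tenures
print("1 month:", counts.loc[1])
print("72 months:", counts.loc[72])
```

On the cleaned dataset, the equivalent call would be df['Tenure Months'].value_counts().sort_index().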
fig = plt.figure(figsize=(14, 22))  # sets the size of the figure with width as 14 and height as 22
for i, column in enumerate(df_cat.columns[4:-2]):  # skips the first 4 and the last 2 features
    ax = plt.subplot(7, 3, i + 1)  # creates a grid with 7 rows and 3 columns; it can display up to (7*3)=21 subplots
    sns.countplot(data=df_cat, x=column)  # creates a count plot for each selected feature in the df_cat dataset
    ax.set_xlabel(None)  # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}')  # adds a title to each subplot
    plt.xticks(rotation=25)  # rotates the x-axis values of the current subplot by 25 degrees
plt.tight_layout(w_pad=3)  # adds padding between the subplots
plt.show()  # displays the plots
Count plots of categorical features
The company provides various services to its customers, such as phone, internet, and multiple telephone lines, along with additional services such as online security, online backup, and device protection plans.
Now, let’s take a closer look at all the plots.
Next, let’s look at the distribution of categories in the target feature “Churn Label” and see if the data is balanced or imbalanced.
“Yes” represents churned customers, while “No” represents non-churned customers.
Compared to the number of non-churned customers (~5000), the number of churned customers is quite low (~1900), i.e., the records are not evenly distributed among the categories. This indicates that the data is imbalanced.
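To put a rough number on this imbalance, we can work through the arithmetic using the approximate counts read off the plot. These are rounded figures from the text above, not exact counts from the dataset.

```python
# Approximate counts read off the "Churn Label" plot (rounded values)
non_churned = 5000
churned = 1900

total = non_churned + churned
churn_share = churned / total  # fraction of customers who churned
ratio = non_churned / churned  # non-churned customers per churned customer

print(f"Churned share: {churn_share:.1%}")
print(f"Non-churned to churned ratio: {ratio:.1f} : 1")
```

With these rounded counts, churned customers make up a little over a quarter of the records, and there are roughly 2.6 non-churned customers for every churned one.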
As we’ve seen, univariate graphical analysis is one of the simplest ways of analyzing data, and it helps us understand each feature better.
That’s it for this blog. Next in the series, we’ll perform multivariate graphical analysis and find reasons for customer churn.
Thanks for reading!!