Data Analyst Project: Business Decision Research

Prediction Customer Churn

7 min readJan 6, 2021

Customer churn is a customer that did not do any transaction in a certain time. In this project, customer churn is a customer who does not make any transactions to the store during the last 6 months.

Data

This dataset is imaginary data and being a part of the DQLab Project. In the explanation, this data belongs to DQLab Sports Center, built-in 2013 with data collection from 2013 until early 2019.

Data Exploration

The first step to know the data is to do some exploration. Such as preview dataset, get dataset info, check missing value, and get summary statistics.

Import, Preview, Get Info, and Summary Statistics Dataset

Preview Dataset

From this preview, the dataset has numerical and categorical data types.

Get Dataset Info

Dataset is consist of 10.000 entries (rows) and 8 columns, with int64 and object datatype. There is no non-null data in this dataset.

Check Missing Value

To make sure this dataset does not have any missing value, check the missing value from each column. From this result, the dataset did not contain any missing or NaN value.

Summary Statistics

This part gives some statistical information from the numerical column (default). Mean, maximum and minimal value, percentile, and other statistics.

Data Cleansing and Manipulation

The data exploration step gives some information, the next to do in data cleansing step.

Drop unnecessary column(s)

Drop column ‘no’ dan ‘Row_Num’

Column no and Row_Num has no specific meaning, it’s like an index column. So it can be removed.

Convert Data Type

From dataset preview column First_Transaction and Last_Transaction can be datetime datatype. Example, from 1.000000e+05 (int64) to 2018–10-05 05:48:02.608000040 (datetime64[ns]). Date in YYYY-MM-DD format.

Convert Data Type from int64 to datetime64[ns]

Add New Necessary Column

Year Value from each column (First_Transaction and Last_Transaction)

Add Column ’ Year’ from ‘First_Transaction’ and ‘Last_Transaction’ Columns

2. Category from Average_Transaction_Amount. These columns divide into eight categories.

Categorized Customers based on Average_Transaction_Amount

3. Add a column is_churn to labeling the customer churn or not

First, get the last transaction from the Last_Transaction column dataset and save it to date_last_transaction. Subtract 6 months with help from relativedelta library.

Then filter customer with two labels, False and True. If True, the customer didn’t do any transaction in the last 6 six months.

From the result, the last date transaction in 2019–02–01 and six months before in 2018–08–01.

Last Transaction and Label Churn Customer

Preview and Info Dataset After Data Cleansing ad Manipulation

Preview Dataset After Cleansing and Manipulation

Data Visualization

Visualized data to get an insight into the dataset. Some information we can get such as transactions per year, customers per year, the average amount of transactions per year, and trend products per year.

Number of Customer per Year

To get the number of customers each year, column customer should be group by Year_First_Transaction.

Bar Chart: Number Customer per Year

From the following graph, it can be seen that the number of customers has increased each year. The dataset contains data up to February 2019. So total customers in 2019 are still small.

Number of Transaction per Year

Bar Chart: Number of Transaction per Year

In 2018, the number of transactions decreased significantly by 199,210
transaction. This number greater than in 2016, which 18,738 transactions.

Number of Average Amount Transaction per Year

Based on data “Number of Transaction” transactions in 2018 decreased. But the number of transactions (in Billions) in 2018 still stable. The difference amount transaction from 2017 to 2018 decreased by 0.5 Billion.

Trend Product per Year

In the last five years, the jaket is a product on demand is increasing from year to year.

Proportion Churn by Product

Pie graph show in subplots model. This graph needs a pivot table that subplot the graph based on columns in the pivot table.

The graphs above show customer churn was more than 60% for each product. Assume customers need more variation of each product.

Customer Distribution by Average Transaction Amount Group

Many customers spend the amount in the range of 1.000.000 until 2.500.000 for each transaction.

Modeling

Data Correlation

Variable category with the object or category the data type can convert to a numerical way. For example in the Product column, the value of ‘0’ for ‘Baju’ and ‘2’ for a ‘Jaket’. This process is called encoding categorical values. Column to convert Product, is_churn, and Average_Transaction_Amount_Group.

Encode Categorical Value on Column ‘Product’ Using pandas.cat.codes

Before (left) and After (right) Encoding Categorical Value

This process helps get the correlation between variables using a visualization graph and the machine learning model can train these variables.

Correlation between Variables using Heatmap

Visualize the data correlation between variables can use seaborn heatmap. The column is_churn will use as a label of the dataset. The correlation between label is_churn and other variables is negative. Variable such as transaction amount, product, and first-year transaction are a few top correlations.

Features and Label

X is features columns and y is label column. Column is_churn as the label.

Dataset features contain all columns except is_churn, First_Transaction, and Last_Transaction. Because of column First_Transaction and Last_Transaction in datetime64[ns] data type, not numerical data type (int or float). These columns already save year value in Year_First_Time and Year_Last_Time column.

Data train and test

Split Data Train and Test

This dataset split into 75% (75.000 entries) data train and 25% (25.000 entries) data test.

Machine Learning Method

Logistic Regression

Logistic regression for classification problems with label or is_churn column in numeric type. Initially, the label consisted of True and False values (object datatype) and now changes to 1 and 0 (int64 data type).

Logistic Regression Model

Confusion Matrix and Classification Report **Linear Regression Model**

The precision value for label 0 is quite low at 0.31. This means the model tends to predict the label 0 incorrectly. The model accuracy at 0.73.

2. Decision Tree

Decision Tree Model

With the Decision Tree model, accuracy gets better. Model get accuracy at 0.86 and model precision both labels is near 0.80. This model performs much better than the Logistic Regression model.

3. K-Nearest Neighbor

K-Nearest Neighbor Model with Neigbour=7

Confusion Matrix and Classification Report **K-Nearest Neighbour Model**

This model gets accuracy at 0.72. Compare the precision score from Logistic Regression Model near similar.

4. Logistic Regression with Exclude Lower Correlation Features

This model uses different features for data train. From the heatmap before, the lower correlation features more than -0.30 are Customer_ID and Year_Last_Transaction. Because of that, this model excludes this feature from the X variable used before.

Logistic Regression Model with X (data train) Exclude Lower Correlation Features

Confusion Matrix and Classification Report **Logistic Regression Model Exclude Features**

The confusion matrix shows the model can’t predict label 0 correctly. All of the label 0 is predicted as label 1, because of that the precision value for label 0 is 0. The Accuracy of this model at 0.67.

This model can be concluded, that removing features can have an impact on the prediction performance of the model. Although these features have low correlation, they are still important and needed.

Wrap’s up

This Project is a part of the Data Analyst Project in the DQLab Platform.

Hello there, this is my first story to tell about the project. I hope I can grow much better in the future. Appreciate any feedback from readers. Hopefully, this story can be useful.

For project demo or source code, you can visit my GitHub on https://github.com/galihwangiputri/customer-churn.git

Thanks for coming 👋.

Reference

DQLab Data Science Learning Platform