Bank Churn Prediction using EvalML Library
About this Dataset
Dataset Link — https://www.kaggle.com/sakshigoyal7/credit-card-customers
A manager at the bank is concerned that more and more customers are leaving their credit card services. They would really appreciate it if someone could predict which customers are going to churn, so the bank can proactively reach out to them with better services and turn their decisions around.
This dataset comes from https://leaps.analyttica.com/home.
The dataset consists of roughly 10,000 customers, with attributes such as age, salary, marital status, credit card limit, and card category; there are around 18 usable features.
Only 16.07% of customers have churned, which makes it a bit difficult to train a model to predict churning customers.
Feature Descriptions
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Customer_Age: Demographic variable — Customer’s Age in Years
Gender: Demographic variable — M=Male, F=Female
Dependent_count: Demographic variable — Number of dependents
Education_Level: Demographic variable — Educational Qualification of the account holder (example: high school, college graduate, etc.)
Marital_Status: Demographic variable — Married, Single, Divorced, Unknown
Income_Category: Demographic variable — Annual Income Category of the account holder (Less than $40K, $40K - $60K, $60K - $80K, $80K - $120K, $120K +, Unknown)
Card_Category: Product Variable — Type of Card (Blue, Silver, Gold, Platinum)
Months_on_book: Period of relationship with bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: Total Revolving Balance on the Credit Card
Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio: Average Card Utilization Ratio
Target
Attrition_Flag: Internal event (customer activity) variable — 1 (Attrited Customer) if the account is closed, else 0 (Existing Customer)
About EvalML Library
EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.
Key Functionality
- Automation — Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
- Data Checks — Catches and warns of problems with your data and problem setup before modeling.
- End-to-end — Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
- Model Understanding — Provides tools to understand and introspect on models, to learn how they’ll behave in your problem domain.
- Domain-specific — Includes repository of domain-specific objective functions and an interface to define your own.
To learn more about this library, visit https://pypi.org/project/evalml/
Installing EvalML Library
pip install evalml
Importing necessary libraries
import evalml
import numpy as np
import pandas as pd
Reading the dataset
data = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
data.head()
data.shape

(10127, 23)
There are 10,127 rows and 23 columns in the dataset.
Data Preparation
The first thing we'll do is drop CLIENTNUM from the data, since a unique client identifier will have no correlation with attrition rates. There's clearly some diversity in the types of features, and at first glance it looks like we don't have to worry about any null or missing values; we can verify that just after scanning the non-numeric columns below. But truly clean data seems unlikely with a dataset of this size.
data = data.drop(['CLIENTNUM'], axis=1)

for feature in data.columns:
    if data[feature].dtype not in ['int64', 'float64']:
        print(f'{feature}: {data[feature].unique()}')

Attrition_Flag: ['Existing Customer' 'Attrited Customer']
Gender: ['M' 'F']
Education_Level: ['High School' 'Graduate' 'Uneducated' 'Unknown' 'College' 'Post-Graduate'
'Doctorate']
Marital_Status: ['Married' 'Single' 'Unknown' 'Divorced']
Income_Category: ['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +'
'Unknown']
Card_Category: ['Blue' 'Gold' 'Silver' 'Platinum']
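Before going further, it is worth backing up that claim about missing values with a quick check; all-zero counts here just mean the gaps hide behind sentinel strings like 'Unknown' rather than true NaNs:
# No NaNs anywhere: the 'missing' information lives in 'Unknown' strings instead
print(data.isnull().sum())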
Education_Level, Marital_Status, and Income_Category have Unknown as a value. This is something we’ll have to remember before we get to the model training, since Unknown isn’t an acceptable value for any of the features.
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(16, 28))
sns.set(font_scale=1.6)
cols_ = ["Education_Level", "Marital_Status", "Income_Category"]
for ind, col in enumerate(cols_):
    sns.countplot(x=col, data=data, ax=ax[ind])
Checking to see how prevalent Unknown is relative to the other values. Based on the count plots above, Unknown is never the most common value, but its frequency is high enough that we probably don't want to drop the rows containing it altogether.
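To put numbers on that, a small sketch using the same column list:
# Share of 'Unknown' values in each of the three categorical columns
for col in cols_:
    print(f"{col}: {(data[col] == 'Unknown').mean():.1%} Unknown")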
Dropping irrelevant features from the dataset
The last two columns are precomputed Naive Bayes classifier outputs bundled with the Kaggle dataset (its description recommends ignoring them); since they encode the target, we drop both.
data.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1','Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1, inplace=True)
Correlation Matrix
fig, ax = plt.subplots(figsize=(20, 16))
df_corr = data.corr(method="pearson")
mask = np.zeros_like(np.array(df_corr))
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(df_corr, mask=mask, annot=True)
We’re also going to take a look at the correlation matrix to see if there are any features that are too closely tied to others. It looks like Avg_Open_To_Buy is perfectly correlated with Credit_Limit, so we’re going to drop the latter.
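We can confirm that numerically before dropping anything (open-to-buy is essentially credit limit minus revolving balance, hence the near-perfect correlation):
# Pearson correlation between the two suspect features; expect a value near 1
print(data['Credit_Limit'].corr(data['Avg_Open_To_Buy']))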
data.columns

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
Checking value count of Target feature
data['Attrition_Flag'].value_counts()

Existing Customer    8500
Attrited Customer 1627
Name: Attrition_Flag, dtype: int64
The target feature is imbalanced, so we will use the F1-score as our evaluation metric.
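Normalizing the counts confirms the 16.07% churn rate mentioned at the start:
# 1627 / 10127 ≈ 0.1607, i.e. about 16.07% attrited customers
print(data['Attrition_Flag'].value_counts(normalize=True))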
Encoding the independent features
X = data.copy()
X = X.drop(['Credit_Limit'], axis=1) # dropping Credit Limit since it is highly correlated with Avg_Open_To_Buy
y = X.pop('Attrition_Flag')
X['Income_Category'] = X['Income_Category'].replace({'Less than $40K':0,
'$40K - $60K':1,
'$60K - $80K':2,
'$80K - $120K':3,
'$120K +':4})
X['Card_Category'] = X['Card_Category'].replace({'Blue':0,
'Silver':1,
'Gold':2,
'Platinum':3})
X['Education_Level'] = X['Education_Level'].replace({'Uneducated':0,
'High School':1,
'College':2,
'Graduate':3,
'Post-Graduate':4,
'Doctorate':5})
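Note that 'Unknown' was deliberately left out of the maps above, so it survives the replace; a quick count over the three affected columns shows why the imputation step below is still needed:
# Rows still carrying the 'Unknown' sentinel after the ordinal encoding
for col in ['Education_Level', 'Marital_Status', 'Income_Category']:
    print(col, (X[col] == 'Unknown').sum())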
Encoding the Target feature
y = y.replace({'Existing Customer':0,
'Attrited Customer':1})
Replacing the Unknown values that we saw earlier with the most frequent value in each feature, using EvalML's SimpleImputer.
from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer
def preprocessing(X, y):
    imputer = SimpleImputer(impute_strategy="most_frequent", missing_values="Unknown")
    X = imputer.fit_transform(X, y)
    return X
X = preprocessing(X, y)

from evalml.utils import infer_feature_types
X = infer_feature_types(X, feature_types={'Income_Category': 'categorical', 'Education_Level': 'categorical'})
X
infer_feature_types is used to specify what types certain columns should be.
To learn more about it, visit https://evalml.alteryx.com/en/stable/start.html
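To see which logical types were actually assigned, the Woodwork schema can be printed; this is a hedged sketch, since recent EvalML versions expose it through the ww accessor while older releases returned a DataTable with its own types attribute:
# Inspect the logical types Woodwork inferred (or that we overrode above)
print(X.ww.types)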
Splitting the dataset
Splitting the dataset into 80% train and 20% test.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2)
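A quick sanity check on the split (assuming a recent EvalML version where split_data returns pandas objects; older releases returned Woodwork DataTables):
# Expect roughly an 80/20 split of the 10,127 rows over the 18 remaining features
print(X_train.shape, X_test.shape)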
Initializing AutoMLSearch from EvalML
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary", objective="F1",
                      allowed_model_families=['random_forest', 'xgboost', 'lightgbm'],
                      additional_objectives=['accuracy binary'], max_batches=5)
automl.search()
Search finished after 02:18
Best pipeline: XGBoost Classifier w/ Imputer + One Hot Encoder
Best pipeline F1: 0.896504
Pipelines Review
So a lot just happened; let's review the pipelines that were created and tested. As the search output shows, the best performing pipeline used the XGBoost estimator. We want to learn a little more about it, which can be done with the describe_pipeline function. Notice that the pipeline included a preprocessing step of imputation. In this case it ended up being unnecessary, because of our earlier SimpleImputer and the absence of null values in our numerical features. However, AutoMLSearch comes with the built-in capacity to automatically iterate over the hyperparameters of this preprocessing step as well.
Obtaining rankings of model trained
automl.rankings
Obtaining the complete pipeline of the best model
best_pipeline_ = automl.best_pipeline
automl.describe_pipeline(automl.rankings.iloc[0]["id"])  # row 0 of rankings is the best pipeline
The best classifier turned out to be the XGBoost Classifier.
Predictions on test set
best_pipeline_.fit(X_train, y_train)
predictions = best_pipeline_.predict(X_test)

from evalml.model_understanding.graphs import (
    graph_binary_objective_vs_threshold,
    graph_permutation_importance,
    graph_confusion_matrix
)
graph_binary_objective_vs_threshold(best_pipeline_, X_test, y_test, "F1")
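The curve above shows how F1 moves as the decision threshold changes. The same idea can be sketched by hand with predict_proba and scikit-learn's f1_score, assuming predict_proba returns one column per class with the positive class last:
from sklearn.metrics import f1_score

# Scan candidate thresholds and keep the one that maximizes F1 on the test set
proba = best_pipeline_.predict_proba(X_test).iloc[:, 1]  # P(Attrited)
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_test, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f'Best threshold: {best_t:.2f} (F1 = {max(scores):.3f})')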
graph_permutation_importance(best_pipeline_, X_test, y_test, "F1")
Total_Trans_Ct has the highest permutation importance score, followed by Total_Trans_Amt.
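The same information is available as a table rather than a chart through EvalML's calculate variant (hedged: the exact import path has shifted between EvalML releases):
from evalml.model_understanding import calculate_permutation_importance

# DataFrame of features and their permutation importance, sorted descending
perm_imp = calculate_permutation_importance(best_pipeline_, X_test, y_test, "F1")
print(perm_imp.head())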
graph_confusion_matrix(y_test, predictions)
We get (1685 + 273) = 1958 correct predictions and (52 + 16) = 68 incorrect ones out of 2,026 test rows, i.e. an accuracy of about 0.966, which matches the accuracy reported below.
Final Predictions
from evalml.objectives.standard_metrics import AccuracyBinary, AUC, F1, PrecisionWeighted, Recall
acc = AccuracyBinary()
auc = AUC()
f1 = F1()
pre_w = PrecisionWeighted()
rec = Recall()
print(f"Accuracy (Binary): {acc.score(y_true=y_test, y_predicted=predictions)}")
print(f"Area Under Curve: {auc.score(y_true=y_test, y_predicted=predictions)}")
print(f"F1: {f1.score(y_true=y_test, y_predicted=predictions)}")
print(f"Precision (Weighted): {pre_w.score(y_true=y_test, y_predicted=predictions)}")
print(f"Recall: {rec.score(y_true=y_test, y_predicted=predictions)}")
Accuracy (Binary): 0.9664363277393879
Area Under Curve: 0.9152968841857729
F1: 0.8892508143322476
Precision (Weighted): 0.9659845215313323
Recall: 0.84
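One caveat about the numbers above: the AUC was scored on hard 0/1 predictions, whereas AUC is normally computed on probabilities. A short sketch of the probability-based version using scikit-learn, again assuming the positive class is the last predict_proba column:
from sklearn.metrics import roc_auc_score

# AUC computed on predicted churn probabilities rather than hard labels
proba = best_pipeline_.predict_proba(X_test).iloc[:, 1]
print(f'AUC on probabilities: {roc_auc_score(y_test, proba):.4f}')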
Result
We get an F1-score of 0.89 on the test set, which is pretty good. This model can help the manager plan business strategies more effectively: the bank can proactively reach out to customers likely to churn with better services and turn their decisions around. They can focus on the features with the highest importance (Total_Trans_Ct, Total_Trans_Amt, Total_Revolving_Bal, Total_Relationship_Count).
Hope you liked the analysis!
You can follow me on LinkedIn, GitHub, and Kaggle.
Github Link of this project
https://github.com/ratul442/Bank-Churn-Prediction-using-EvalML
Kaggle Link
https://www.kaggle.com/ratul6/bank-churn-prediction-using-evalml