Bank Churn Prediction using EvalML Library
About this Dataset
Dataset Link — https://www.kaggle.com/sakshigoyal7/credit-card-customers
A manager at the bank is concerned that more and more customers are leaving their credit card services. They would really appreciate it if someone could predict which customers are going to churn, so the bank can proactively reach out to them with better services and turn their decisions around.
This dataset comes from https://leaps.analyttica.com/home.
The dataset consists of roughly 10,000 customers, with attributes such as age, salary, marital status, credit card limit, and card category; there are around 18 usable features.
Only 16.07% of customers have churned, which makes it a bit difficult to train a model to predict churning customers.
Feature Descriptions
CLIENTNUM: Client number. Unique identifier for the customer holding the account
Customer_Age: Demographic variable — Customer’s Age in Years
Gender: Demographic variable — M=Male, F=Female
Dependent_count: Demographic variable — Number of dependents
Education_Level: Demographic variable — Educational Qualification of the account holder (example: high school, college graduate, etc.)
Marital_Status: Demographic variable — Married, Single, Divorced, Unknown
Income_Category: Demographic variable — Annual Income Category of the account holder (Less than $40K, $40K - $60K, $60K - $80K, $80K - $120K, $120K +, Unknown)
Card_Category: Product Variable — Type of Card (Blue, Silver, Gold, Platinum)
Months_on_book: Period of relationship with bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: Total Revolving Balance on the Credit Card
Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio: Average Card Utilization Ratio
Target
Attrition_Flag: Internal event (customer activity) variable — 1 (Attrited Customer) if the account is closed, else 0 (Existing Customer)
About EvalML Library
EvalML is an AutoML library which builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.
Key Functionality
- Automation — Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
- Data Checks — Catches and warns of problems with your data and problem setup before modeling.
- End-to-end — Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
- Model Understanding — Provides tools to understand and introspect on models, to learn how they’ll behave in your problem domain.
- Domain-specific — Includes repository of domain-specific objective functions and an interface to define your own.
To learn more about this library, visit https://pypi.org/project/evalml/
Installing EvalML Library
pip install evalml
Importing necessary libraries
import evalml
import numpy as np
import pandas as pd
Reading the dataset
data = pd.read_csv('../input/credit-card-customers/BankChurners.csv')
data.head()
data.shape

(10127, 23)
There are 10,127 rows and 23 columns in the dataset.
Data Preparation
The first thing we'll do is drop CLIENTNUM from the data, since a unique client identifier will have no correlation with attrition rates. There's clearly some diversity in the types of features, and at first glance it looks like we don't have to worry about any null or missing values; we can verify that just after scanning the non-numeric columns below. But truly clean data seems unlikely with a dataset of this size.
data = data.drop(['CLIENTNUM'], axis=1)

for feature in data.columns:
    if data[feature].dtype not in ['int64', 'float64']:
        print(f'{feature}: {data[feature].unique()}')

Attrition_Flag: ['Existing Customer' 'Attrited Customer']
Gender: ['M' 'F']
Education_Level: ['High School' 'Graduate' 'Uneducated' 'Unknown' 'College' 'Post-Graduate'
'Doctorate']
Marital_Status: ['Married' 'Single' 'Unknown' 'Divorced']
Income_Category: ['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +'
'Unknown']
Card_Category: ['Blue' 'Gold' 'Silver' 'Platinum']
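Before going further, it is worth backing up that claim about missing values with a quick check; all-zero counts here just mean the gaps hide behind sentinel strings like 'Unknown' rather than true NaNs:
# No NaNs anywhere: the 'missing' information lives in 'Unknown' strings instead
print(data.isnull().sum())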
Education_Level, Marital_Status, and Income_Category have Unknown as a value. This is something we’ll have to remember before we get to the model training, since Unknown isn’t an acceptable value for any of the features.
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(16, 28))
sns.set(font_scale=1.6)
cols_ = ["Education_Level", "Marital_Status", "Income_Category"]
for ind, col in enumerate(cols_):
    sns.countplot(x=col, data=data, ax=ax[ind])
Checking to see how prevalent Unknown is relative to the other values. Based on the count plots above, Unknown is never the most common value, but its frequency is high enough that we probably don't want to drop the rows containing it altogether.
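To put numbers on that, a small sketch using the same column list:
# Share of 'Unknown' values in each of the three categorical columns
for col in cols_:
    print(f"{col}: {(data[col] == 'Unknown').mean():.1%} Unknown")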
Dropping irrelevant features from the dataset
The last two columns are precomputed Naive Bayes classifier outputs bundled with the Kaggle dataset (its description recommends ignoring them); since they encode the target, we drop both.
data.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1','Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1, inplace=True)
Correlation Matrix
fig, ax = plt.subplots(figsize=(20, 16))
df_corr = data.corr(method="pearson")
mask = np.zeros_like(np.array(df_corr))
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(df_corr, mask=mask, annot=True)
We’re also going to take a look at the correlation matrix to see if there are any features that are too closely tied to others. It looks like Avg_Open_To_Buy is perfectly correlated with Credit_Limit, so we’re going to drop the latter.
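We can confirm that numerically before dropping anything (open-to-buy is essentially credit limit minus revolving balance, hence the near-perfect correlation):
# Pearson correlation between the two suspect features; expect a value near 1
print(data['Credit_Limit'].corr(data['Avg_Open_To_Buy']))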
data.columns

Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
Checking value count of Target feature
data['Attrition_Flag'].value_counts()

Existing Customer    8500
Attrited Customer 1627
Name: Attrition_Flag, dtype: int64
The target feature is imbalanced, so we will use the F1-score as our evaluation metric.
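Normalizing the counts confirms the 16.07% churn rate mentioned at the start:
# 1627 / 10127 ≈ 0.1607, i.e. about 16.07% attrited customers
print(data['Attrition_Flag'].value_counts(normalize=True))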
Encoding the independent features
X = data.copy()
X = X.drop(['Credit_Limit'], axis=1) # dropping Credit Limit since it is highly correlated with Avg_Open_To_Buy
y = X.pop('Attrition_Flag')
X['Income_Category'] = X['Income_Category'].replace({'Less than $40K':0,
'$40K - $60K':1,
'$60K - $80K':2,
'$80K - $120K':3,
'$120K +':4})
X['Card_Category'] = X['Card_Category'].replace({'Blue':0,
'Silver':1,
'Gold':2,
'Platinum':3})
X['Education_Level'] = X['Education_Level'].replace({'Uneducated':0,
'High School':1,
'College':2,
'Graduate':3,
'Post-Graduate':4,
'Doctorate':5})
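Note that 'Unknown' was deliberately left out of the maps above, so it survives the replace; a quick count over the three affected columns shows why the imputation step below is still needed:
# Rows still carrying the 'Unknown' sentinel after the ordinal encoding
for col in ['Education_Level', 'Marital_Status', 'Income_Category']:
    print(col, (X[col] == 'Unknown').sum())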
Encoding the Target feature
y = y.replace({'Existing Customer':0,
'Attrited Customer':1})
Replacing the Unknown values that we saw earlier with the most frequent value in each feature, using EvalML's SimpleImputer.
from evalml.pipelines.components.transformers.imputers.simple_imputer import SimpleImputer
def preprocessing(X, y):
    imputer = SimpleImputer(impute_strategy="most_frequent", missing_values="Unknown")
    X = imputer.fit_transform(X, y)
    return X
X = preprocessing(X, y)

from evalml.utils import infer_feature_types
X = infer_feature_types(X, feature_types={'Income_Category': 'categorical', 'Education_Level': 'categorical'})
X
infer_feature_types is used to specify what types certain columns should be.
To learn more about it, visit https://evalml.alteryx.com/en/stable/start.html
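To see which logical types were actually assigned, the Woodwork schema can be printed; this is a hedged sketch, since recent EvalML versions expose it through the ww accessor while older releases returned a DataTable with its own types attribute:
# Inspect the logical types Woodwork inferred (or that we overrode above)
print(X.ww.types)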
Splitting the dataset
Splitting the dataset into 80% train and 20% test.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary', test_size=0.2)
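A quick sanity check on the split (assuming a recent EvalML version where split_data returns pandas objects; older releases returned Woodwork DataTables):
# Expect roughly an 80/20 split of the 10,127 rows over the 18 remaining features
print(X_train.shape, X_test.shape)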
Initializing AutoMLSearch from EvalML
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary", objective="F1",
                      allowed_model_families=['random_forest', 'xgboost', 'lightgbm'],
                      additional_objectives=['accuracy binary'], max_batches=5)
automl.search()
Search finished after 02:18
Best pipeline: XGBoost Classifier w/ Imputer + One Hot Encoder
Best pipeline F1: 0.896504
Pipelines Review
So a lot just happened; let's review the pipelines that were created and tested. As the search output shows, the best performing pipeline used the XGBoost estimator. We want to learn a little more about it, which can be done with the describe_pipeline function. Notice that the pipeline included a preprocessing step of imputation. In this case it ended up being unnecessary, because of our earlier SimpleImputer and the absence of null values in our numerical features. However, AutoMLSearch comes with the built-in capacity to automatically iterate over the hyperparameters of this preprocessing step as well.
Obtaining rankings of model trained
automl.rankings
Obtaining the complete pipeline of the best model
best_pipeline_ = automl.best_pipeline
automl.describe_pipeline(automl.rankings.iloc[0]["id"])  # row 0 of rankings is the best pipeline
The best classifier turned out to be the XGBoost Classifier.
Predictions on test set
best_pipeline_.fit(X_train, y_train)
predictions = best_pipeline_.predict(X_test)

from evalml.model_understanding.graphs import (
    graph_binary_objective_vs_threshold,
    graph_permutation_importance,
    graph_confusion_matrix
)
graph_binary_objective_vs_threshold(best_pipeline_, X_test, y_test, "F1")
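The curve above shows how F1 moves as the decision threshold changes. The same idea can be sketched by hand with predict_proba and scikit-learn's f1_score, assuming predict_proba returns one column per class with the positive class last:
from sklearn.metrics import f1_score

# Scan candidate thresholds and keep the one that maximizes F1 on the test set
proba = best_pipeline_.predict_proba(X_test).iloc[:, 1]  # P(Attrited)
thresholds = np.linspace(0.1, 0.9, 17)
scores = [f1_score(y_test, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f'Best threshold: {best_t:.2f} (F1 = {max(scores):.3f})')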
graph_permutation_importance(best_pipeline_, X_test, y_test, "F1")
Total_Trans_Ct has the highest permutation importance score, followed by Total_Trans_Amt.
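The same information is available as a table rather than a chart through EvalML's calculate variant (hedged: the exact import path has shifted between EvalML releases):
from evalml.model_understanding import calculate_permutation_importance

# DataFrame of features and their permutation importance, sorted descending
perm_imp = calculate_permutation_importance(best_pipeline_, X_test, y_test, "F1")
print(perm_imp.head())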
graph_confusion_matrix(y_test, predictions)
We get (1685 + 273) = 1958 correct predictions and (52 + 16) = 68 incorrect ones out of 2,026 test rows, i.e. an accuracy of about 0.966, which matches the accuracy reported below.
Final Predictions
from evalml.objectives.standard_metrics import AccuracyBinary, AUC, F1, PrecisionWeighted, Recall
acc = AccuracyBinary()
auc = AUC()
f1 = F1()
pre_w = PrecisionWeighted()
rec = Recall()
print(f"Accuracy (Binary): {acc.score(y_true=y_test, y_predicted=predictions)}")
print(f"Area Under Curve: {auc.score(y_true=y_test, y_predicted=predictions)}")
print(f"F1: {f1.score(y_true=y_test, y_predicted=predictions)}")
print(f"Precision (Weighted): {pre_w.score(y_true=y_test, y_predicted=predictions)}")
print(f"Recall: {rec.score(y_true=y_test, y_predicted=predictions)}")
Accuracy (Binary): 0.9664363277393879
Area Under Curve: 0.9152968841857729
F1: 0.8892508143322476
Precision (Weighted): 0.9659845215313323
Recall: 0.84
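One caveat about the numbers above: the AUC was scored on hard 0/1 predictions, whereas AUC is normally computed on probabilities. A short sketch of the probability-based version using scikit-learn, again assuming the positive class is the last predict_proba column:
from sklearn.metrics import roc_auc_score

# AUC computed on predicted churn probabilities rather than hard labels
proba = best_pipeline_.predict_proba(X_test).iloc[:, 1]
print(f'AUC on probabilities: {roc_auc_score(y_test, proba):.4f}')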
Result
We get an F1-score of 0.89 on the test set, which is pretty good. This model can help the manager plan business strategies more effectively: the bank can proactively reach out to customers likely to churn with better services and turn their decisions around. They can focus on the features with the highest importance (Total_Trans_Ct, Total_Trans_Amt, Total_Revolving_Bal, Total_Relationship_Count).
Hope you liked the analysis!
You can follow me on LinkedIn, GitHub, and Kaggle.
Github Link of this project
https://github.com/ratul442/Bank-Churn-Prediction-using-EvalML
Kaggle Link
https://www.kaggle.com/ratul6/bank-churn-prediction-using-evalml