Customer Cart Abandonment Analysis using Machine Learning

Ratul Ghosh
6 min read · Jun 11, 2021

Context

Abandonment is an e-commerce term for a visitor who leaves a web page before completing the desired action. A common example is shopping cart abandonment: a visitor adds items to their online shopping cart but exits without completing the purchase.

About the dataset

The dataset consists of 13 attributes:

1. ID: The session ID of the customer.

2. Is_Product_Details_viewed: Whether or not the customer viewed the product details.

3. Session_Activity_Count: How many different pages the customer visited during the session.

4. No_Items_Added_InCart: Number of items added to the cart.

5. No_Items_Removed_FromCart: Number of items removed from the cart.

6. No_Cart_Viewed: How many times the customer visited the cart page.

7. No_Checkout_Confirmed: How many times the customer successfully confirmed checkout.

8. No_Checkout_Initiated: How many times the customer initiated checkout, whether successful or not.

9. No_Cart_Items_Viewed: How many times the customer viewed a product from the cart.

10. No_Customer_Login: Number of times the customer logged in.

11. No_Page_Viewed: Number of pages viewed by the customer.

12. Customer_Segment_Type: The category the customer falls under: 0 for Target Customer, 1 for Loyal Customer, and 2 for Untargeted Customer.

13. Cart_Abandoned: Whether or not the customer abandoned the cart. This is the target variable we need to predict.

The dataset is real; it was obtained from an organization for research purposes and contains 4284 rows and 13 columns.

Business Understanding

When a customer visits the webpage, we want to predict whether they will abandon their cart. Customers likely to abandon can then be given suitable offers, decreasing the cart abandonment ratio and in turn increasing profits.

Import Libraries

import pickle as pkl
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from scipy.special import boxcox1p
%matplotlib inline

Data Understanding

Reading the Dataset

dataset = pd.read_csv('data_cart_abandonment.csv')
dataset.head()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4284 entries, 0 to 4283
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   ID                         4284 non-null   object
 1   Is_Product_Details_viewed  4284 non-null   object
 2   Session_Activity_Count     4284 non-null   int64
 3   No_Items_Added_InCart      4275 non-null   float64
 4   No_Items_Removed_FromCart  4284 non-null   int64
 5   No_Cart_Viewed             4275 non-null   float64
 6   No_Checkout_Confirmed      4284 non-null   int64
 7   No_Checkout_Initiated      4284 non-null   int64
 8   No_Cart_Items_Viewed       4284 non-null   int64
 9   No_Customer_Login          4284 non-null   int64
 10  No_Page_Viewed             4284 non-null   int64
 11  Customer_Segment_Type      4284 non-null   int64
 12  Cart_Abandoned             4284 non-null   int64
dtypes: float64(2), int64(9), object(2)
memory usage: 435.2+ KB

Countplot of Cart Abandoned

sns.countplot(x=dataset.Cart_Abandoned)
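
The countplot shows that the two classes are imbalanced, which is why SMOTE is applied later. To quantify the imbalance (a small addition, not in the original post):

# proportion of sessions in each Cart_Abandoned class
print(dataset.Cart_Abandoned.value_counts(normalize=True))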

Exploratory Data Analysis

Univariate Analysis

num = dataset.select_dtypes(include=["float64", "int64"])
cat = dataset.select_dtypes(include=["object", "category"]).drop(["ID"], axis=1)
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 8))
for i, j in zip(cat.columns.tolist(), ax.flatten()):
    sns.countplot(x=cat[i], ax=j)

Bivariate Analysis

For numerical attributes: boxplots of each numerical attribute vs. Cart_Abandoned

fig, ax = plt.subplots(2, 6, figsize=(20, 10))  # 2x6 grid fits all 11 numerical columns
for var, subplot in zip(num.columns.tolist(), ax.flatten()):
    # Cart_Abandoned is an int64 column, so it lives in num, not cat
    sns.boxplot(x=num["Cart_Abandoned"], y=num[var], ax=subplot)

For categorical attributes: Is_Product_Details_viewed vs. Cart_Abandoned

sns.countplot(x=dataset.Is_Product_Details_viewed,hue=dataset.Cart_Abandoned)

Correlation Plot of Independent Attributes

corr = num.corr()
sns.heatmap(corr)
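
The bare heatmap shows only colors; annotating each cell with its coefficient makes it easier to read (an optional tweak, not in the original post):

# annotate each cell with the correlation coefficient, rounded to 2 decimals
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")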

Data Preparation

data = dataset.copy()  # work on a copy; the cells below refer to `data`
data.isna().sum()

ID                           0
Is_Product_Details_viewed    0
Session_Activity_Count       0
No_Items_Added_InCart        9
No_Items_Removed_FromCart    0
No_Cart_Viewed               9
No_Checkout_Confirmed        0
No_Checkout_Initiated        0
No_Cart_Items_Viewed         0
No_Customer_Login            0
No_Page_Viewed               0
Customer_Segment_Type        0
Cart_Abandoned               0
dtype: int64
No_Items_Added_InCart and No_Cart_Viewed have missing values.

Imputing missing values with the mean, as these are continuous features

data['No_Items_Added_InCart'] = data['No_Items_Added_InCart'].fillna(data['No_Items_Added_InCart'].mean())
data['No_Cart_Viewed'] = data['No_Cart_Viewed'].fillna(data['No_Cart_Viewed'].mean())
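
SimpleImputer is imported at the top but never used; the same imputation can be written with it (an equivalent sketch for the two columns above):

cols = ['No_Items_Added_InCart', 'No_Cart_Viewed']
imputer = SimpleImputer(strategy='mean')  # replaces NaNs with each column's mean
data[cols] = imputer.fit_transform(data[cols])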

Taking Care of Outliers by Normalizing the Data: MinMax Scaling

num = data.select_dtypes(include=["int64", "float64"])  # 11 numeric columns, including the two imputed float64 ones
cat = data.select_dtypes(include=["object", "category"]).drop(["ID"], axis=1)
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(num)
x_scaled
df_scaled = pd.DataFrame(x_scaled, columns=num.columns)
df_scaled.head(3)
df_scaled.hist(bins=15, figsize=(20, 6), layout=(2, 6));  # 2x6 grid fits all 11 columns
fig, ax = plt.subplots(2, 6, figsize=(20, 10))
for var, subplot in zip(df_scaled.columns.tolist(), ax.flatten()):
    sns.boxplot(y=df_scaled[var], ax=subplot)

Taking Care of Outliers by Normalizing the Data: Box-Cox Normalization
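
The original post never shows the transform that produces df_scaled_boxcox, only the plots of it. A minimal sketch that would produce it, using scipy's boxcox1p (imported above) with an assumed λ of 0.15:

lam = 0.15  # assumed lambda; the post does not state the value actually used
df_scaled_boxcox = boxcox1p(df_scaled, lam)  # elementwise Box-Cox transform of (1 + x)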

fig, ax = plt.subplots(2, 6, figsize=(20, 10))
for var, subplot in zip(df_scaled_boxcox.columns.tolist(), ax.flatten()):
    sns.boxplot(y=df_scaled_boxcox[var], ax=subplot)
df_scaled_boxcox.hist(bins=15, figsize=(20, 6), layout=(2, 6));
# re-attach the ID column and move it to the front
df_scaled_boxcox["ID"] = data.ID
df_scaled_boxcox.set_index('ID', inplace=True)
df_scaled_boxcox.reset_index(inplace=True)
df_final = df_scaled_boxcox.join(cat)
df_final.head(3)
# encode Yes/No as 1/0 and mark the column as categorical
df_final.Is_Product_Details_viewed = df_final.Is_Product_Details_viewed.replace({"Yes": 1, "No": 0})
df_final.head(3)
df_final.Is_Product_Details_viewed = pd.Categorical(df_final.Is_Product_Details_viewed)

Feature Selection

Applying Recursive Feature Elimination

X = df_final.iloc[:, 1:12]
X.shape
y = df_final["Cart_Abandoned"]
y.name
lr = LogisticRegression()
lr.fit(X, y)
# recursively eliminate features one at a time until 7 remain
rfe = RFE(lr, n_features_to_select=7, verbose=3)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Num Features: 7
Selected Features: [ True False False True True True False True True True False]
Feature Ranking: [1 2 4 1 1 1 5 1 1 1 3]
l = [i for i, x in enumerate(fit.support_) if x]
X.columns

Index(['Session_Activity_Count', 'No_Items_Added_InCart',
       'No_Items_Removed_FromCart', 'No_Cart_Viewed', 'No_Checkout_Confirmed',
       'No_Checkout_Initiated ', 'No_Cart_Items_Viewed', 'No_Customer_Login',
       'No_Page_Viewed', 'Is_Product_Details_viewed', 'Customer_Segment_Type'],
      dtype='object')

feature_selected = [X.columns[i] for i in l]
feature_selected
['Session_Activity_Count',
'No_Cart_Viewed',
'No_Checkout_Confirmed',
'No_Checkout_Initiated ',
'No_Customer_Login',
'No_Page_Viewed',
'Is_Product_Details_viewed']

Random Forest Classifier

clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)

# Train the classifier
clf.fit(X, y)
feature_weightage_dict = dict()
# collect the name and Gini importance of each feature
for feature in zip(X.columns, clf.feature_importances_):
    feature_weightage_dict.update({feature[0]: feature[1]})
feature_weightage_dict

{'Customer_Segment_Type': 0.0059418012185472185,
 'Is_Product_Details_viewed': 0.009764913736069736,
 'No_Cart_Items_Viewed': 0.018703955749515146,
 'No_Cart_Viewed': 0.017810335296951977,
 'No_Checkout_Confirmed': 0.7341868777064404,
 'No_Checkout_Initiated ': 0.0441808380652188,
 'No_Customer_Login': 0.05604221107820115,
 'No_Items_Added_InCart': 0.025547984898027527,
 'No_Items_Removed_FromCart': 0.010172300692196705,
 'No_Page_Viewed': 0.027120211503535017,
 'Session_Activity_Count': 0.05052857005529642}
sorted_feature_weightage_dict = sorted(feature_weightage_dict.items(), key=lambda kv: kv[1], reverse = True)
sorted_feature_weightage_dict
[('No_Checkout_Confirmed', 0.7341868777064404),
('No_Customer_Login', 0.05604221107820115),
('Session_Activity_Count', 0.05052857005529642),
('No_Checkout_Initiated ', 0.0441808380652188),
('No_Page_Viewed', 0.027120211503535017),
('No_Items_Added_InCart', 0.025547984898027527),
('No_Cart_Items_Viewed', 0.018703955749515146),
('No_Cart_Viewed', 0.017810335296951977),
('No_Items_Removed_FromCart', 0.010172300692196705),
('Is_Product_Details_viewed', 0.009764913736069736),
('Customer_Segment_Type', 0.0059418012185472185)]
df_final.columns

Index(['ID', 'Session_Activity_Count', 'No_Items_Added_InCart',
       'No_Items_Removed_FromCart', 'No_Cart_Viewed', 'No_Checkout_Confirmed',
       'No_Checkout_Initiated ', 'No_Cart_Items_Viewed', 'No_Customer_Login',
       'No_Page_Viewed', 'Is_Product_Details_viewed', 'Customer_Segment_Type',
       'Cart_Abandoned'],
      dtype='object')

Comparing the two models, we can see that No_Items_Added_InCart, No_Checkout_Confirmed, No_Checkout_Initiated, No_Customer_Login, and No_Page_Viewed have high feature importance, so we will train our model on these five features only.

Train-Validation Split (60% Train set and 40% Validation set)

X = df_final.iloc[:, [5, 6, 8, 9, 2]]  # No_Checkout_Confirmed, No_Checkout_Initiated, No_Customer_Login, No_Page_Viewed, No_Items_Added_InCart
y = df_final.loc[:, ["Cart_Abandoned"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=0)
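
Selecting by name is less brittle than positional iloc; an equivalent sketch (note that 'No_Checkout_Initiated ' carries a trailing space in this dataset's column names):

selected = ['No_Checkout_Confirmed', 'No_Checkout_Initiated ',
            'No_Customer_Login', 'No_Page_Viewed', 'No_Items_Added_InCart']
X = df_final[selected]  # same columns as the iloc call above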

Over-Sampling Using SMOTE

sm = SMOTE(random_state=2, k_neighbors=5)
# fit_sample was renamed fit_resample in recent imbalanced-learn releases
X_train, y_train = sm.fit_resample(X_train, y_train)
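
To verify that SMOTE balanced the training set, compare the class counts after resampling (a small addition, not in the original post):

# works whether fit_resample returns a DataFrame or a numpy array
print(pd.Series(np.ravel(y_train)).value_counts())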

Train-Validation Split after SMOTE

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_train, y_train, test_size=0.40, random_state=0)

Model Building And Prediction

lr1 = LogisticRegression()
lr1.fit(X_train_new,y_train_new)

y_pred_new = lr1.predict(X_test_new) #For SMOTE validation samples
y_pred=lr1.predict(X_test) #For actual validation samples

Evaluation

print(" accuracy is %2.3f" % accuracy_score(y_test_new, y_pred_new))
print(" Kappa score is %f" %cohen_kappa_score(y_test_new, y_pred_new))

accuracy is 0.988
Kappa score is 0.976122

print(" accuracy is %2.3f" % accuracy_score(y_test, y_pred))
print(" Kappa is %f" %cohen_kappa_score(y_test, y_pred))

accuracy is 0.984
Kappa score is 0.936154
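
confusion_matrix is imported at the top but never used; printing it shows the kinds of errors hiding behind the accuracy number (a small addition, not in the original post):

# rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))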

Dump model to pickle file

lr1.predict(X_test)

array([1, 1, 1, ..., 0, 1, 1])

pkl_out = open("train_classifier", "wb")
pkl.dump(lr1, pkl_out)
pkl_out.close()  # close the handle so the file is fully flushed to disk
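
To use the saved model later, load it back the same way (a sketch; "train_classifier" is the file name used in the dump above):

with open("train_classifier", "rb") as pkl_in:
    model = pkl.load(pkl_in)
model.predict(X_test)  # reproduces the predictions above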

Conclusion

With this analysis we can predict whether a customer is likely to abandon their cart in the future, so that we can offer those customers suitable incentives to help them complete checkout successfully.

You can follow me on LinkedIn, GitHub, and Kaggle.

Github Link

https://github.com/ratul442/Customer-Cart-Abandonment-Analysis

Kaggle Link

https://www.kaggle.com/ratul6/code

Linkedin Link

https://www.linkedin.com/in/ratul-ghosh-8048a8148/
