Online Shopper’s Intention

This dataset contains information about customer behaviour on an online shopping website. It lets us perform marketing analytics and understand the KPIs and metrics that drive purchase intent.

Ratul Ghosh
7 min read · Jun 25, 2021
Cover image source: http://www.trendsbuzzer.com/wp-content/uploads/2017/10/How-Online-Shopping-Is-Taking-Over.jpg

Attribute Information:

The dataset consists of 10 numerical and 8 categorical attributes. The "Revenue" attribute is used as the class label.

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" give the number of pages of each type visited in a session and the total time spent in each page category. These values are derived from the URL information of the pages visited by the user and are updated in real time as the user moves from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features are the Google Analytics metrics measured for each page of the e-commerce site. "Bounce Rate" for a page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other request to the analytics server during that session. "Exit Rate" for a page is, over all pageviews of that page, the percentage that were the last pageview of the session. "Page Value" is the average value of a page that a user visited before completing an e-commerce transaction.

The "Special Day" feature indicates how close the visit is to a special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end with a transaction. Its value takes into account e-commerce dynamics such as the gap between the order date and the delivery date. For Valentine's Day, for example, it is nonzero between February 2 and February 12, reaches its maximum of 1 on February 8, and is zero outside that window unless the date is close to another special day.

The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new), a Boolean flag indicating whether the visit falls on a weekend, and the month of the year.

Data Set Information:

The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session would belong to a different user in a 1-year period, to avoid any tendency towards a specific campaign, special day, user profile, or period.

Importing the required libraries

# import 'Pandas' 
import pandas as pd

# import 'Numpy'
import numpy as np

# import subpackage of Matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# import 'Seaborn'
import seaborn as sns

# to suppress warnings
from warnings import filterwarnings
filterwarnings('ignore')

# display all columns of the dataframe
pd.options.display.max_columns = None

# display all rows of the dataframe
pd.options.display.max_rows = None

# to display the float values upto 6 decimal places
pd.options.display.float_format = '{:.6f}'.format
from sklearn.model_selection import train_test_split

# import functions to perform scaling and normalization
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

# import various functions from sklearn
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize'
plt.rcParams['figure.figsize'] = [15,8]
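
The excerpt does not show the step that loads the data; a minimal sketch, assuming the UCI CSV has been saved locally as 'online_shoppers_intention.csv' (the file name is an assumption):

# load the dataset into a dataframe (file name assumed)
data = pd.read_csv('online_shoppers_intention.csv')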
data.shape

(12330, 18)

The data has 12,330 observations and 18 variables.

Distribution of numeric independent variables

data.drop('Revenue', axis = 1).hist()

# adjust the subplots
plt.tight_layout()
# display the plot
plt.show()

Distribution of dependent variable

df_target = data['Revenue'].copy()
df_target.value_counts()

# plot the countplot of the variable 'Revenue'
sns.countplot(x = df_target)

plt.text(x = 0.95, y = df_target.value_counts()[1] + 1, s = str(round((df_target.value_counts()[1])*100/len(df_target),2)) + '%')
plt.text(x =-0.05, y = df_target.value_counts()[0] +1, s = str(round((df_target.value_counts()[0])*100/len(df_target),2)) + '%')
plt.title('Count Plot for Target Variable (Revenue)', fontsize = 15)
plt.xlabel('Target Variable', fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

The above plot shows that there is an imbalance in the target variable.

Missing Value Treatment

Total = data.isnull().sum().sort_values(ascending=False)          
Percent = (data.isnull().sum()*100/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([Total, Percent], axis = 1, keys = ['Total', 'Percentage of Missing Values'])
missing_data
for cols in ['Administrative','Informational','ProductRelated']:
    data[cols].replace(0, np.nan, inplace = True)

for cols in ['Administrative','Informational','ProductRelated']:
    print('{} null values:'.format(cols), data[cols].isnull().sum(), sep = '\n')
Administrative null values:
5768
Informational null values:
9700
ProductRelated null values:
49

Imputing the null values

for cols in ['Administrative','Informational','ProductRelated']:
    median_value = data[cols].median()
    data[cols] = data[cols].fillna(median_value)

for cols in ['Administrative','Informational','ProductRelated']:
    print('{} null values:'.format(cols), data[cols].isnull().sum(), sep = '\n')

We are left with the page duration, bounce rate and exit rate columns. The duration columns have -1 as their minimum value (time cannot be negative), which should be treated as a null value, and a duration of 0 (time spent cannot be zero) occurs when the corresponding page count was 0, which we imputed earlier. We can therefore convert both into NaN and impute them as well.

For the rates, we can impute the NaN values directly. (We do not need to worry about bounce rates of 0 here: there are many cases where a bounce rate of 0 is legitimate, because the user liked the website and moved on to other pages towards a transaction.)
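
The conversion itself is not shown in the excerpt; a minimal sketch, assuming -1 and 0 in the duration columns should both become NaN as described above:

# treat -1 (invalid) and 0 (page type was 0) durations as missing
for cols in ['Administrative_Duration','Informational_Duration','ProductRelated_Duration']:
    data[cols].replace([-1, 0], np.nan, inplace = True)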

for cols in ['Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates']:
    mean_value = data[cols].mean()
    data[cols] = data[cols].fillna(mean_value)

for cols in ['Administrative_Duration','Informational_Duration','ProductRelated_Duration','BounceRates','ExitRates']:
    print('{} null values:'.format(cols), data[cols].isnull().sum(), sep = '\n')
Administrative_Duration null values:
0
Informational_Duration null values:
0
ProductRelated_Duration null values:
0
BounceRates null values:
0
ExitRates null values:
0

Checking for Skewness and Kurtosis

The skewness and kurtosis of several features are very high, most likely because they contain outliers.
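
The excerpt uses num_cols below without defining it; a minimal sketch of the check, assuming num_cols holds the numeric feature columns (the list is an assumption, matching the columns transformed later):

# numeric feature columns (assumed list)
num_cols = ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
            'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay',
            'OperatingSystems', 'Browser', 'Region', 'TrafficType']

# skewness and kurtosis of each numeric column
for i in num_cols:
    print('column {}: skewness = {:.2f}, kurtosis = {:.2f}'.format(i, data[i].skew(), data[i].kurt()))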

Log Transformation

for i in num_cols:
    print('skewness of column {}'.format(i), ' ', np.log(data[i] + 1).skew())

Square root Transformation

for i in num_cols:
    print('skewness of column {}'.format(i), ' ', np.sqrt(data[i]).skew())

Applying Square root Transformation

for col in ['Administrative', 'Administrative_Duration', 'Informational',
            'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
            'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay',
            'OperatingSystems', 'Browser', 'Region', 'TrafficType']:
    data[col] = np.sqrt(data[col])

Treating Outliers of Numerical features

numerical_features = ['BounceRates','ExitRates','Administrative_Duration','ProductRelated_Duration']

for cols in numerical_features:
    Q1 = data[cols].quantile(0.25)
    Q3 = data[cols].quantile(0.75)
    IQR = Q3 - Q1

    df_1 = (data[cols] >= Q1 - 1.5 * IQR) & (data[cols] <= Q3 + 1.5 * IQR)
    data = data.loc[df_1]

Scaling the features

Cat_col = ['Weekend','Revenue','Administrative','Informational','ProductRelated','SpecialDay',
'OperatingSystems','Browser','Region','Month','TrafficType','VisitorType']


feature_scale = [feature for feature in data.columns if feature not in Cat_col]


scaler = StandardScaler()
scaler.fit(data[feature_scale])
scaled_data = pd.concat([data[Cat_col].reset_index(drop=True),
                         pd.DataFrame(scaler.transform(data[feature_scale]), columns = feature_scale)],
                        axis = 1)
scaled_data.head()

Label Encoding Features

features = ['Month','VisitorType','Weekend','Revenue']
label_encoder = LabelEncoder()

label_data = scaled_data.copy()
for col in features:
    label_data[col] = label_encoder.fit_transform(label_data[col])
label_data.head()

Splitting the dependent and independent variables

X=label_data.drop(['Revenue'],axis=1)
y=label_data.Revenue

X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,random_state = 42)
print("Input Training:",X_train.shape)
print("Input Test:",X_test.shape)
print("Output Training:",y_train.shape)
print("Output Test:",y_test.shape)
Input Training: (7996, 17)
Input Test: (2000, 17)
Output Training: (7996,)
Output Test: (2000,)

Applying Random Forest Classifier

rf_classification = RandomForestClassifier(n_estimators = 10, random_state = 42)
rf_model = rf_classification.fit(X_train, y_train)
test_report = get_test_report(rf_model)
print(test_report)
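
The get_test_report helper used above is not defined in the excerpt; a minimal sketch of what it might look like, assuming it simply returns the classification report on the held-out test set:

# hypothetical helper: evaluate a fitted model on the test set
def get_test_report(model):
    y_pred = model.predict(X_test)
    return classification_report(y_test, y_pred)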

Feature Importance plot

PageValues gets the highest importance score.
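
The plot itself is not reproduced here; a minimal sketch of how it can be generated from the fitted model:

# plot feature importances from the fitted random forest
importances = pd.Series(rf_model.feature_importances_, index = X_train.columns)
importances.sort_values().plot(kind = 'barh')
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.show()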

Confusion Matrix
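
The confusion matrix figure is not reproduced here; a sketch of how it can be computed and displayed as a heatmap:

# confusion matrix on the test set
y_pred = rf_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = 'd', cmap = 'Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()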

ROC Curve
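
Likewise, the ROC curve can be drawn from the predicted probabilities:

# ROC curve and AUC on the test set
y_prob = rf_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label = 'AUC = {:.3f}'.format(roc_auc_score(y_test, y_prob)))
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()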

Under-sampling using NearMiss Technique
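
The resampling step that produces X_train_miss and y_train_miss is not shown in the excerpt; a minimal sketch, assuming the imbalanced-learn package is available:

# under-sample the majority class with NearMiss (imbalanced-learn assumed installed)
from imblearn.under_sampling import NearMiss

nm = NearMiss()
X_train_miss, y_train_miss = nm.fit_resample(X_train, y_train)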

rf_classification = RandomForestClassifier(n_estimators = 10, random_state = 42)
rf_model_mi = rf_classification.fit(X_train_miss, y_train_miss)
test_report = get_test_report(rf_model_mi)
print(test_report)

Results

With the Random Forest classifier we get an accuracy of 89%; with NearMiss under-sampling the accuracy drops to 72%, with a precision of 0.33 and a recall of 0.46 for class 1.

Hope you liked the analysis!

You can follow me on LinkedIn, GitHub and Kaggle.

GitHub Link

https://github.com/ratul442

Kaggle Link of this project

Predicting Online Shoppers Intention | Kaggle

LinkedIn Link

https://www.linkedin.com/in/ratul-ghosh-8048a8148/
