Ad Click Prediction

To predict whether customer will click Ad and make a purchase

Ratul Ghosh
5 min readJul 6, 2021

Context

In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

Description of attributes

Inputs

1. User ID - Customer Unique Id
2. Gender - Gender of a customer - M/F
3. Age - Age of a customer
4. EstimatedSalary - Estimated salary of a customer

Output

5. Purchased - Whether they purchased or not after Ad click 1/0

Evaluation

As this is a classification problem, we will use the classification metrics for evaluating the model.

Reading the Dataset

df = pd.read_csv('Social_Network_Ads.csv')

Count of User ID

len(df['User ID'].unique())

There are 400 user id in the dataset. As User ID is based on the customer ID and is unique by customer, we will drop the User ID.

Countplot of Target feature

plt.figure(figsize=(20,10))
plt.title('Value count of Labels')
sns.countplot(data=df, x='Purchased');

We can observe that the target feature is balanced.

Countplot of Gender

plt.figure(figsize=(20,10))
plt.title('Value count of gender')
sns.countplot(data=df, x='Gender');

Female count is more than male count.

Value count of gender who Purchase or not

plt.figure(figsize=(20,10))
plt.title('Value count of gender who Purchase or not')
sns.countplot(data=df, x='Gender', hue='Purchased');

Most of the male customers are not purchasing after clicking the ad while females have a higher probability of purchasing as compared to males.

Histogram of Age

plt.figure(figsize=(20,10))
plt.title('Histogram of age')
sns.histplot(data=df, x='Age', bins=25, kde=True);

Most of the orders are coming from age range of 34 to 48.

Histogram of Estimated Salary

plt.figure(figsize=(20,10))
plt.title('Histogram of EstimatedSalary')
sns.histplot(data=df, x='EstimatedSalary', bins=25, kde=True);

Most of the orders are coming from salary range between 6000 and 9000.

Plot of Age vs Estimated Salary

plt.figure(figsize=(20,10))
plt.title('Plot of Age vs EstimatedSalary')
sns.boxplot(data=df, x='Age', y='EstimatedSalary');

People in late 40’s and early 50’s have higher estimated salary as they are more experienced people.

Plot of Age vs Estimated Salary vs Purchased or not

plt.figure(figsize=(20,10))
plt.title('Plot of Age vs Estimated Salary vs Purchased or not')
sns.scatterplot(data=df, x='Age', y='EstimatedSalary', hue='Purchased', s=150, alpha=0.5);

People having age more than 30 to 60 have higher chance of buying the products as they are having decent estimated salary, but people below 30 age might be students or unemployed people who are just clicking the ad and not purchasing in return.

Heatmap of Pearson correlation

plt.figure(figsize=(20,20))
plt.title('Heatmap of Pearson corrlation')
sns.heatmap(data=(pd.get_dummies(df)).corr(),annot=True);

Age and Estimated Salary are significant features in contributing towards the target feature Purchased.

Modelling

X = df.drop('Purchased', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = df['Purchased']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Function for fitting the model and returning the scores

def fit_and_score(models, X_train, X_test, y_train, y_test):
np.random.seed(42)

model_scores = {}

for name, model in models.items():
model.fit(X_train,y_train)
model_scores[name] = model.score(X_test,y_test)

model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
model_scores = model_scores.sort_values('Score')

return model_scores

Models

models = {'LogisticRegression': LogisticRegression(max_iter=10000),
'KNeighborsClassifier': KNeighborsClassifier(),
'SVC': SVC(),
'DecisionTreeClassifier': DecisionTreeClassifier(),
'RandomForestClassifier': RandomForestClassifier(),
'AdaBoostClassifier': AdaBoostClassifier(),
'GradientBoostingClassifier': GradientBoostingClassifier(),
'XGBClassifier': XGBClassifier(),
'XGBRFClassifier': XGBRFClassifier(),
'LGBMClassifier':LGBMClassifier()}
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
baseline_model_scores
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Score')
plt.xticks(rotation=90);

From the baseline model we can see that the top models are:

  1. SVC- 0.933333
  2. XGBRF Classifier- 0.933333

Hyperparameter Tuning

Random Search CV

from sklearn.model_selection import RandomizedSearchCV

In [30]:

linkcode

def randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
np.random.seed(42)

model_rs_scores = {}
model_rs_best_param = {}

for name, model in models.items():
rs_model = RandomizedSearchCV(model,
param_distributions=params[name],
scoring='f1',
cv=5,
n_iter=20,n_jobs=1,
verbose=0)
rs_model.fit(X_train,y_train)
model_rs_scores[name] = rs_model.score(X_test,y_test)
model_rs_best_param[name] = rs_model.best_params_

return model_rs_scores, model_rs_best_param
models = {'SVC': SVC(),
'XGBRFClassifier': XGBRFClassifier()}

params = {'SVC':{'C' : np.linspace(0.1,0.9, 9),
'kernel':['linear', 'ploy', 'rbf', 'sigmoid'],
'gamma': np.linspace(0,1,11),
},
'XGBRFClassifier':{'n_estimators': [2,5,10,20,50,100,200],
'learning_rate':np.linspace(0,1,11),
'gamma': np.linspace(0,1,11)}
}
model_rs_scores_1, model_rs_best_param_1 = randomsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)model_rs_scores_1
{'SVC': 0.9166666666666666, 'XGBRFClassifier': 0.9072164948453608}
model_rs_best_param_1
{'SVC': {'kernel': 'rbf', 'gamma': 0.7000000000000001, 'C': 0.6},
'XGBRFClassifier': {'n_estimators': 5, 'learning_rate': 0.8, 'gamma': 0.9}}

From the random search CV of 5, we found that the SVC model performs the best with a CV F1 mean score of 91.67% we will based the model evaluation on the SVC.

Model Evaluation

from sklearn.metrics import classification_report, plot_confusion_matrix, plot_roc_curve
from sklearn.model_selection import cross_val_score
model = SVC(kernel='rbf',
gamma=0.7,
C = 0.6)
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

Classification Report

print(classification_report(y_test,y_preds))

Confusion Matrix

plot_confusion_matrix(model,X_test,y_test)

ROC Curve

plot_roc_curve(model,X_test,y_test)

Evaluation using cross-validation

def get_cv_score(model, X, y, cv=5):


cv_accuracy = cross_val_score(model,X,y,cv=5,scoring='accuracy')
print(f'Cross Validaion accuracy Scores: {cv_accuracy}')
print(f'Cross Validation accuracy Mean Score: {cv_accuracy.mean()}')

cv_precision = cross_val_score(model,X,y,cv=5,
scoring='precision')
print(f'Cross Validaion precision Scores: {cv_precision}')
print(f'Cross Validation precision Mean Score: {cv_precision.mean()}')

cv_recall = cross_val_score(model,X,y,cv=5,
scoring='recall')
print(f'Cross Validaion recall Scores: {cv_recall}')
print(f'Cross Validation recall Mean Score: {cv_recall.mean()}')

cv_f1 = cross_val_score(model,X,y,cv=5,scoring='f1')
print(f'Cross Validaion f1 Scores: {cv_f1}')
print(f'Cross Validation f1 Mean Score: {cv_f1.mean()}')

cv_metrics = pd.DataFrame({'Accuracy': cv_accuracy.mean(),
'Precision': cv_precision.mean(),
'Recall': cv_recall.mean(),
'f1': cv_recall.mean()},index=[0])

return cv_metrics
cv_metrics = get_cv_score(model, X_train, y_train, cv=5)
cv_metrics

Result

With the SVC model, we are able to get the following:

Accuracy    0.903571 
Precision 0.841834
Recall 0.895789
f1 0.895789

By this model we can predict the purchase behavior of customers and obtain relevant insights.

Hope you liked the analysis!

You can follow me on Linkedin , Github and Kaggle.

Github Link of this project

https://github.com/ratul442/Ad-click-Prediction

Kaggle Link

https://www.kaggle.com/ratul6

Linkedin Link

--

--

Ratul Ghosh

Data Scientist at Cyient | Data Science | Analytics | ML | AI | Deep Learning | NLP