Lead Prediction using the PyCaret Library



An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than calling everyone.

A lot of leads are generated in the initial stage (the top of the funnel), but only a few of them come out as paying customers at the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.

X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.
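To make the idea of a lead score concrete, here is a minimal sketch with made-up numbers (the scores, outcomes, and cutoff are invented for illustration and do not come from the X Education data). It compares the conversion rate when the sales team calls every lead with the rate when it calls only leads above a hypothetical score cutoff.

```python
# Toy leads as (lead_score from 0 to 100, converted flag). All values are invented.
leads = [
    (96, 1), (91, 1), (88, 0), (84, 1), (79, 1),
    (62, 0), (58, 1), (55, 0), (51, 0), (47, 0),
    (44, 0), (40, 1), (36, 0), (33, 0), (29, 0),
    (25, 0), (21, 0), (17, 0), (12, 0), (8, 0),
]

def conversion_rate(rows):
    """Share of leads in `rows` that converted."""
    return sum(converted for _, converted in rows) / len(rows)

CUTOFF = 75  # hypothetical threshold for calling a lead "hot"
hot_leads = [row for row in leads if row[0] >= CUTOFF]

baseline_rate = conversion_rate(leads)  # sales team calls everyone
hot_rate = conversion_rate(hot_leads)   # sales team calls only hot leads

print(f"All leads: {baseline_rate:.0%}, hot leads only: {hot_rate:.0%}")
# All leads: 30%, hot leads only: 80%
```

In this toy setup, a score that ranks leads well lets the team reach a much higher conversion rate on the leads it actually contacts, at the cost of not contacting the two converters that fall below the cutoff.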

PyCaret Library

PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.

For more details, check out the PyCaret page.


Variables Description

  • Prospect ID — A unique ID with which the customer is identified.
  • Lead Number — A lead number assigned to each lead procured.
  • Lead Origin — The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
  • Lead Source — The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
  • Do Not Email — An indicator variable selected by the customer wherein they select whether or not they want to be emailed about the course.
  • Do Not Call — An indicator variable selected by the customer wherein they select whether or not they want to be called about the course.
  • Converted — The target variable. Indicates whether a lead has been successfully converted or not.
  • Total Visits — The total number of visits made by the customer on the website.
  • Total Time Spent on Website — The total time spent by the customer on the website.
  • Page Views Per Visit — Average number of pages on the website viewed during the visits.
  • Last Activity — Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
  • Country — The country of the customer.
  • Specialization — The industry domain in which the customer worked before. Includes the level ‘Select Specialization’ which means the customer had not selected this option while filling the form.
  • How did you hear about X Education — The source from which the customer heard about X Education.
  • What is your current occupation — Indicates whether the customer is a student, unemployed or employed.
  • What matters most to you in choosing this course — An option selected by the customer indicating their main motive for doing this course.
  • Search — Indicating whether the customer had seen the ad in any of the listed items.
  • Magazine
  • Newspaper Article
  • X Education Forums
  • Newspaper
  • Digital Advertisement
  • Through Recommendations — Indicates whether the customer came in through recommendations.
  • Receive More Updates About Our Courses — Indicates whether the customer chose to receive more updates about the courses.
  • Tags — Tags assigned to customers indicating the current status of the lead.
  • Lead Quality — Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead.
  • Update me on Supply Chain Content — Indicates whether the customer wants updates on the Supply Chain Content.
  • Get updates on DM Content — Indicates whether the customer wants updates on the DM Content.
  • Lead Profile — A lead level assigned to each customer based on their profile.
  • City — The city of the customer.
  • Asymmetrique Activity Index — An index and score assigned to each customer based on their activity and their profile.
  • Asymmetrique Profile Index
  • Asymmetrique Activity Score
  • Asymmetrique Profile Score
  • I agree to pay the amount through cheque — Indicates whether the customer has agreed to pay the amount through cheque or not.
  • A free copy of Mastering The Interview — Indicates whether the customer wants a free copy of ‘Mastering the Interview’ or not.
  • Last Notable Activity — The last notable activity performed by the student.


UpGrad Case Study

Installing the PyCaret Library

!pip install pycaret

Reading the dataset

dataset.shape

(9240, 37)

The dataset has 9240 rows and 37 features.
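The cell that loads the data is not shown above. A minimal sketch, assuming the leads are stored in a CSV file: the file name `Leads.csv` in the comment is a hypothetical placeholder, `load_leads` is a helper introduced only for this illustration, and the two sample rows are invented.

```python
import io

import pandas as pd

def load_leads(source):
    """Read the leads table from a CSV path or a file-like object."""
    return pd.read_csv(source)

# With the real file this would look something like:
#     dataset = load_leads("Leads.csv")   # hypothetical file name
# A two-row in-memory sample stands in for the file so the sketch runs as-is.
sample_csv = io.StringIO(
    "Prospect ID,Lead Number,Converted,TotalVisits\n"
    "a1,1001,0,0.0\n"
    "b2,1002,1,5.0\n"
)
sample = load_leads(sample_csv)
print(sample.shape)  # (rows, columns)
```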

Information about the dataset

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Prospect ID 9240 non-null object
1 Lead Number 9240 non-null int64
2 Lead Origin 9240 non-null object
3 Lead Source 9204 non-null object
4 Do Not Email 9240 non-null object
5 Do Not Call 9240 non-null object
6 Converted 9240 non-null int64
7 TotalVisits 9103 non-null float64
8 Total Time Spent on Website 9240 non-null int64
9 Page Views Per Visit 9103 non-null float64
10 Last Activity 9137 non-null object
11 Country 6779 non-null object
12 Specialization 7802 non-null object
13 How did you hear about X Education 7033 non-null object
14 What is your current occupation 6550 non-null object
15 What matters most to you in choosing a course 6531 non-null object
16 Search 9240 non-null object
17 Magazine 9240 non-null object
18 Newspaper Article 9240 non-null object
19 X Education Forums 9240 non-null object
20 Newspaper 9240 non-null object
21 Digital Advertisement 9240 non-null object
22 Through Recommendations 9240 non-null object
23 Receive More Updates About Our Courses 9240 non-null object
24 Tags 5887 non-null object
25 Lead Quality 4473 non-null object
26 Update me on Supply Chain Content 9240 non-null object
27 Get updates on DM Content 9240 non-null object
28 Lead Profile 6531 non-null object
29 City 7820 non-null object
30 Asymmetrique Activity Index 5022 non-null object
31 Asymmetrique Profile Index 5022 non-null object
32 Asymmetrique Activity Score 5022 non-null float64
33 Asymmetrique Profile Score 5022 non-null float64
34 I agree to pay the amount through cheque 9240 non-null object
35 A free copy of Mastering The Interview 9240 non-null object
36 Last Notable Activity 9240 non-null object
dtypes: float64(4), int64(3), object(30)
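The Non-Null counts above show that several columns have many missing values (for example, Lead Quality has 4473 non-null values out of 9240). A short sketch of how the share of missing values per column could be summarized with pandas, shown on a small invented frame and not on the real data:

```python
import numpy as np
import pandas as pd

# Tiny invented stand-in for `dataset`; only the column names come from the article.
toy = pd.DataFrame({
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
    "TotalVisits": [2.0, np.nan, 5.0, 1.0],
    "Lead Quality": [None, None, "Might be", None],
    "Converted": [0, 1, 1, 0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real `dataset`, the same expression would report roughly 1 - 4473/9240 ≈ 0.52 for Lead Quality, which is worth knowing before deciding how such columns should be handled.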

Preprocessing the dataset

Taking a 70% sample of the data for training and the remaining 30% for testing

data = dataset.sample(frac=0.70, random_state=42)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))

Data for Modeling: (6468, 37)

print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Unseen Data For Predictions: (2772, 37)
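The sample-then-drop pattern used above can be sanity-checked on a small invented frame. The sketch below (the `toy` frame and its `lead_id` column are made up) confirms that the two parts share no rows and together cover the whole table.

```python
import pandas as pd

# Invented ten-row frame standing in for `dataset`.
toy = pd.DataFrame({"lead_id": range(10), "Converted": [0, 1] * 5})

# Same pattern as above: sample 70% for modeling, keep the rest as unseen data.
train = toy.sample(frac=0.70, random_state=42)
unseen = toy.drop(train.index)

# The two parts share no rows and together contain every row.
assert set(train.index).isdisjoint(unseen.index)
assert len(train) + len(unseen) == len(toy)
print(len(train), len(unseen))
```

Note that the indices must be compared before `reset_index` is called, because resetting discards the original row labels that `drop` relies on.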

Importing the classification module from PyCaret

from pycaret.classification import *

Initializing the setup

The target we need to predict is Converted.

clf1 = setup(data = data, target = 'Converted', session_id=1)

For more details on how PyCaret sets up the machine learning experiment, see the PyCaret documentation.

Comparing the results of different classifiers

best_model = compare_models()

The Light GBM classifier gives us the best result, so we move ahead with the LGBM classifier.

lightgbm = create_model('lightgbm')

Printing the Default Hyperparameters of LGBM

print(lightgbm)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=1, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

Applying hyperparameter tuning

tuned_lightgbm = tune_model(lightgbm)

Plotting the ROC-AUC curve for LightGBM

plot_model(tuned_lightgbm, plot = 'auc')
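One way to read the AUC value: it equals the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead, with ties counted as half. The sketch below computes it that way on invented scores. It is meant to illustrate the metric, not to reproduce PyCaret's plot, and `pairwise_auc` is a helper written for this example.

```python
# Invented model scores for converted (positive) and non-converted (negative) leads.
pos_scores = [0.9, 0.8, 0.6, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]

def pairwise_auc(pos, neg):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

auc = pairwise_auc(pos_scores, neg_scores)
print(auc)  # 0.875
```

Here 14 of the 16 positive/negative pairs are ordered correctly, so the AUC is 0.875. The pairwise loop is quadratic and only suitable for small illustrations.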

Feature Importance Plot

plot_model(tuned_lightgbm, plot='feature')

Lead Number has the highest variable importance. Lead Number is the number assigned to each lead procured.
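Lead Number and Prospect ID are identifiers and not behavioural attributes, so a high importance for Lead Number may reflect how the data was collected or ordered instead of something the sales team can act on. One option, offered here as a suggestion and not as part of the original analysis, is to exclude identifier columns before modeling. A minimal pandas sketch on an invented frame:

```python
import pandas as pd

# Invented rows; only the column names are taken from the dataset description.
toy = pd.DataFrame({
    "Prospect ID": ["a1", "b2", "c3"],
    "Lead Number": [1001, 1002, 1003],
    "Total Time Spent on Website": [0, 674, 1532],
    "Converted": [0, 0, 1],
})

ID_COLUMNS = ["Prospect ID", "Lead Number"]  # identifiers, not predictors

# Drop the identifier columns so the model cannot learn from them.
features_only = toy.drop(columns=ID_COLUMNS)
print(list(features_only.columns))
```

PyCaret's `setup` also accepts an `ignore_features` list, which should have a similar effect inside the experiment; check the documentation for the version you use.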

Precision Recall Curve

plot_model(tuned_lightgbm, plot = 'pr')

Confusion Matrix

plot_model(tuned_lightgbm, plot = 'confusion_matrix')
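As a reminder of what the confusion matrix summarizes, the sketch below counts its four cells for a handful of invented labels and derives precision and recall from them. The label lists are illustrative only.

```python
# Invented ground truth and predictions (1 = converted, 0 = not converted).
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # converted, predicted converted
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not converted, predicted converted
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # converted, predicted not converted
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not converted, predicted not converted

precision = tp / (tp + fp)  # of the leads predicted to convert, how many did
recall = tp / (tp + fn)     # of the leads that converted, how many were caught
print(tp, fp, fn, tn, precision, recall)
# 3 1 1 5 0.75 0.75
```

Precision is the quantity closest to the conversion rate the sales team would see if it called only the leads predicted to convert.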

PyCaret's evaluate_model() function can be used to check all the evaluation metrics and charts in one place.


Making predictions with the model


Validating the model on the unseen data

unseen_predictions = predict_model(tuned_lightgbm, data=data_unseen)

This step adds two new columns to the data: Label (the predicted class) and Score (the predicted probability).
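These two columns can be turned into the lead score the business asked for. The sketch below assumes that Score holds the probability of the predicted Label (this has varied between PyCaret versions, so check yours); if Score is already the probability of conversion, the adjustment is unnecessary. The frame and the column name `Lead Score` are invented for illustration.

```python
import pandas as pd

# Invented predictions in the shape described above: a class Label and a Score.
preds = pd.DataFrame({
    "Label": [1, 0, 1, 0],
    "Score": [0.93, 0.88, 0.61, 0.55],
})

# Assumption: Score is the probability of the predicted Label, so for
# Label == 0 the probability of conversion is 1 - Score.
p_convert = preds["Score"].where(preds["Label"] == 1, 1 - preds["Score"])

# Scale to a 0-100 lead score; higher means more likely to convert.
preds["Lead Score"] = (p_convert * 100).round().astype(int)
print(preds.sort_values("Lead Score", ascending=False))
```

Sorting by this score gives the sales team a ranked call list, with the leads most likely to convert at the top.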

Checking the Accuracy on unseen data

from pycaret.utils import check_metric
check_metric(unseen_predictions['Converted'], unseen_predictions['Label'], metric = 'Accuracy')

Accuracy is 0.9455 (about 94.55%) on unseen data.
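Accuracy is simply the share of rows where the predicted label matches the actual one, so the figure can be cross-checked without the helper function. A sketch on invented rows, assuming the Converted and Label column names used above:

```python
import pandas as pd

# Invented stand-in for unseen_predictions.
toy_predictions = pd.DataFrame({
    "Converted": [1, 0, 0, 1, 0, 1, 0, 0],
    "Label":     [1, 0, 0, 0, 0, 1, 0, 1],
})

# Fraction of rows where the prediction and the ground truth agree.
accuracy = (toy_predictions["Converted"] == toy_predictions["Label"]).mean()
print(round(accuracy, 4))  # 0.75
```

This kind of check is handy when the helper import or the prediction column names differ between PyCaret versions.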

Saving the model pipeline

save_model(tuned_lightgbm,'./model')

Transformation Pipeline and Model Succesfully Saved
(Pipeline(memory=None,
display_types=True, features_todrop=[],
numerical_features=[], target='Converted',
boosting_type='gbdt', class_weight=None,
colsample_bytree=1.0, feature_fraction=0.5,
importance_type='split', learning_rate=0.3,
max_depth=-1, min_child_samples=26,
min_child_weight=0.001, min_split_gain=0.8,
n_estimators=230, n_jobs=-1, num_leaves=100,
objective=None, random_state=1, reg_alpha=0.005,
reg_lambda=4, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)]],
verbose=False), './model.pkl')

saved_final_lightgbm_model = load_model('./model')

We can further deploy this model on cloud or on-premises for real time or batch predictions.


We achieved about 94% accuracy on the unseen test set and an AUC of 0.99 for both classes. This model can help X Education plan its business strategy more effectively and grow, for example by focusing on the features that showed the highest importance.

Hope you liked the analysis!

You can follow me on LinkedIn, GitHub and Kaggle.

Github Link for this project

Kaggle Link

Linkedin Link




Ratul Ghosh

Software Engineer at Cyient | Data Science | Analytics | ML | AI | Deep Learning | NLP