Website Classification Prediction Using a Naive Bayes Classifier

Ratul Ghosh
6 min read · Jun 13, 2021

Problem Statement

The aim of this exercise is to classify websites based on their content. We have a dataset that contains a list of websites, their category, and the text extracted from them.

We will use Natural Language Processing techniques to develop features from the dataset and then build a Naive Bayes model on top of them to classify the websites.

Importing necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB

import nltk
#nltk.download()
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')

Reading the dataset

import os

# dirname and filename are assumed to come from Kaggle's standard input-directory walk
# (os.walk over /kaggle/input); adjust the path for your own environment.
data = pd.read_csv(os.path.join(dirname, filename))
data.head()

Context

This dataset was created by scraping different websites and then classifying them into different categories based on the extracted text.

Content

Below are the columns in the dataset. The column names are pretty self-explanatory.

website_url: URL of the website.

cleaned_website_text: the cleaned text content extracted from the website.

Category: the target feature (website category).

Data exploration and understanding

data.shape
(1408, 4)

There are 1408 rows and 4 columns in the dataset.

The 'Unnamed: 0' column is just a serial number and is of no use, so we drop it.

data = data.drop('Unnamed: 0', axis=1)
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1408 entries, 0 to 1407
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   website_url           1408 non-null   object
 1   cleaned_website_text  1408 non-null   object
 2   Category              1408 non-null   object
dtypes: object(3)

There are no null values in the dataset, so we can move ahead.

Count of Unique Website URL

data.website_url.nunique()
1384

The website_url column has 1384 unique entries, which means the dataset contains some duplicate entries. We drop those entries, as sketched below.
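A minimal sketch of dropping the duplicate URLs (keeping the first occurrence is an assumption; one could also inspect the duplicates first):

# Drop rows that repeat a website_url, keeping the first occurrence (assumed strategy)
data = data.drop_duplicates(subset='website_url', keep='first').reset_index(drop=True)
data.shape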

Value Count of Target feature

data.Category.value_counts(dropna=False)

Education                          114
Business/Corporate                 108
Travel                             107
Streaming Services                 104
E-Commerce                         101
Sports                             100
Games                               98
News                                94
Food                                92
Photography                        92
Computers and Technology            91
Health and Fitness                  88
Law and Government                  83
Social Networking and Messaging     80
Adult                               16
Forums                              16
Name: Category, dtype: int64

The Category column has 16 unique values, so this is a multi-class classification problem. We therefore use the Multinomial Naive Bayes algorithm, which is efficient and handles multi-class problems well.
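As a quick visual check of the class imbalance (a small sketch using the libraries already imported; note the much smaller Adult and Forums classes):

# Horizontal bar chart of category counts to visualize the class imbalance
data.Category.value_counts().plot(kind='barh', figsize=(8, 6), title='Category distribution')
plt.show()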

The cleaned_website_text column contains the text content of each website. We use this column to develop features for our model.

Data Preparation

Feature matrix development method:

We extract words from the cleaned_website_text column and create an array of those words. From that array, we filter out English stop words, which generally give no meaningful insight about a document. From the remaining words we choose the top n most frequent words and treat them as features.

The next step is to assign a feature value to each entry. For this, we use the TF-IDF method.

TF-IDF is a method of measuring how relevant a word is to a particular document (here, a website) within a collection of documents.

To know more about TF-IDF, visit this link: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

where,

TF(t, d) = (count of t in d) / (number of words in d)

IDF(t, D) = log(|D| / (df + 1))

t → term of interest

d → document of interest

D → set of documents (|D| is the number of documents)

df → number of documents in D that contain t

To summarize, the TF-IDF score of a feature for a particular document is high if the term occurs rarely in other documents and/or occurs relatively frequently in the given document.
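As a toy illustration of these formulas (the three mini "documents" below are made up for demonstration and are not from the dataset):

import math

docs = ["travel flight hotel travel",
        "hotel booking price",
        "news sports news"]
term, doc = "travel", docs[0]

tf = doc.split().count(term) / len(doc.split())    # TF(t, d) = 2/4 = 0.5
df = sum(term in d.split() for d in docs)          # number of documents containing t = 1
idf = math.log(len(docs) / (df + 1))               # IDF(t, D) = log(3/2) ≈ 0.405
print(tf * idf)                                    # TF-IDF ≈ 0.20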

Before building the TF-IDF array, we filter out words whose part of speech (POS) is anything other than a noun. The reasoning is that nouns represent website content better than other parts of speech.

Here we use the NLTK library to tokenize the text of each website and assign a POS tag to each word. We then keep only the NOUN-tagged words and join them back into a corpus. The filtered corpus is not very readable for us, but it can enhance the performance of our model.

Filtering noun words from corpus

filtered_words = []
for i in range(data.shape[0]):
    tagged = nltk.pos_tag(word_tokenize(data.cleaned_website_text[i]), tagset='universal')
    filtered_words.append(' '.join([word_tag[0] for word_tag in tagged if word_tag[1] == 'NOUN']))

Universal part-of-speech tags are the POS marks used in Universal Dependencies (UD), a project that is developing cross-linguistically consistent treebank annotation for many languages.
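A small example of what the universal tagset output looks like (the sample sentence is made up; exact tags may vary with the tagger version):

# Requires NLTK's 'punkt', 'averaged_perceptron_tagger' and 'universal_tagset' resources
nltk.pos_tag(word_tokenize("The academy offers online courses"), tagset='universal')
# roughly: [('The', 'DET'), ('academy', 'NOUN'), ('offers', 'VERB'), ('online', 'ADJ'), ('courses', 'NOUN')]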

Adding filtered words to our Dataframe

data['filtered_words'] = filtered_words
data.head()

Let’s develop the TF-IDF based array using the filtered_words column from our dataset. Here we have selected the following hyperparameters.

  • min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. We use 5 as the minimum document frequency.
  • stop_words: remove English stop words if they are still present in the filtered words; stop words add no meaning to the model.
  • max_features: the maximum number of features to include. We use 1500 features.

Note: the Naive Bayes algorithm can easily handle a large number of features relative to the data size.

tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        stop_words='english',
                        max_features=1500)
feat = tfidf.fit_transform(data.filtered_words)

# Feature names (in newer scikit-learn versions this is get_feature_names_out())
tfidf.get_feature_names()

Converting tf-idf vector to dataframe

X = pd.DataFrame(feat.toarray(), columns = tfidf.get_feature_names())
X.head()

Encoding labels and transforming y

le = preprocessing.LabelEncoder()
y = le.fit_transform(data.Category)
y
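The encoder keeps the mapping between the integer labels and the original category names, which is handy later when reading the confusion matrix; a quick sketch:

# Map encoded labels back to category names
le.classes_                     # index i corresponds to encoded label i
le.inverse_transform([0, 6])    # e.g. the categories behind labels 0 and 6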

Concatenating X and Category into a DataFrame

df_f = pd.concat([X, data.Category], axis=1)
df_f.head()

Groupwise plot of category

plt.subplots(figsize=(10, 15))
for i, col in enumerate(df_f.columns[0:5]):
    plt.subplot(int(df_f.columns[0:5].shape[0] / 2) + 1, 2, i + 1)
    df_f.groupby('Category').mean()[col].plot(kind='barh', title=col)

Observations from the above plot:

  • The term accessories is most related to E-Commerce.
  • The term access is most related to Streaming Services.
  • The term ability is most related to Business/Corporate.
  • The term academy is most related to Law and Government.
  • The term accessibility is most related to Law and Government.

Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=100,
                                                    stratify=y)

Model Building, Training and Evaluation

model = MultinomialNB()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

Accuracy check

print('Train accuracy:', round(accuracy_score(y_train, y_train_pred),2))
print('Test accuracy:', round(accuracy_score(y_test, y_test_pred),2))

Train accuracy: 0.93
Test accuracy: 0.9

Confusion Matrix of Train set

plt.subplots(figsize=(12, 8))
sns.heatmap(confusion_matrix(y_train, y_train_pred),
            annot=True,
            cmap='YlGnBu')

Confusion Matrix of Test set

plt.subplots(figsize=(12, 8))
sns.heatmap(confusion_matrix(y_test, y_test_pred),
            annot=True,
            cmap='YlGnBu')

The confusion matrices indicate that:

  • The model does not estimate categories 0 and 6 correctly, in either training or testing; the likely reason is their lower frequency. A per-class report (sketched just below) makes this easier to quantify.
  • The model is clearly not overfitting, which is a good sign.
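A minimal sketch of that per-class report using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 on the test set, with readable category names
print(classification_report(y_test, y_test_pred, target_names=le.classes_))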

Conclusion/Recommendations

  • NLP can be used for text processing and feature development in ML model building.
  • The Multinomial Naive Bayes algorithm can perform multi-class classification with a large number of classes.
  • Naive Bayes in general handles a large number of features relative to the data size quite easily.
  • For future work, we can collect more data entries for the low-frequency classes and thereby improve the model further.
  • We achieved 90% test accuracy.

Hope you liked the analysis!

You can follow me on LinkedIn, GitHub and Kaggle.

GitHub Link

https://github.com/ratul442

Kaggle Link

https://www.kaggle.com/ratul6

LinkedIn Link

https://www.linkedin.com/in/ratul-ghosh-8048a8148/
