Topic Modeling for Research Articles
JANATA-HACK INDEPENDENCE DAY ML HACKATHON
Problem Statement
Researchers have access to large online archives of scientific articles, and as these archives grow, finding relevant articles becomes more difficult. Tagging, or topic modelling, assigns each research article one or more identifying labels, which makes recommendation and search easier.
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
Note that a research article can have more than one topic. The abstracts and titles are sourced from the following six topics:
- Computer Science
- Physics
- Mathematics
- Statistics
- Quantitative Biology
- Quantitative Finance
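Because an article can carry several topics at once, the target is naturally a 0/1 indicator matrix with one column per topic. The sketch below illustrates that representation; the helper name `to_indicator` and the two example articles are hypothetical and not part of the competition data.

```python
import numpy as np

# The six topic columns used as multi-label targets.
TOPICS = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

def to_indicator(topic_sets):
    """Turn a list of topic sets into a 0/1 matrix, one column per topic."""
    matrix = np.zeros((len(topic_sets), len(TOPICS)), dtype=int)
    for row, topics in enumerate(topic_sets):
        for topic in topics:
            matrix[row, TOPICS.index(topic)] = 1
    return matrix

# Hypothetical articles: the second one carries two topics at once.
example = to_indicator([{"Physics"}, {"Computer Science", "Statistics"}])
print(example)
```

Each row can contain any number of ones, which is what separates this multi-label setup from ordinary single-label classification.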
Description
Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
Importing the dataset
train = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv')
test = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv')
samp = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv')
Label-wise Count
l = ['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']
for col in l:
    print(col,':\n',train[col].value_counts())
dic = {'CS' :8594,'Phy' :6013,'Math' :5618,'Stats' :5206,'QB' :587,'QF' :249}
values = dic.values()
total = sum(values)
percent_values = [value * 100. / total for value in values]
print(percent_values)
[32.71785891041992, 22.891841474092967, 21.388053451098337, 19.819545437240645, 2.234743213918605, 0.9479575132295276]
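The counts above are typed in by hand. As a possible alternative, the same percentages can be derived directly from the label columns. This is a minimal sketch on a toy frame; `label_share` is a hypothetical helper, and the toy values are not the real counts.

```python
import pandas as pd

def label_share(frame, label_cols):
    """Percentage of all positive labels that belongs to each column."""
    counts = frame[label_cols].sum()
    return (counts * 100.0 / counts.sum()).round(2)

# Toy stand-in for `train`; the real frame has the six topic columns.
toy = pd.DataFrame({
    "Computer Science": [1, 1, 0, 1],
    "Physics": [0, 0, 1, 0],
    "Statistics": [1, 0, 0, 0],
})
shares = label_share(toy, ["Computer Science", "Physics", "Statistics"])
print(shares)
```

On the real data this would be called as `label_share(train, l)`, assuming the topic columns hold 0/1 values.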
Splitting into train and validation set
X = train.loc[:,['TITLE','ABSTRACT']]
y = train.loc[:,l]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.16, random_state=42)
test = test.drop(['ID'],axis=1)
y_test.reset_index(drop=True,inplace=True)
X_test.reset_index(drop=True,inplace=True)
y1 = np.array(y_train)
y2 = np.array(y_test)
Text Cleaning
X_train.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
X_test.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
test.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
for index in X_train.columns:
    X_train[index] = X_train[index].str.lower()
for index in X_test.columns:
    X_test[index] = X_test[index].str.lower()
for index in test.columns:
    test[index] = test[index].str.lower()
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
X_train['ABSTRACT'] = X_train['ABSTRACT'].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)
X_test['ABSTRACT'] = X_test['ABSTRACT'].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)
test['ABSTRACT'] = test['ABSTRACT'].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)
X_train = X_train.replace(r'\s+', ' ', regex=True)
X_test = X_test.replace(r'\s+', ' ', regex=True)
test = test.replace(r'\s+', ' ', regex=True)
stop_words = set(stopwords.words('english'))
len(stop_words)
X_train['ABSTRACT'] = X_train['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
X_test['ABSTRACT'] = X_test['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
test['ABSTRACT'] = test['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
X_train['combined'] = X_train['TITLE']+' '+X_train['ABSTRACT']
X_test['combined'] = X_test['TITLE']+' '+X_test['ABSTRACT']
test['combined'] = test['TITLE']+' '+test['ABSTRACT']
X_train = X_train.drop(['TITLE','ABSTRACT'],axis=1)
X_test = X_test.drop(['TITLE','ABSTRACT'],axis=1)
test = test.drop(['TITLE','ABSTRACT'],axis=1)
train_lines = []
for row in range(0,X_train.shape[0]):
    train_lines.append(' '.join(str(x) for x in X_train.iloc[row,:]))
test_lines = []
for row in range(0,X_test.shape[0]):
    test_lines.append(' '.join(str(x) for x in X_test.iloc[row,:]))
predtest_lines = []
for row in range(0,test.shape[0]):
    predtest_lines.append(' '.join(str(x) for x in test.iloc[row,:]))
X.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
for index in X.columns:
    X[index] = X[index].str.lower()
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
X['ABSTRACT'] = X['ABSTRACT'].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)
X = X.replace(r'\s+', ' ', regex=True)
X['ABSTRACT'] = X['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
X['combined'] = X['TITLE']+' '+X['ABSTRACT']
Removing title and abstract from the dataset
X = X.drop(['TITLE','ABSTRACT'],axis=1)
X_lines = []
for row in range(0,X.shape[0]):
    X_lines.append(' '.join(str(x) for x in X.iloc[row,:]))
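The cleaning steps above are repeated for each data frame. One way to avoid the repetition is to collect them in a single function and apply it wherever needed. The sketch below is an assumption about how that could look: `clean_text` and the small `demo_stops` set are hypothetical, and unlike the code above it applies stop-word and single-letter removal to the whole string rather than to the abstract only.

```python
import re

def clean_text(text, stop_words=frozenset()):
    """Lowercase, keep letters only, drop one-letter tokens and stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = [t for t in text.split() if len(t) > 1 and t not in stop_words]
    return " ".join(tokens)

# A tiny hand-picked stop-word set, used only for this illustration.
demo_stops = frozenset({"the", "of", "and"})
cleaned = clean_text("The Theory of K-means (2nd ed.) and SVMs!", demo_stops)
print(cleaned)
```

With the NLTK stop-word set from earlier, a column could then be cleaned with something like `train['ABSTRACT'].apply(lambda s: clean_text(s, stop_words))`.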
Converting to Vectors and then Transforming
countvector = CountVectorizer(ngram_range=(4,8),analyzer='char',lowercase=False,strip_accents='unicode')
X_train_cv = countvector.fit_transform(train_lines)
X_test_cv = countvector.transform(test_lines)
test_cv = countvector.transform(predtest_lines)
tfidfvector = TfidfTransformer(sublinear_tf=True,use_idf=True)
# Fit the IDF weights once on the training counts, then only transform the other sets
X_train_tf = tfidfvector.fit_transform(X_train_cv)
X_test_tf = tfidfvector.transform(X_test_cv)
test_tf = tfidfvector.transform(test_cv)
X_cv = countvector.transform(X_lines)
X_tf = tfidfvector.transform(X_cv)
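The vocabulary and the IDF weights should be learned from the training text only and then reused on validation and test text, so that all sets share one feature space. The following is a minimal sketch of that pattern with the same vectorizer settings; the two mini-corpora are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpora standing in for train_lines and test_lines.
train_docs = ["graph neural networks", "quantum field theory", "bayesian inference"]
valid_docs = ["quantum networks"]

counts = CountVectorizer(ngram_range=(4, 8), analyzer="char", lowercase=False)
tfidf = TfidfTransformer(sublinear_tf=True, use_idf=True)

# Learn the character n-gram vocabulary and IDF weights from the training text only.
train_tf = tfidf.fit_transform(counts.fit_transform(train_docs))
# Reuse the fitted vocabulary and weights; nothing is refit here.
valid_tf = tfidf.transform(counts.transform(valid_docs))

print(train_tf.shape, valid_tf.shape)
```

Both matrices end up with the same number of columns, which is what the classifier expects at prediction time.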
Multi-Output Classifier with Linear SVC
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier
model = LinearSVC(class_weight='balanced',loss="hinge",fit_intercept=False)
models = MultiOutputClassifier(model)
models.fit(X_tf, y)
preds = models.predict(X_test_tf)
print(classification_report(y2,preds))
print(accuracy_score(y2,preds))
predssv = models.predict(test_tf)
test1 = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv')
submit = pd.DataFrame({'ID': test1.ID, 'Computer Science': predssv[:,0],'Physics':predssv[:,1],'Mathematics':predssv[:,2],'Statistics':predssv[:,3],'Quantitative Biology':predssv[:,4],'Quantitative Finance':predssv[:,5]})
submit.to_csv('submission.csv', index=False)
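For multi-label targets, `accuracy_score` reports subset accuracy: an article counts as correct only when all of its labels match. It can help to look at label-level metrics alongside it. The sketch below compares the two on made-up labels; the arrays are hypothetical and unrelated to the actual predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical labels for 4 articles and 3 topics.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# A row counts as correct only if every label in it matches.
subset_acc = accuracy_score(y_true, y_pred)
# Micro-averaging pools true/false positives over all labels.
micro_f1 = f1_score(y_true, y_pred, average="micro")
# Fraction of individual label cells that are wrong.
ham = hamming_loss(y_true, y_pred)

print(subset_acc, round(micro_f1, 3), round(ham, 3))
```

Here one missed label out of twelve cells lowers subset accuracy to 0.75, while the micro-averaged F1 stays near 0.89.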
Results
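With the Multi-Output Classifier wrapping a Linear SVC, we get an accuracy of 90% on the validation split, with good F1-scores for all classes. Because the model above is fit on the full training set (X_tf), which includes the validation rows, this score is likely optimistic. Tuning hyperparameters, such as the regularization strength C of the Linear SVC, may improve the scores further.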
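One possible way to tune the model is a cross-validated grid search over C. The sketch below is an illustration under stated assumptions, not the tuning actually used here: it runs on synthetic data from `make_multilabel_classification`, and the grid values, the `f1_micro` scorer, and `max_iter=5000` are arbitrary choices.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

# Synthetic multi-label data, a stand-in for the TF-IDF features and topic labels.
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=0
)

base = MultiOutputClassifier(LinearSVC(class_weight="balanced", max_iter=5000))
# The `estimator__` prefix reaches the LinearSVC wrapped by MultiOutputClassifier.
grid = GridSearchCV(
    base,
    param_grid={"estimator__C": [0.01, 0.1, 1.0]},
    scoring="f1_micro",
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

On the real features, the search would be fit on the training split only (for example `X_train_tf` and `y1`), so that the validation split remains an honest check.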
Hope you liked the analysis!
Dataset Link
https://www.kaggle.com/vin1234/janatahack-independence-day-2020-ml-hackathon
Link of the project
https://colab.research.google.com/drive/174jR-jarRwSw2hnu1OyQ8pvkKT_RPM7e#scrollTo=5gkTGnXT3N2W