Demand Forecasting of Global Superstore Using SARIMAX
Context
Retail dataset of a global superstore for 4 years.
Perform EDA and predict sales for the 6 months following the last date in the training dataset.
Content
Time series analysis works with data indexed by time to extract patterns and other characteristics that support prediction. A forecasting model uses previous observations to estimate future values over a short horizon. It is widely applied to non-stationary data such as economic indicators, weather measurements, stock prices, and retail sales.
About SARIMAX
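Seasonal Autoregressive Integrated Moving Average (SARIMA, or Seasonal ARIMA) is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component. SARIMAX is the same model with optional exogenous regressors (the X); no exogenous variables are used in this analysis, so the two coincide here.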
It adds three new hyperparameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.
How to Configure SARIMA
Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.
Trend Elements
There are three trend elements that require configuration.
They are the same as the ARIMA model; specifically:
- p: Trend autoregression order.
- d: Trend difference order.
- q: Trend moving average order.
Seasonal Elements
There are four seasonal elements that are not part of ARIMA that must be configured; they are:
- P: Seasonal autoregressive order.
- D: Seasonal difference order.
- Q: Seasonal moving average order.
- m: The number of time steps for a single seasonal period.
Together, the notation for a SARIMA model is specified as:
SARIMA(p,d,q)(P,D,Q)m
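To make the roles of d, D and m more concrete, here is a minimal sketch in plain Python of what the two kinds of differencing do to a series. The helper name difference and the toy numbers are assumptions made for illustration only; statsmodels performs this step internally when d or D is non-zero.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy series (hypothetical): a linear trend of +1 per step plus a pattern
# that repeats every m = 4 steps.
m = 4
pattern = [0, 5, 0, -5]
series = [t + pattern[t % m] for t in range(12)]

# d = 1: ordinary differencing removes the linear trend but keeps the seasonal swings.
trend_diff = difference(series, lag=1)

# D = 1 with period m: each point is compared with the one m steps earlier,
# which removes the repeating pattern and leaves a constant.
seasonal_diff = difference(series, lag=m)

print(trend_diff)
print(seasonal_diff)
```

In this toy case the seasonally differenced series is constant, which is the kind of simplification the seasonal difference order is meant to achieve before the AR and MA terms are fitted.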
Dataset
The dataset is easy to understand and is self-explanatory.
Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')
Reading the Dataset
data = pd.read_csv("../input/superstore-data-demand-forecasting/superstore.csv")
data = data.drop_duplicates()
data.shape
(9800, 22)
data.columns
Index(['Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID',
       'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
       'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
       'Sales', 'Quantity', 'Discount', 'Profit', 'Price'],
      dtype='object')
Exploratory Data Analysis
def plotbarcharts(dataset, columns):
    # Draw one bar chart of row counts per category for each requested column
    fig, subplot = plt.subplots(nrows=1, ncols=len(columns), figsize=(18, 5))
    fig.suptitle('Bar Chart for ' + str(columns))
    for plotnumber, columnname in enumerate(columns):
        dataset.groupby(columnname).size().plot(kind='bar', ax=subplot[plotnumber])

columnsList1 = ['Ship Mode', 'Region']
columnsList2 = ['Region', 'Category', 'Sub-Category']
plotbarcharts(data, columnsList1)
- Most orders use the Standard Class ship mode.
- Most orders come from the West region, followed by the East region.
plotbarcharts(data,columnsList2)
- Office Supplies has the highest count among categories.
- Among sub-categories, Binders has the highest count, followed by Paper and Furnishings.
data.groupby(['State']).size().plot(kind='bar',figsize=(18,8))
- Most of the orders are coming from California followed by New York and Texas.
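The Month and Year columns used in the next plots do not appear in the column list printed earlier, so they were presumably derived from Order Date in a step that is not shown. A minimal sketch of one way to do that, on a toy frame: the day/month/year date format and the sample values are assumptions, not taken from the real CSV.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
# The "%d/%m/%Y" format is an assumption; adjust it to match the actual file.
toy = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2016", "11/10/2015"],
    "Sales": [100.0, 50.0, 75.0],
})

# Parse the strings into timestamps, then pull out the calendar parts.
toy["Order Date"] = pd.to_datetime(toy["Order Date"], format="%d/%m/%Y")
toy["Year"] = toy["Order Date"].dt.year
toy["Month"] = toy["Order Date"].dt.month

print(toy[["Order Date", "Year", "Month"]])
```

Applied to the full dataset, the same three assignments on data would provide the columns that the groupby and crosstab calls rely on.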
data.groupby(['Month']).size().plot(kind='bar')
- Most orders were placed in November.
data.groupby(['Year']).size().plot(kind='bar')
- Year 2018 received the highest number of orders.
data.set_index("Order Date", inplace = True)
data['Sales'].plot()
Aggregating total sales for each month across all categories
pd.crosstab(columns=data['Month'],
            index=data['Year'],
            values=data['Sales'],
            aggfunc='sum')
import matplotlib.pyplot as plt
SalesQuantitiy=pd.crosstab(columns=data['Year'],
index=data['Month'],
values=data['Sales'],
aggfunc='sum').melt()['value']
MonthNames=['Jan','Feb','Mar','Apr','May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']*4
# Plotting the sales
%matplotlib inline
SalesQuantitiy.plot(kind='line', figsize=(16,5), title='Total Sales Quantity per month')
# Setting the x-axis labels
plotLabels=plt.xticks(np.arange(0,48,1),MonthNames, rotation=30)
- There is a clear seasonal pattern in the dataset.
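The crosstab-and-melt step that builds the monthly series can be hard to follow on the full data, so here is a small sketch of the same idea on a toy frame. The values are hypothetical; only the pandas calls mirror the ones used above.

```python
import pandas as pd

# Toy orders (hypothetical values): two years, three months,
# with two orders falling in the same month of 2015.
orders = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2015, 2016, 2016, 2016],
    "Month": [1, 1, 2, 3, 1, 2, 3],
    "Sales": [10.0, 5.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Rows are months, columns are years, cells are summed sales.
table = pd.crosstab(index=orders["Month"], columns=orders["Year"],
                    values=orders["Sales"], aggfunc="sum")

# melt() stacks the year columns one after another, so the values come out
# as all months of 2015 followed by all months of 2016.
monthly = table.melt()["value"]
print(monthly.tolist())
```

Putting Month on the index and Year on the columns is what keeps the melted values in chronological order; with the two swapped, melt would group all Januaries together instead.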
Seasonal Decompose
from statsmodels.tsa.seasonal import seasonal_decompose
series = SalesQuantitiy.values
# period=12: one seasonal cycle per year of monthly data (replaces the deprecated freq argument)
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
CurrentFig=plt.gcf()
CurrentFig.set_size_inches(11,8)
plt.show()
Applying SARIMAX
Training the model on full dataset
SarimaxModel = SARIMAX(SalesQuantitiy,
                       order = (5, 1, 10),
                       seasonal_order = (1, 0, 0, 12))
SalesModel = SarimaxModel.fit()
Forecasting for the next 6 months
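# end is inclusive, so len + 5 gives exactly six steps beyond the observed data.
# SARIMAX predictions are already on the original scale, so no typ argument is needed.
forecast = SalesModel.predict(start = 0,
                              end = len(SalesQuantitiy) + 5).rename('Forecast')
print("Next Six Month Forecast:",forecast[-6:])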
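Because the end argument of predict is an inclusive position, a six-step horizon after n observations ends at position n + 5. The index arithmetic can be sketched in plain Python; the variable names are illustrative, and the month labels assume the series starts in January, as the MonthNames list above does.

```python
n_obs = 48      # four years of monthly totals
horizon = 6     # number of months to forecast

# Inclusive end position: the last forecast step sits at n_obs + horizon - 1.
start, end = 0, n_obs + horizon - 1
positions = list(range(start, end + 1))

# The out-of-sample steps are the last `horizon` positions.
future = positions[-horizon:]

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
labels = [month_names[p % 12] for p in future]

print(len(positions), future, labels)
```

With 48 observations this gives 54 positions in total, which matches the 54 tick labels used when plotting the forecast.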
Plotting the forecasted values
SalesQuantitiy.plot(figsize = (18, 5), legend = True, title='Time Series Sales Forecasts')
forecast.plot(legend = True, figsize=(18,5))
Measuring the Accuracy of the model
MAPE=np.mean(abs(SalesQuantitiy-forecast)/SalesQuantitiy)*100
print('#### Accuracy of model:', round(100-MAPE,2), '####')
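For readers unfamiliar with the metric, here is a small worked example of the mean absolute percentage error (MAPE) in plain Python. The function name mape and the numbers are illustrative assumptions, not values from the superstore data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent, over paired values."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

# Hypothetical values: the three errors are 10%, 10% and 0% of the actuals.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

score = mape(actual, predicted)
print(round(score, 2), round(100 - score, 2))
```

Note that MAPE is undefined when an actual value is zero, and that scoring a model on the same data it was trained on measures fit rather than forecasting ability.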
Printing month names on the X-axis
MonthNames=MonthNames+MonthNames[0:6]
plotLabels=plt.xticks(np.arange(0,54,1),MonthNames, rotation=30)
Results
With SARIMAX we get an accuracy (100 - MAPE) of about 76%. Because the model was trained and scored on the same full dataset, this figure describes in-sample fit rather than out-of-sample forecast accuracy. Hyperparameter tuning may improve it, and other time series techniques such as Facebook Prophet, Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) are also worth trying.
Hope you liked the analysis!
You can follow me on LinkedIn, GitHub and Kaggle.
Github Link
Dataset Link
https://www.kaggle.com/bravehart101/sample-supermarket-dataset/code
Link of this project
https://colab.research.google.com/drive/1r55Hty6sb5gXDSPVLaMxfU0hV3pXrCnj#scrollTo=technical-spell