Wind Power Forecasting
To forecast wind power that can be generated for the next 15 days
In this article we’re going to learn how to predict the wind power that could be generated from the windmill for the next 15 days. We’ll use ARIMA model for forecasting.
Context
Here’s data of a certain windmill. The aim was to predict the wind power that could be generated from the windmill for the next 15 days. A long term wind forecasting technique is thus required.
Content
It contains various weather, turbine and rotor features. Data has been recorded from January 2018 till March 2020. Readings have been recorded at a 10-minute interval.
References
https://www.hindawi.com/journals/mpe/2010/684742/
https://www.sciencedirect.com/science/article/abs/pii/S0925231201007020
Inspiration
Renewable energy remains one of the most important topics for a sustainable future. Wind, being a perennial source of power, could be utilized to satisfy our power requirements. With the rise of wind farms, wind power forecasting would prove to be quite useful.
Importing the Libraries
import pandas as pd
from pandas import Series
import numpy as np
import datetime
from pandas_datareader import data as pdr
import matplotlib.pyplot as plt
import seaborn
from fbprophet import Prophet
import statsmodels.api as sm
import statsmodels.tsa as ts
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from matplotlib import pyplot
import itertools as it
from matplotlib import mlab
Reading the dataset
df = pd.read_csv('../input/wind-power-forecasting/Turbine_Data.csv')
df.tail()
Information about the dataset
df.info()
Plot between Timestamp and Active Power
ig, ax = plt.subplots()
ax.scatter(df["Unnamed: 0"][:1000], df["ActivePower"][:1000])
Histogram for Active Power
df.ActivePower.plot.hist()
Parsing dates
When working with time series data, it’s a good idea to make sure any date data is the format of a datetime object (a Python data type which encodes specific information about dates).
df = pd.read_csv("../input/wind-power-forecasting/Turbine_Data.csv",
low_memory=False, parse_dates=["Unnamed: 0"])# Changing the column namedf['DateTime'] = df['Unnamed: 0']
df.drop('Unnamed: 0', axis=1, inplace=True)# Add datetime parameters df['DateTime'] = pd.to_datetime(df['DateTime'],
format = '%Y-%m-%dT%H:%M:%SZ', errors = 'coerce')
df['year'] = df['DateTime'].dt.year
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['hour'] = df['DateTime'].dt.hour
df['minute'] = df['DateTime'].dt.minute# Dropping original DateTime columndf.drop('DateTime', axis=1, inplace= True)
Filling numeric rows with median
for label, content in df.items():
if pd.api.types.is_numeric_dtype(content):
if pd.isnull(content).sum():# Adding a binary column which tells if the data was missing our not df[label+"_is_missing"] = pd.isnull(content)
df[label] = content.fillna(content.median())
Turning categorical variables into numbers
for label, content in df.items():
# Checking columns which aren't numeric
if not pd.api.types.is_numeric_dtype(content):# Add binary column to indicate whether sample had missing value
df[label+"_is_missing"] = pd.isnull(content)# We add the +1 because pandas encodes missing categories as -1
df[label] = pd.Categorical(content).codes+1
Testing For Stationarity
from statsmodels.tsa.stattools import adfuller
test_result=adfuller(df['ActivePower'])
#Ho: It is non stationary
#H1: It is stationary
def adfuller_test(Power):
result=adfuller(Power)
labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']
for value,label in zip(result,labels):
print(label+' : '+str(value) )
if result[1] <= 0.05:
print("strong evidence against the null hypothesis(Ho), reject the null hypothesis. Data has no unit root and is stationary")
else:
print("weak evidence against null hypothesis, time series has a unit root, indicating it is non-stationary ")
adfuller_test(df['ActivePower'])
Analysis of ACF and PACF on Close Price
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['ActivePower'], lags=30, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['ActivePower'], lags=30, ax=ax2)
plt.xlabel('Time lag')
plt.show()
Analysis of ACF and PACF on difference Close Price
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['ActivePower'].diff().dropna(), lags=30, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['ActivePower'].diff().dropna(), lags=30, ax=ax2)
plt.xlabel('Time lag')
plt.show()
Modelling
from statsmodels.tsa.arima_model import ARIMA
model=ARIMA(df['ActivePower'][:5000],order=(2,0,3))
model_fit=model.fit()# Plot residual errorsresiduals = pd.DataFrame(model_fit.resid)
fig, ax = plt.subplots(1,2)
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])
plt.show()
# Actual vs Fitted
model_fit.plot_predict(dynamic=False )
plt.show()
Splitting data into test and train datasets
from sklearn.model_selection import train_test_split
train = df['ActivePower'][:1000]
test = df['ActivePower'][1000:1015]# Forecastfc, se, conf = model_fit.forecast(15, alpha=0.05) # 95% confidence interval
# Make as pandas seriesfc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)
Metrics to Check the Forecasted Model
def forecast_accuracy(forecast, actual):
mape = np.mean(np.abs(forecast - actual)/np.abs(actual)) # MAPE
me = np.mean(forecast - actual) # Mean
mae = np.mean(np.abs(forecast - actual)) # Mean Absolute Error
mpe = np.mean((forecast - actual)/actual)# Mean Percentage Error
rmse = np.mean((forecast - actual)**2)**.5 # RMSE
corr = np.corrcoef(forecast, actual)[0,1] # corr
mins = np.amin(np.hstack([forecast[:,None],
actual[:,None]]), axis=1)
maxs = np.amax(np.hstack([forecast[:,None],
actual[:,None]]), axis=1)
minmax = 1 - np.mean(mins/maxs) # Min-Max
return({'mape':mape, 'me':me, 'mae': mae,
'mpe': mpe, 'rmse':rmse,
'corr':corr, 'minmax':minmax})
forecast_accuracy(fc, test.values)
Result
Around 2.6% MAPE implies the model is about 97.4% accurate in predicting the next 15 observations.