
The task of time series forecasting using machine learning will be covered in this post. Through a concrete example, I will demonstrate how one could seemingly have a good model and decide to put it into production, whereas in reality, the model might have no predictive power at all. Common accuracy metrics, if applied without caution, can be very misleading.

Time series forecasting is an important area of machine learning. There are so many prediction problems that involve a time component. The time component makes time series problems more difficult to handle compared to other prediction tasks.

Time series forecasting can be done with machine learning models.

There are different types of models that can be used. In this example, I used a Long short-term memory (LSTM) network, which is a special kind of neural network that makes predictions according to the data of previous times. LSTM networks are popular for a wide range of tasks. However, in my experience, simpler types of models often provide just as accurate predictions. In some of these models, temporal information can be included through a set of delays that are added to the input, so that the data is represented at different points in time. Because the time dependence is then carried by the delayed inputs, such models can be implemented as feed-forward neural networks instead of recurrent neural networks.
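The idea of adding delayed copies of the series to the input can be sketched as follows. This is a minimal illustration, not the code behind the results in this post: the helper name `make_lagged_features` and the choice of three lags are assumptions made for the example.

```python
import numpy as np

def make_lagged_features(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values preceding y."""
    series = np.asarray(series, dtype=float)
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    # Column k holds the value (n_lags - k) steps before the target.
    X = np.column_stack(
        [series[k : len(series) - n_lags + k] for k in range(n_lags)]
    )
    y = series[n_lags:]
    return X, y

# Toy series, used only to show the shape of the result.
X, y = make_lagged_features([1, 2, 3, 4, 5, 6], n_lags=3)
print(X)  # rows: [1 2 3], [2 3 4], [3 4 5]
print(y)  # targets: 4, 5, 6
```

Any regression model that accepts a feature matrix can then be trained on `X` and `y`, which is why no recurrent architecture is needed for this kind of setup.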

How to use open source software libraries to implement models.

I usually define my neural network type of models using a high-level neural networks API, written in Python and capable of running on top of TensorFlow or Theano. For other types of models, I usually use Scikit-Learn, a free software machine learning library that features various classification, regression and clustering algorithms.

The main topic of this article is how to evaluate the model predictions, not how to implement a time series forecasting model. I won’t go into the details of model building. There are many other posts and articles on those subjects.

Predicting time series data is an example case.

The figure below shows the example data used in this case. I will get back to the data later, but for now, let’s assume it represents, for example, the yearly evolution of a stock index. The first 250 days of the data are used as training data for the model, and we try to predict the stock index during the last part of the data.

As I don’t focus on model implementation in this article, let’s move on to the process of evaluating the model accuracy. The model predictions seem to follow the real index closely, which indicates a good accuracy. To be a bit more precise, we can plot the real vs. predicted values in a scatter plot and also calculate a common error metric, the R2 score.
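For reference, the R2 score compares the squared prediction errors with the squared deviations of the real values from their own mean. The sketch below writes out that standard definition with NumPy; the function name `r2_score` and the toy numbers are chosen for this illustration only.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R2 = 1 - (sum of squared errors) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A perfect prediction scores 1, always predicting the mean scores 0.
perfect = r2_score([1, 2, 3, 4], [1, 2, 3, 4])
mean_only = r2_score([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

A score close to 1 therefore only says that the predictions are much closer to the real values than the overall mean is. It says nothing about whether the model has learned anything about the future.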

The model predictions gave us a high R2 score, which seemingly indicates a good match between the real and predicted values. However, as discussed in the rest of this post, this metric and way of evaluating the model can be very misleading.

This is incorrect.

From the figures and the error metric above, the model is apparently giving accurate predictions. However, this is not the case at all, and it is an example of how choosing the wrong accuracy metric can be very misleading when evaluating model performance. For the sake of illustration, the data was explicitly chosen to represent data that can’t be predicted. The data I called the stock index was modeled using a random walk process. A random walk is a completely random process, so it is not possible to use historical data as a training set in order to learn the behavior and predict future outcomes. Yet the model is seemingly giving us accurate predictions. How could that be?
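A random walk of this kind can be simulated in a few lines: each value is the previous value plus a random step. The sketch below is an illustration under assumptions, not the code used to create the data in this post. The Gaussian step distribution, the series length of 365 days, the start level of 100 and the seed are all made up for the example; only the 250-day training split comes from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily

n_days = 365   # assumed: one year of daily values
n_train = 250  # first 250 days are used for training, as in the text

# Each day's change is an independent random draw (assumed Gaussian).
steps = rng.normal(loc=0.0, scale=1.0, size=n_days)

# The index is the running sum of the steps; 100.0 is an arbitrary start level.
index = 100.0 + np.cumsum(steps)

train, test = index[:n_train], index[n_train:]
print(len(train), len(test))
```

Because every step is drawn independently of the past, nothing in `train` carries information about the direction of the changes in `test`.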

Time-delayed predictions and autocorrelations.

Time series data is different from other types of data because of the temporal aspect. This gives us additional information that can be used when building our machine learning model: not only do the input features contain useful information, but so do the changes in input/output over time. However, while the time component adds information, it also makes time series problems more difficult to handle compared to other prediction tasks.

In this example, I used an LSTM network to make predictions based on previous data. When we zoom in on the predictions, we can see what the model is actually doing.

Time series data tend to be correlated in time. That means that the index at time t+1 is likely to be close to the index at time t. When predicting the value at time t+1, the model simply uses the value at time t as its prediction (often referred to as the persistence model), as shown in the above figure to the right. In other words, the model simply uses the previous value as the prediction for the future, as shown by the peak at a time lag of 1 day.
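One way to check for this kind of delayed prediction is to correlate the real values with the predictions at several time shifts and see where the correlation peaks. The sketch below does this for an exact persistence forecast on a simulated random walk. The helper `lagged_correlation`, the series length and the seed are assumptions made for this illustration; the post's own figure may have been produced differently.

```python
import numpy as np

rng = np.random.default_rng(seed=1)       # arbitrary seed
series = np.cumsum(rng.normal(size=500))  # simulated random walk

actual = series[1:]      # value at time t+1
predicted = series[:-1]  # persistence forecast: the value at time t

def lagged_correlation(actual, predicted, lag):
    """Correlation between actual[t] and predicted[t + lag]."""
    if lag == 0:
        return np.corrcoef(actual, predicted)[0, 1]
    return np.corrcoef(actual[:-lag], predicted[lag:])[0, 1]

lags = range(0, 6)
correlations = [lagged_correlation(actual, predicted, lag) for lag in lags]
best_lag = int(np.argmax(correlations))
print("correlation peaks at a lag of", best_lag, "step(s)")
```

A peak at a lag of one step means the prediction is a copy of the real series delayed by one step, which is exactly the signature of a persistence model.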

When used wrong, accuracy metrics can be very misleading.

When evaluating the model in terms of its ability to predict the value directly, the mean percentage error and R2 score both indicate a high prediction accuracy. However, as the example data is generated through a random walk process, the model can’t possibly predict future outcomes. One can easily be fooled into being overly confident in the model accuracy by simply calculating common error metrics directly on the predicted values.
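To see how small a percentage error can be with no predictive power at all, the sketch below scores a pure persistence forecast on a simulated random walk. It uses the mean absolute percentage error as one concrete variant of a percentage error. The base level of 1000, the unit step size and the seed are assumptions for the illustration and are not taken from the data in this post.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Simulated index: a random walk around an arbitrary base level of 1000.
index = 1000.0 + np.cumsum(rng.normal(size=365))

actual = index[1:]      # value at time t+1
predicted = index[:-1]  # persistence forecast: the value at time t

# Mean absolute percentage error of the persistence forecast.
mape = 100.0 * np.mean(np.abs((actual - predicted) / actual))
print(f"MAPE of the persistence forecast: {mape:.3f} %")
```

The error is tiny simply because one-day changes are small compared to the level of the index, not because anything has been predicted.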

Stationarity and differencing of time series data.

A stationary time series is one whose statistical properties, such as mean, variance, and autocorrelation, are all constant over time. Many forecasting methods are based on the assumption that the time series can be rendered approximately stationary through the use of mathematical transformations. One such basic transformation is to time-difference the data.

Rather than considering the index directly, we calculate the difference between consecutive time steps.
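As a small illustration of this transformation, the sketch below differences a short made-up series with NumPy and then rebuilds the original values from the differences. The numbers are invented for the example.

```python
import numpy as np

# Made-up index values for five consecutive days.
index = np.array([100.0, 101.5, 101.0, 103.0, 102.0])

# Difference between consecutive time steps: index[t+1] - index[t].
diffs = np.diff(index)
print(diffs)  # one value fewer than the original series

# The transformation loses nothing: a cumulative sum of the differences,
# added to the first value, gives back the original series.
reconstructed = index[0] + np.concatenate(([0.0], np.cumsum(diffs)))
print(reconstructed)
```

A predicted difference can therefore always be turned back into a predicted index value by adding it to the last known value.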

Defining the model to predict the difference in values between time steps is a stronger test of the model’s predictive power. In that case, it can’t simply exploit the strong autocorrelation in the data and use the value at time “t” as the prediction for “t+1”. Due to this, it provides a better test of whether the model has learned anything useful from the training phase, and whether analyzing historical data can actually help the model predict future changes.

Prediction model for time-differenced data.

Being able to predict the time-differenced data, rather than the data directly, is a much stronger indication of the predictive power of the model. The figure below shows a scatter plot of the real vs. predicted values for the time-differenced data.

This figure shows that the model is not able to predict future changes based on historical events, which is the expected result since the data is generated using a completely random walk process. Predicting future outcomes of a random process is by definition not possible, and if someone claims to do this, one should be a bit skeptical.
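The contrast between the two ways of scoring can be reproduced with a persistence forecast on a simulated random walk. The sketch below is only an illustration: it does not use the LSTM model or the data from this post, and the series length, start level and seed are assumptions. Scored on the index itself, persistence gets a high R2. Scored on the changes, where persistence amounts to predicting "no change", the R2 cannot be above zero.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R2 = 1 - (sum of squared errors) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(seed=2)               # arbitrary seed
index = 100.0 + np.cumsum(rng.normal(size=1000))  # simulated random walk

# Scoring the values directly.
actual_levels = index[1:]
predicted_levels = index[:-1]  # persistence forecast
r2_levels = r2_score(actual_levels, predicted_levels)

# Scoring the changes between time steps.
actual_changes = np.diff(index)
predicted_changes = predicted_levels - index[:-1]  # all zeros: "no change"
r2_changes = r2_score(actual_changes, predicted_changes)

print(f"R2 on the index values: {r2_levels:.3f}")
print(f"R2 on the differences:  {r2_changes:.3f}")
```

The same forecast looks excellent under the first score and useless under the second, which is why the differenced data gives the more honest picture.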

Is your time series a random walk?

Your time series may be a random walk. Some signs that this is the case are:

  • The time series shows a strong temporal dependence (autocorrelation) that decays linearly or in a similar pattern.
  • The time series is non-stationary, and making it stationary shows no obviously learnable structure in the data.
  • The persistence model, which uses the observation at the previous time step as the prediction for the next time step, provides the best source of reliable predictions.

This last point is key for time series forecasting. A baseline forecast using the persistence model quickly shows whether you can do significantly better. If you can’t, you are probably dealing with a random walk, or at least something close to it. Because the human mind is hardwired to look for patterns everywhere, we must be careful that we are not fooling ourselves and wasting time by developing elaborate models for random walk processes.
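A baseline comparison of this kind can be sketched as follows. The function name `persistence_skill` and the toy numbers are hypothetical, and using the mean squared error is just one reasonable choice of metric. The score is 1 for a perfect model, 0 for a model that is no better than persistence, and negative for a model that is worse.

```python
import numpy as np

def persistence_skill(actual, model_pred):
    """Return 1 - MSE(model) / MSE(persistence baseline)."""
    actual = np.asarray(actual, dtype=float)
    model_pred = np.asarray(model_pred, dtype=float)
    # The baseline predicts actual[t-1] for time t, so the first point,
    # which has no predecessor, is left out of both error calculations.
    target = actual[1:]
    baseline_pred = actual[:-1]
    mse_model = np.mean((target - model_pred[1:]) ** 2)
    mse_baseline = np.mean((target - baseline_pred) ** 2)
    return 1.0 - mse_model / mse_baseline

actual = [1.0, 2.0, 3.0, 4.0, 5.0]

# A "model" that only repeats the previous value gains nothing.
skill_lagging = persistence_skill(actual, [1.0, 1.0, 2.0, 3.0, 4.0])

# A model that is exactly right gets the maximum score.
skill_perfect = persistence_skill(actual, actual)

print(skill_lagging, skill_perfect)
```

If an elaborate model cannot reach a clearly positive score on held-out data, it has not learned more than "tomorrow will look like today".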

Summary.

The main point I would like to emphasize is to be very careful when evaluating your model performance in terms of prediction accuracy. Predicting future outcomes is impossible for a completely random process, and yet one can easily be fooled. By simply defining a model, making some predictions and calculating common accuracy metrics, one could seemingly have a good model and decide to put it into production, whereas in reality the model might not have any predictive power at all.

If you are working with time series forecasting, and perhaps consider yourself a Data Scientist, I would urge you to put emphasis on the Scientist part as well. Don’t draw any rash conclusions from the data, ask critical questions and always be skeptical. The scientific method should be used in data science.

Source: https://nhadep247.net