Ensemble Stock Market Prediction using SVM, LSTM, and Linear Regression

—Stock forecasting is challenging because of the volatility of stock prices and their dependence on external factors, such as economic, social, and political conditions. This motivates investors to seek tools that identify stock trends to reap profits. In this research, we compared several heterogeneous ensembles for financial forecasting, including averaging, weighted, stacking, and blending ensembles, with a random forest regressor as the baseline. We used regression to predict the next day's closing stock price, and classification to label the closing stock value as HIGH or LOW by comparing it with the opening stock value of a particular company. As individual models, we used Long Short-Term Memory (LSTM) networks, linear regression, and Support Vector Machines (SVMs). Further, we analyzed 10 years of historical data of the 20 most active companies on the NASDAQ stock exchange to implement the ensemble models. In conclusion, experimental results show that blending ensembles perform best among the compared ensembles in financial forecasting. They further reveal that the SVM under-performs, LSTM outputs are satisfactory, and linear regression produces promising results.


I. INTRODUCTION
THE stock market is a collection of markets and exchanges where regular activities of buying, selling, and issuing shares of public companies take place. Modern stock exchanges use electronic communication to make trades, which increases speed and reduces costs. Typical stock values on an exchange are highly volatile. Factors such as supply and demand, interest rates, dividends, company management, the economy of the country/region, and the political climate affect stock values [1].
Investors and researchers use various methods to predict stock prices. In the modern age, machine-learning-based technologies have emerged because of their performance and accuracy in predictions [2]. A machine learning model is an entity that learns to identify patterns in data. An ensemble approach combines such individual machine learning models, and thus yields more accurate results in most cases [3]. Implementing these ensemble prediction methods as a product would allow a user to obtain reasonably accurate predictions and profit when trading on the stock market. Therefore, this is useful for investors regardless of their experience with the subject.
The primary aim of this research was to investigate heterogeneous ensemble methods for improving the accuracy of stock value prediction.

II. RELATED WORK
Most researchers have shown that ensemble methods outperform their individual component models in both regression and classification [4]. Lawrence [5] concluded that neural networks provide more accurate predictions than statistical and regression techniques. Support Vector Machines (SVMs) are widely used for stock prediction and are researched throughout the financial forecasting literature; many of the papers we reviewed employ them [1], [3], [6], [7]. For example, a set of N SVMs comprises the ensemble model proposed in [6]. Further, SVMs have been shown to surpass traditional Back-Propagation Neural Networks (BPNs) and Case-Based Reasoning (CBR) models [8], [9].
Asad [3] and Weng et al. [7] used random forest classifiers in ensembles for stock prediction. Narayanan and Govindarajan [1] used the Naïve Bayes algorithm and SVMs to build ensembles for stock prediction. Patel et al. [10] compared SVMs, Naïve Bayes, Artificial Neural Networks (ANNs), and Random Forests, and stated that the random forest outperforms the other techniques. Araújo and Ferreira [11] proposed an Evolutionary Morphological-Rank-Linear Forecasting (EMRLF) method for financial forecasting. They compared Multi-Layer Perceptron (MLP) networks and the Time-delay Added Evolutionary Forecasting (TAEF) method with their proposed method, and showed that EMRLF outperforms both MLP and TAEF.
Deep learning is currently attracting much attention in many research areas. Therefore, many researchers have explored the usage of deep learning for stock market prediction [12], [13].
Researchers commonly use econometric models, such as Generalized Auto-Regressive Conditional Heteroscedasticity (GARCH), for stock forecasting. Hajizadeh et al. [14] proposed hybrid models combining GARCH and ANNs for stock prediction. Kim and Won [15] explored the combination of econometric models with deep neural networks, integrating GARCH models with Long Short-Term Memory (LSTM) models. Further, they showed that using several GARCH models with LSTM models leads to higher performance in stock prediction.
We focus our research on investigating heterogeneous ensembles. This differs from the survey of homogeneous ensembles, such as AdaBoost and Bagging, outlined in [4].

A. SVM
The Support Vector Machine (SVM) [20] is a supervised learning model originally used for binary classification. SVMs with hard margins [20] fail for problems with noisy data and outliers. Hence, Vapnik et al. introduced SVMs with soft margins [18], [19] to resolve such problems: soft margins allow the classifier to misclassify some data points. Beyond classification, SVMs also have applications in regression. With Support Vector Regression (SVR) [21], instead of finding an optimal hyperplane that separates data points, SVR finds a function that best approximates the target values within a specified level of error tolerance.
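As an illustration of the SVR idea described above, the following is a minimal sketch with Scikit-Learn on synthetic data; the kernel choice, C, and epsilon values here are arbitrary examples, not the paper's tuned settings.

```python
# Illustrative sketch of epsilon-insensitive Support Vector Regression
# on synthetic data; parameter values are examples only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# epsilon defines the error-tolerance tube: deviations smaller than
# epsilon are ignored; C trades off flatness against tube violations.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (200,)
```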

B. LSTM
Long Short-Term Memory (LSTM) networks fall into the recurrent neural network (RNN) category. Some RNN designs have been used to predict the stock market in similar cases [22], [23]. Recurrent networks differ from traditional feed-forward networks in that their connections form cycles, which allows them to persist information across time steps.
Hochreiter and Schmidhuber [24] introduced the LSTM to tackle the vanishing gradient issue that recurrent networks face when dealing with long data sequences. LSTM has since proven to be one of the most successful RNN architectures [24].
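A minimal sketch of an LSTM regressor in Keras (the framework used later in this paper) is shown below; the window size of 30 days and the 50-neuron layer are placeholders, not the tuned values from this research.

```python
# Minimal LSTM regressor sketch in Keras; window size and neuron count
# are illustrative placeholders, not the paper's tuned hyper-parameters.
import numpy as np
from tensorflow import keras

window = 30  # days of history per sample (assumed)
model = keras.Sequential([
    keras.layers.Input(shape=(window, 1)),  # one feature: closing price
    keras.layers.LSTM(50),
    keras.layers.Dense(1),                  # next-day closing price
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(8, window, 1)  # dummy batch of 8 windows
y = np.random.rand(8, 1)
model.fit(X, y, epochs=1, verbose=0)
out = model.predict(X, verbose=0)
print(out.shape)  # (8, 1)
```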

C. Ensemble
An ensemble model combines the outcomes of individual models in such a way that the ensemble outperforms each individual model in most cases.
There are a variety of ensemble techniques. Based on the learning algorithms used, ensembles fall into two categories: homogeneous and heterogeneous. Homogeneous ensembles use a single base learning algorithm; popular techniques include bagging and boosting. In contrast, heterogeneous ensembles use different base learning algorithms; popular examples include averaging ensembles, stacking ensembles, weighted ensembles, and blending ensembles.
The averaging ensemble takes the mean of the prediction values of its individual models. The weighted ensemble linearly combines the regression outputs of the different models, where each model receives a weight based on its performance. Here, the best performing individual model gets the highest weight, and the other two individual models get lower weights. Still, the two lower-weighted models can collectively overrule the best model. For example, assume we assign a weight of 3/7 to the best model and 2/7 to each of the other two models; the two weaker models then jointly carry 4/7 of the combined prediction and can outweigh the best model.
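The averaging and weighted combinations described above can be sketched as follows; the three prediction values are hypothetical, and the 3/7 and 2/7 weights are taken from the example.

```python
# Sketch of the averaging and weighted ensembles; the individual
# predictions below are hypothetical illustrative values.
import numpy as np

# Next-day price predictions from three individual models (illustrative).
pred_lstm, pred_linreg, pred_svr = 102.0, 98.0, 97.0

# Averaging ensemble: plain mean of the three predictions.
avg = np.mean([pred_lstm, pred_linreg, pred_svr])

# Weighted ensemble: suppose linear regression performed best and
# receives 3/7, while the other two models receive 2/7 each.
weights = np.array([2 / 7, 3 / 7, 2 / 7])
weighted = np.dot(weights, [pred_lstm, pred_linreg, pred_svr])
print(round(avg, 2), round(weighted, 2))  # 99.0 98.86
```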
A stacking ensemble trains several individual models in parallel and combines them in two stages. In the first stage, we train a set of individual models on the raw data and use their predictions on that data to generate additional features. A further model at the second stage then makes the final predictions using these generated features. Models trained at the first stage are called base models or level-0 learners, while the model trained at the second stage is called the meta-model or level-1 learner. In practice, there can be multiple layers of stacking. Blending ensembles are similar to stacking ensembles, but blending uses a hold-out data set to train the meta-model, and in blending this meta-model is a linear model; with regression, it can be a linear model such as linear regression or logistic regression.

Figure 1 shows the training process of a stacking ensemble model; we show only 3 base learners for simplicity. As shown in Figure 1, the base learners train on the raw training data. We then feed the predictions from those base learners to the meta-learner as its training data. Once the entire stacking ensemble model is trained, we can make predictions on the testing data.

We carried out stock prediction as a classification task and as a regression task separately. In classification, we used the target classes 'HIGH' and 'LOW', derived by comparing a company's closing stock price with its opening stock price for a particular day: if the closing price is higher than the opening price, we labeled the day 'HIGH', and 'LOW' otherwise. We then predicted these class labels for the next day. In regression, we predicted the actual closing stock price of the next day.
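The blending procedure described above can be sketched with Scikit-Learn as follows; the synthetic data, the two base models, and the split sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a blending ensemble: base models train on the training
# split, and a linear meta-model trains on their predictions over a
# hold-out split. Data and split sizes are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=300)

# Temporal-style split: train, hold-out (for the meta-model), test.
X_tr, y_tr = X[:200], y[:200]
X_hold, y_hold = X[200:250], y[200:250]
X_te = X[250:]

base = [LinearRegression(), SVR(kernel="rbf", C=10.0)]
for m in base:
    m.fit(X_tr, y_tr)

# Meta-features: base-model predictions on the hold-out set.
Z_hold = np.column_stack([m.predict(X_hold) for m in base])
meta = LinearRegression().fit(Z_hold, y_hold)

# Final prediction: meta-model applied to base predictions on test data.
Z_te = np.column_stack([m.predict(X_te) for m in base])
final = meta.predict(Z_te)
print(final.shape)  # (50,)
```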
Poon and Granger [29] extensively reviewed volatility forecasting in financial markets and conclude that volatility measured over many past years gives accurate forecasts. Further, Hanke and Wichern [30] show there should be at least 2T to 6T data points, where T refers to the length of the seasonality. In addition, the number of data points to use depends on the variability of the data and the model being built [31].
We used a train-test split as the data splitting technique to preserve temporal data patterns. Cerqueira et al. [32] reveal that out-of-sample methods that preserve the temporal order of observations during data splits give better results.
Shynkevich et al. [33] illustrate that the forecasting step affects the prediction accuracy of SVMs; they assert SVMs achieve the highest prediction accuracy when the forecasting step equals one. Further, they [34] suggest that SVMs perform better when the forecasting horizon approximates the input window length. In addition, Chan et al. [35] reveal that shorter forecast horizons are easier to predict and give better results. Considering these facts, we forecast only one day ahead of the current date; hence, this research focuses on short-term stock prediction.
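This one-day-ahead setup, together with the HIGH/LOW labeling described earlier, can be illustrated with pandas on a few hypothetical prices:

```python
# Building a one-day-ahead regression target and HIGH/LOW labels;
# the price values below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "open":  [100.0, 101.0, 99.5, 102.0],
    "close": [101.5, 100.0, 101.0, 103.0],
})

# Regression target: the next day's closing price (one step ahead).
df["next_close"] = df["close"].shift(-1)

# Classification label: HIGH if close exceeds open, else LOW.
df["label"] = (df["close"] > df["open"]).map({True: "HIGH", False: "LOW"})
print(df["label"].tolist())  # ['HIGH', 'LOW', 'HIGH', 'HIGH']
```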
We considered several heterogeneous ensemble techniques. Some simple ensembles are the averaging ensemble and the weighted ensemble. In addition, we implemented more advanced ensembles, namely stacking ensembles and blending ensembles. For the stacking ensemble, we combined only the linear regression and SVR models; all other ensembles used all three models, namely the linear regression model, the SVR model, and the LSTM model. We used a random forest regressor ensemble as the baseline for the regression ensembles.
In terms of classification, we considered accuracy, precision, recall, and F1 score as the performance criteria, with a random forest classifier ensemble as the baseline. For regression, we used the Root Mean Squared Error (RMSE) and the R² score as performance metrics.

R² = 1 − (Residual sum of squares / Total sum of squares)

A. Raw data set
We collected raw data from NASDAQ, an American stock exchange [36]. Because of computational complexity, we selected only the 20 most active companies. Based on previous work, we considered 10 years of historical data of these companies, counted backwards from 21st September 2019, depending on availability. The raw data comprises the open, close, high, and low stock price attributes for a particular day for a specific company, while the volume column shows the total number of transactions per day. There are 40044 data points in total. A summary of the data is given in TABLE I.

B. Data Pre-processing
Data pre-processing transforms raw data to be compatible with machine learning models. Hence, in this research we applied data pre-processing techniques, such as scaling, to the raw data [37], [38].
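As one sketch of this step (the paper applies scaling before model training, as noted in the implementation section), min-max scaling can be fitted on the training split only so that no test-set information leaks into the transformation; the prices below are illustrative.

```python
# Sketch of a scaling pre-processing step: fit a MinMaxScaler on
# training data only, then transform both splits. Values illustrative.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[100.0], [110.0], [120.0]])  # e.g. closing prices
test = np.array([[115.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # fit on training data only
test_scaled = scaler.transform(test)        # avoids look-ahead leakage
print(train_scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
print(test_scaled.ravel().tolist())   # [0.75]
```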

C. Implementation
We used the Python language for the implementation. In addition, we used Python libraries such as the Keras framework with TensorFlow as the backend and Scikit-Learn for machine learning. Further, we used the NumPy and Pandas packages to manage data. We built the neural network for stock prediction based on earlier research and trial and error. We also selected the window size of the LSTM network based on previous research, but used Bayesian optimization to select the number of neurons. We chose Adam (Adaptive Moment Estimation) as the optimization method because it is efficient both memory-wise and computationally.

The major parameters to set for an SVM in classification are the kernel, C, and gamma. Smola and Schölkopf [39] recommend using Gaussian kernels when little additional knowledge about the data is available. Tay and Cao [9] show that smaller C values lead to under-fitting and larger C values lead to over-fitting, suggest that an optimal range of C values to test is from 10 to 100, and report that polynomial kernels give inferior results and take a long time to train. Considering these facts, we used the Radial Basis Function (RBF) as the kernel function and tested different C values, tuning the parameters with GridSearchCV to find optimal values.

For linear regression, we used the default values provided by Scikit-Learn. Important attributes are fit_intercept = True and normalize = False, which mean an intercept is calculated and scaling is ignored, respectively. We ignored scaling here because we scaled the data before feeding it to the linear regression model.
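The grid search described above can be sketched as follows; the C grid follows the 10 to 100 range suggested by Tay and Cao [9], while the gamma grid, the synthetic data, and the use of TimeSeriesSplit (to preserve temporal order during validation) are our assumptions for illustration.

```python
# Sketch of tuning an RBF-kernel SVR with GridSearchCV; the gamma grid
# and TimeSeriesSplit are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(120, 3))
y = X.sum(axis=1) + rng.normal(scale=0.05, size=120)

# C in the 10-100 range suggested by Tay and Cao [9].
param_grid = {"C": [10, 50, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=TimeSeriesSplit(n_splits=3),
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(sorted(search.best_params_.keys()))  # ['C', 'gamma']
```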

B. Regression
We used a neural network, linear regression, and Support Vector Regression (SVR) as the individual approaches. Tay and Cao [9] suggest that improper kernel selection in SVMs can cause longer training times and lead to poor performance. Through our experimental setup, we obtained similar results, validating the findings of [9] in terms of training times; we further observed that certain parameter combinations led to longer training times.
In this experimental setup, we found the best hyper-parameters for the SVM to be the RBF kernel with C = 100 and gamma = 0.01. However, even with those parameters, performance was low for all companies. In contrast, linear regression produced promising results. We conclude that a high positive correlation exists between the target values and the training features, which resulted in the better performance of the linear regression model: the opening and closing stock prices have an almost linear relationship. Still, linear regression fails for several companies, as illustrated in the experimental results (see TVIX). The probable reasons behind the failure of linear regression for TVIX are non-linear trends or outliers, which prevent the linear model from fitting properly. Based on our experimental setup, we establish that scaling the training features and target values is essential for training neural networks in regression.

VII. CONCLUSION
In this research, we compared several heterogeneous ensembles combining results from an ANN, an SVM, and linear regression for stock prediction. We approached the stock prediction problem both as classification and as regression. We observed that SVMs could not capture the underlying patterns in the time series data in classification; therefore, this experimental setting was not suitable for time-series classification. A reason for this failure could be the time-axis distortion problem that arises when using classical kernels, such as the Gaussian Radial Basis Function (GRBF). In particular, GRBF kernels cannot identify patterns in time series data that are shifted in time, distorted, or scaled. However, Gudmundsson et al. [41] suggest that using kernels with Dynamic Time Warping (DTW) provides better performance for such time-series data.
Through this research, we identified common pitfalls and mistakes to avoid during financial forecasting. For example, mistakes can occur when scaling the target values of financial time-series data; whether to scale or standardize target values depends on the chosen model. It is better to scale target values for neural networks, because neural networks may become unstable if large weight variations occur during training; the neural network literature also suggests doing so [37], [38].

In terms of time-series classification with SVMs, it can be tricky to choose proper hyper-parameters. Even the most commonly used kernels, such as Gaussian kernels, might not capture time-series patterns; indeed, in our experimental setup the Gaussian Radial Basis Function (GRBF) kernel could not capture the time-series pattern. Finally, improper data splitting of time series data impacts prediction performance: since time series data has temporal patterns and seasonalities, incorrect data splits may break up those hidden patterns. Hence, the usual data-splitting techniques, such as cross-validation, may not be suitable for time series data.

VIII. FUTURE WORK
We investigated only 20 companies and 10 years of data due to scope and time limitations; future research could explore the validity of the results with more data from different stock markets and time ranges. We focused on SVMs only for the classification approach; as subsequent improvements, we propose looking into other classification techniques. Further, researchers can look into non-linear regression models, because we researched only simple linear regression in this study. In terms of neural networks, we considered only Long Short-Term Memory (LSTM) networks with a few configurations; as future work, one could try more hidden layers, more neurons, various optimizers, and different activation functions, and see whether the results change. Interested researchers can even apply other neural network architectures to financial forecasting. In terms of ensembles, we examined only a few heterogeneous ensembles, and mostly used the default parameters provided by the libraries for those ensemble models. Further research can extend this to hyper-parameter tuning of the individual models and to building better ensembles.