DeepLSF: Fusing Knowledge and Data for Time Series Forecasting

Abstract—Deep Neural Networks' (DNNs') rise to fame has been meteoric. However, there is an increasing realization that relying purely on data to find solutions can be sub-optimal because, in doing so, we ignore one very important modality: knowledge. As a result, techniques based on fusing knowledge with DNNs are becoming common. However, most current fusion schemes revolve around knowledge distillation, where a student network mimics the predictions of a more complex teacher network. We argue that each modality, be it data-driven or knowledge-driven, has its own strengths, and a fusion technique should combine the individual strengths of the modalities rather than having one copy the other. Motivated by this, we present Deep Latent Space Fusion (DeepLSF) for time series forecasting, which uses features from both the knowledge and data domains in a complementary manner. We test DeepLSF on three real-world time series datasets belonging to different domains. We show that the predictions produced by DeepLSF are not only better than those made by the constituent modalities but also better than the state-of-the-art for each of the datasets by a significant margin. Leveraging the knowledge domain to complement the performance of neural networks will prove useful in many real-world application scenarios.


INTRODUCTION
Time series forecasting has always been a vital problem due to its direct impact on many real-world applications. Be it within the ambit of supply chain management, resource planning, or service and maintenance, having an accurate estimate of future values can yield higher profits, minimize losses, and enable optimal resource utilization. As a result, any improvement in the performance of forecasting methods is highly valuable.
Methods for time series forecasting can be broadly categorized into two domains: data-driven Deep Neural Networks (DNNs) and knowledge-driven Knowledge Driven Systems (KDS). In a typical DNN setting, the DNN is trained on historical data, where it learns to extract useful features and patterns from the data that are effective in reaching the solution. Recent advances in computational hardware and DNN architectures have enabled DNNs to achieve state-of-the-art (SOTA) performance not only in time series forecasting tasks [1] but also in other domains such as image classification [2], natural language processing [3], speech recognition [4], visual question answering (VQA) [5], and image captioning [6]. The performance of DNNs across different domains speaks volumes about their dexterity in extracting useful features from data. However, it is a common observation that knowledge about the problem is also important for the solution, and although DNNs excel at processing data, they completely ignore domain knowledge, which can be equally useful. Domain knowledge encompasses information obtained by experts who have a considerable understanding of the problem through relevant education and experience. As a result, a DNN is limited only to the information present in historical data [...]. We argue that in an optimal setting one network should complement the other instead of copying it. Moreover, in time series forecasting or regression tasks, the outputs are numeric estimates of future values instead of a distribution. As a result, KL divergence based knowledge sharing schemes cannot be directly employed, as there is no underlying probability distribution. Currently, there exists a huge gap between the data-driven and knowledge-driven modalities for time series forecasting tasks, and given its importance, it is imperative that a knowledge sharing framework for forecasting is developed.
On this front, we introduce a novel latent-space fusion technique, Deep Latent Space Fusion (DeepLSF), for time series forecasting that utilizes the information contained in both the data and knowledge modalities to come up with predictions. DeepLSF combines features drawn from the knowledge stream with those obtained by the DNN at the correct level of abstraction, essentially offsetting the information missing from one modality with that present in the other. We test DeepLSF on three real-world benchmark forecasting datasets belonging to different applications and having different temporal characteristics. DeepLSF not only produces better forecasts than the predictions made by either the DNN or the KDS individually, highlighting the effectiveness of its knowledge sharing capabilities, but also achieves SOTA results on all three datasets, cementing its prediction efficacy and robustness irrespective of the application domain. In particular, the contributions of this paper are:
• We introduce a latent-space fusion network that automatically encodes knowledge into an appropriate representation from the predictions made by the KDS and projects it into the intermediate layer(s) of the DNN by aligning the subspace of the knowledge representations with that of the intermediate DNN layer(s).
• We perform extensive experimentation and ablation studies to highlight the effectiveness of combining information from both the knowledge and data modalities in a multi-horizon forecasting setting using the proposed fusion network. DeepLSF achieves state-of-the-art results on every dataset tested. On average, it outperforms SOTA by 21.5% on every evaluation metric used.
The rest of the paper is organized as follows. We begin by reviewing related work in Section 2. We briefly explain the forecasting problem in Section 3. In Section 4, we provide a detailed description of the DeepLSF architecture along with a thorough explanation of each of its components: the KDS, the DNN architecture, and the latent space fusion framework. We then present the experimental settings, along with a brief description of each dataset and the results obtained on it, in Section 5. This section also includes an extensive ablation study, where we try out different settings of DeepLSF and evaluate their results. Finally, we conclude the paper in Section 6.

RELATED WORK
In this section, we introduce relevant work and background studies pertaining to knowledge sharing and multi-modal learning approaches. Integrating domain knowledge, or any sort of extra information, to boost a DNN's performance has been an active area of research in recent years, with the rationale that introducing useful human priors into the learning process can increase the performance of DNNs. Ghazvininejad et al. (2018) [16] presented a knowledge-based conversation model powered by neural networks. In addition to utilizing data containing the conversation history, they also conditioned the output of their model on external non-conversational information relevant to the current context of the conversation, using a multi-task learning approach. Their model was able to produce realistic responses that were found to be more informative and appropriate by human evaluation. Similarly, Venugopalan et al. (2016) [17] proposed a neural network based language model that also employed linguistic knowledge, mined from a considerably large text corpus, in the form of semantics to produce video descriptions. They reported significant improvement in both the grammar and the overall quality of the descriptions generated by the model. They experimented with different fusion techniques, namely early, late, and deep fusion, where the fusion stage was defined as the point at which the hidden states from the video-to-text network and the LSTM language model were concatenated. Another domain where features from different modalities are used in a multi-modal setting is Visual Question Answering (VQA). Ma et al. (2018) [18] proposed a VQA system that utilizes neural networks to learn joint embeddings from the visual and textual domains representing image and question pairs.
In addition to learning joint embeddings, they also employed a co-attention mechanism that weights every feature learned from the visual and textual domains, highlighting the relevant features in both. They claimed that the framework achieves SOTA results on two benchmark VQA datasets. All of these systems are highly dependent on expert performance, which in turn depends on the quality of the additional corpus. Moreover, although learning similar embeddings from two different modalities is useful for extracting common information, we believe that in a time series forecasting setting, focusing on features learned separately from each modality is equally important to fill in information missing from one modality. Towell et al. (1994) [19] proposed the Knowledge-Based Artificial Neural Networks (KBANN) architecture, where they utilized propositional rules for knowledge representation. These logic rules were structured hierarchically, and the neural network architecture was constructed so that it had a direct correspondence with the elements in the ruleset. Each element in the ruleset is represented by a neuron in the neural network, and the connections between the neurons are determined by the relations between these elements defined in the ruleset. To ensure that the model has some excess capacity to learn new features, additional neurons were also introduced whose weights were learned during the training phase. A similar approach was employed by Tran et al. (2018) [20], where a set of logic rules is also used in conjunction with the neural network. These approaches have the advantage of incorporating the knowledge from the knowledge base directly into the architecture of the neural network. However, this limits the flexibility of the model, as the architecture of the neural network is strictly defined by the ruleset.
This renders most of the current SOTA DNN models incompatible with the approach and thus limits the ability to use off-the-shelf architectures for the task at hand. Buda et al. (2018) [21] leveraged the capabilities of statistical models to aid a neural network in producing accurate forecasting results, which were ultimately used for anomaly detection. To further boost performance, they used a family of different statistical models whose predictions were combined in a homogeneous setup. The predictions from all the models were compared against the available ground truth to select the most plausible prediction based on the lowest Root Mean Squared Error (RMSE), which they termed "single-step merge". They also explored an alternate approach where only a single model was selected for all the predictions based on the lowest RMSE value. Since the model considers all of the predictions in isolation, it is unable to take advantage of both streams simultaneously, as opposed to DeepLSF.
The use of statistical methods as a source of knowledge in modeling time series is not uncommon. Chattha et al. (2019a) [22] proposed a residual learning scheme, called Knowledge Integrated Neural Network (KINN), where they incorporated expert knowledge, in the form of predictions, into the network by adding it to the network's output. Despite its advantages, the approach cannot be directly scaled to multi-step predictions. Moreover, the performance offered by KINN did not scale to different datasets, primarily because KINN was unable to model time series that had a strong trend in the past sequences. They improved upon KINN in DeepEx (2019) [23], where they used a separate predictor to model trends in the data. Similarly, [24] also utilized the power of statistical models to enhance the performance of their DNN, which was ultimately used for the task of anomaly detection; the proposed model offered better performance in anomaly detection tasks. Hu et al. (2016) [25] harvested expert knowledge in the form of first-order logic rules. To transfer the knowledge to the network parameters, they utilized an iterative knowledge distillation technique. They used their expert model as the teacher to train a student network comprised of a DNN based architecture. The DNN attempted to replicate the predictions made by the expert, i.e., the teacher network. They updated both the teacher and the student at each iteration of the learning process. The main goal was to find a teacher network that matches the ruleset in terms of predictions while not diverging significantly from the predictions of the student. Additionally, they minimized the KL-divergence between the distributions of the predictions made by the teacher and the student to make the two distributions similar. With the proposed framework, they were able to achieve SOTA performance in classification tasks on the evaluated datasets. Following a similar approach, Xie et al.
(2019) [26] used a noisy student-teacher learning paradigm. They initially train the teacher model, the expert in their context, on the ImageNet dataset. This teacher/expert network is then used to generate predictions on a larger dataset, JFT-300M, which comprises 300 million images. These pseudo-labeled images, along with the original ImageNet images, are passed to the student network for training. This process is repeated, with the student replacing the teacher and a new, more powerful student network being spawned. With this self-training scheme, they improved ImageNet top-1 accuracy by 1%. Since the task of the student network in this framework is to emulate the predictions made by the teacher, the strengths of the student network are ignored by the system. In contrast, DeepLSF does not force the strengths of one modality onto the other but rather works in a complementary fashion where the individual strengths of each modality are retained.

PROBLEM FORMALIZATION
In this section, we mathematically formalize the forecasting problem and explain the terminology used throughout this paper. In a typical forecasting setting using DNNs, the current value of the time series along with its lagged versions is used to predict values in the future. Formally, this is represented by a list $X_p$ containing the values $x_t, x_{t-1}, \ldots, x_{t-p}$, where $p$ is the total number of past values used for prediction and $x_t$ is the value of the time series at time $t$. The aim of a time series forecasting network is to learn a parametric model that maps past values $X_p$ to $\hat{Y}_h$, where $\hat{Y}_h$ is a list containing the predicted values $\hat{x}_{t+1}, \hat{x}_{t+2}, \ldots, \hat{x}_{t+h}$ and $h$ is the horizon, i.e., the number of future values to be predicted. This parametric mapping can be mathematically expressed as:

$$\hat{Y}_h = \Phi(X_p;\, W), \qquad W = \{W_1, W_2, \ldots, W_L\},$$

where $W$ encapsulates the parameters of the network, which is comprised of $L$ layers, and $\Phi : \mathbb{R}^{p+1} \rightarrow \mathbb{R}^{h}$ defines the mapping from the input space to the output space. The optimal parameters $W^*$ of the mapping function are computed iteratively using gradient descent. After each iteration, a loss is calculated that estimates how far the predictions given by the model are from the ground truth, and the parameters of the model are adjusted to minimize the loss. This process is repeated until the loss plateaus. Typically, Mean Squared Error (MSE) is used as the loss function, and hence the optimization problem for regression can be mathematically stated as:

$$W^{*} = \operatorname*{arg\,min}_{W} \; \frac{1}{h} \sum_{i=1}^{h} \left( Y_{t+i} - \hat{Y}_{t+i} \right)^{2},$$

where $Y_{t+i}$ and $\hat{Y}_{t+i}$ denote the ground truth and the predicted value at time $t+i$, respectively.
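The mapping $\Phi : \mathbb{R}^{p+1} \rightarrow \mathbb{R}^{h}$ and the MSE objective above can be sketched in numpy as follows. The linear map `W` and the choices `p = 4`, `h = 3` are arbitrary values for illustration only, not the paper's architecture:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error over an h-step forecast horizon."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# A toy linear "network" Phi: R^{p+1} -> R^h mapping p+1 past values to h forecasts.
rng = np.random.default_rng(0)
p, h = 4, 3
W = rng.normal(size=(h, p + 1))     # parameters of the mapping
X_p = rng.normal(size=(p + 1,))     # x_t, x_{t-1}, ..., x_{t-p}
Y_hat = W @ X_p                     # predicted values for t+1 .. t+h
```

In practice $\Phi$ would be a multi-layer network and $W$ would be updated by gradient descent on `mse_loss` until the loss plateaus, as described above.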

DEEPLSF: THE PROPOSED METHOD
In this section, we introduce the proposed DeepLSF network and discuss in detail the individual components of the architecture. The overall architecture of the network is shown in Fig. 1. DeepLSF consists of (i) a KDS, (ii) a DNN, and (iii) a latent space fusion network. The KDS utilizes the underlying knowledge to come up with predictions. The DNN is a traditional convolutional neural network, and the fusion network consists of an Encoder-Decoder based network that encodes the predictions made by the KDS into knowledge representations. The fusion network also contains a projection network that projects the knowledge representations extracted by the feature extractor network into the intermediate layers of the DNN. Since the intermediate layers of the DNN after fusion consist of features drawn from both the KDS and the DNN, the final predictions made by DeepLSF use information from both the knowledge and the data modality. In the following subsections, we further elaborate on each of the constituent networks.

Knowledge Driven System
The term 'knowledge' is tricky to define and equally difficult to store and manifest in a computer algorithm. However, knowledge can be considered a set of rules that are to be followed in order to reach a reasonable solution, and in the case of time series forecasting, rules and methods drawn from statistical theory have shown impressive results [11], [12], [13]. Hence, we also employ concepts from one of these statistical methods in our KDS. Specifically, we use the 4Theta¹ method. 4Theta draws its roots from the Theta method [27], where the time series is decomposed into theta lines. The theta line for parameter θ at point t can be obtained by Eq. 5:

$$Z_t(\theta) = \theta\, Y_t + (1 - \theta)\,(a + b\, t), \tag{5}$$

which is equivalent to scaling the second differences of the data by θ, where a and b are the intercept and slope of a linear regression of the data starting at $Y_0$. The parameter θ controls the curvature of the theta lines, with θ = 0 modeling the long-term linear trend in the time series and higher-order θ modeling the deviations from the linear trend by magnifying the local curvatures. For forecasting, these lines are extrapolated and combined. For simplicity, we mathematically represent a theta model comprising only two theta lines as follows:

$$\hat{Y}_t = \omega_0\, Z_t(0) + \omega_\theta\, Z_t(\theta), \tag{6}$$

where ω₀ and ω_θ are the weights of the theta lines modeling the linear trend and the curvature of the original data, respectively. The parameter θ can be optimized to attain forecasts with the lowest errors. 4Theta improves upon the classical Theta method by automatically adjusting for linear and non-linear trends by using multiple Theta models and selecting the best one. This offers a more generalized forecasting framework that is capable of handling more complex time series [28]. Although we have used the 4Theta method as our knowledge-based model, DeepLSF is agnostic to the underlying knowledge used in the KDS. DeepLSF only takes the predictions made by the KDS into account, regardless of its internal architecture or implementation details.

1. https://tinyurl.com/vtwfzm7p
Hence, any other technique or statistical method that is capable of coming up with predictions can be employed in DeepLSF. This will be further explained in subsection 4.3 where we explain the knowledge fusion technique in detail.
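As an illustration of the Theta decomposition underlying the KDS, the following sketch builds a classical two-line Theta forecast (θ = 0 and θ = 2). This is not the 4Theta implementation used in the paper; the equal 0.5/0.5 combination weights, the smoothing factor, and the flat simple-exponential-smoothing extrapolation are standard textbook choices assumed for the example:

```python
import numpy as np

def theta_forecast(y, horizon, theta=2.0, alpha=0.5):
    """Illustrative two-line Theta forecast (theta lines 0 and 2 by default).

    The theta-0 line is the linear regression trend; the theta line magnifies
    local curvature as theta*y + (1 - theta)*trend.  The trend is extrapolated
    linearly, the theta line by simple exponential smoothing (SES), and the
    two extrapolations are averaged with equal weights.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    b, a = np.polyfit(t, y, 1)                 # slope and intercept of the trend
    trend = a + b * t
    theta_line = theta * y + (1.0 - theta) * trend

    # Simple exponential smoothing of the theta line.
    level = theta_line[0]
    for v in theta_line[1:]:
        level = alpha * v + (1.0 - alpha) * level

    future_t = np.arange(len(y), len(y) + horizon)
    trend_fc = a + b * future_t                # extrapolated theta-0 line
    theta_fc = np.full(horizon, level)         # flat SES forecast
    return 0.5 * trend_fc + 0.5 * theta_fc
```

Any such routine could serve as the KDS, since DeepLSF consumes only the resulting predictions.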

Deep Neural Network
We employ a Convolutional Neural Network (CNN) as our data-driven method. Although the aim of this research is to develop a framework that allows knowledge sharing between two different modalities, we still spent considerable compute effort to find optimal network parameters for the DNN through an extensive grid search over a reasonable hyperparameter space. One interesting finding of the architecture search was that although Long Short-Term Memory (LSTM) networks [29] are considered the go-to architecture for time series modeling, owing to their proficiency in modeling long-term dependencies in the data, their performance was comparable to or slightly worse than that of the CNN on the datasets employed in our experiments. These findings also align with those of [30]. The final architecture for all the datasets consisted of three convolutional layers and a fully connected layer, followed by the prediction layer responsible for giving the final predictions. We used the Rectified Linear Unit (ReLU) as the activation function for the convolutional and fully-connected layers. For the traffic dataset, the architecture has 16, 16, 32, and 64 units, while for the energy and NASDAQ datasets, the number of units is 4, 32, 64, and 128. Details of the datasets are given in Section 5.1.
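For concreteness, the forward pass of a small three-conv-layer CNN of the kind described above can be sketched in numpy. The channel counts follow the traffic configuration (16, 16, 32 convolutional filters and a 64-unit fully connected layer); the kernel size, 'valid' convolutions, input window length, and random weights are assumptions made purely for illustration:

```python
import numpy as np

def conv1d(x, w):
    """x: (in_ch, length); w: (out_ch, in_ch, k) -> (out_ch, length - k + 1)."""
    out_ch, in_ch, k = w.shape
    length = x.shape[1] - k + 1
    out = np.zeros((out_ch, length))
    for o in range(out_ch):
        for i in range(length):
            out[o, i] = np.sum(w[o] * x[:, i:i + k])  # 'valid' cross-correlation
    return out

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
p, k = 12, 3                               # input window and kernel size (assumed)
x = rng.normal(size=(1, p))                # one univariate input window
w1 = rng.normal(size=(16, 1, k)) * 0.1     # traffic config: 16, 16, 32 conv filters
w2 = rng.normal(size=(16, 16, k)) * 0.1
w3 = rng.normal(size=(32, 16, k)) * 0.1
h = relu(conv1d(relu(conv1d(relu(conv1d(x, w1)), w2)), w3))
w_fc = rng.normal(size=(64, h.size)) * 0.1  # 64-unit fully connected layer
features = relu(w_fc @ h.ravel())           # features fed to the prediction layer
```

Each 'valid' convolution shortens the sequence by k - 1 samples, so the 12-sample window shrinks to 6 before the fully connected layer.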

Latent Space Fusion
When processing information, humans tend to find relatable templates and features that match information they have already seen or experienced [31]. DNNs, inspired by the human neural system, also attempt to learn general features from the data, which are then combined to reach a final prediction. However, unlike DNNs, whose flexibility is limited by the training set, humans can use information from other domains to aid their decision making process [32]. Sharing knowledge from different modalities and sources of information is perhaps one of the greatest strengths of humans, and we aim to give DNNs a similar ability. Motivated by this, we propose a novel fusion mechanism that allows knowledge sharing across different modalities in DNNs. This is done by utilizing two specialized networks, (i) a feature extractor and (ii) a projection network, as shown in Fig. 1. The job of the feature extractor network is to obtain relevant features from the KDS predictions that can then be used to supplement the features obtained by the DNN. For this, we employ an Encoder-Decoder based approach. In a typical Encoder-Decoder setting, the encoder is trained to learn compressed latent features and the decoder is trained to reconstruct the original input from these compressed features. This way, the latent features learned by the encoder are most representative of the input data, while redundant information present in the data is mitigated. We employ the Encoder-Decoder network in a similar fashion, where the predictions obtained from the KDS (Section 4.1) are first fed into the Encoder-Decoder network, which learns compressed knowledge representations. This training is done in isolation, separate from the end-to-end DeepLSF optimization. The representation learned by the encoder captures the salient information contained in the knowledge-driven method, since the decoder is trained to minimize the reconstruction loss.
Once the reconstruction loss has plateaued, the decoder is discarded and not utilized further. The weights of the encoder are frozen and not optimized any further. This can be represented mathematically as:

$$W_E^{Exp*},\, W_D^{Exp*} = \operatorname*{arg\,min}_{W_E^{Exp},\, W_D^{Exp}} \left\| y_{exp} - D\!\left( E\!\left( y_{exp};\, W_E^{Exp} \right);\, W_D^{Exp} \right) \right\|^{2} + \lambda \left( \left\| W_E^{Exp} \right\|^{2} + \left\| W_D^{Exp} \right\|^{2} \right), \tag{7}$$

where $W_E^{Exp}$ and $W_D^{Exp}$ denote the weights of the encoder and the decoder, respectively, trained using the KDS predictions $y_{exp}$. Similarly, $E$ and $D$ denote the activation functions of the encoder and the decoder, and the optimal weights obtained after training the models are denoted by $*$. The parameter $\lambda$ controls the weight of the regularization term relative to the reconstruction loss.
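A minimal sketch of this isolated encoder-decoder training, assuming linear encoder/decoder layers and plain gradient descent (the paper does not specify the layer types or optimizer):

```python
import numpy as np

def train_autoencoder(preds, d_latent=2, lam=1e-4, lr=0.02, epochs=1000, seed=0):
    """Linear encoder-decoder trained on KDS prediction vectors.

    Minimizes ||Y - D(E(Y))||^2 + lam * (||W_E||^2 + ||W_D||^2) by plain
    gradient descent, loosely mirroring Eq. 7.  DeepLSF discards the decoder
    after training; it is returned here only so the reconstruction can be
    inspected, and the encoder weights are then frozen.
    """
    Y = np.asarray(preds, dtype=float)            # (n_samples, horizon)
    rng = np.random.default_rng(seed)
    W_E = rng.normal(size=(Y.shape[1], d_latent)) * 0.1
    W_D = rng.normal(size=(d_latent, Y.shape[1])) * 0.1
    for _ in range(epochs):
        Z = Y @ W_E                               # compressed knowledge features
        R = Z @ W_D                               # reconstruction
        G = 2.0 * (R - Y) / len(Y)                # gradient of the loss w.r.t. R
        grad_D = Z.T @ G + 2.0 * lam * W_D
        grad_E = Y.T @ (G @ W_D.T) + 2.0 * lam * W_E
        W_D -= lr * grad_D
        W_E -= lr * grad_E
    return W_E, W_D
```

After training, only `W_E` is kept and frozen; its output `Y @ W_E` is what the projection network consumes.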
After the encoder has been trained, the compressed feature representations learned by it are fed into the projection network, which fuses these features with the intermediate layers of the DNN. The projection network serves two purposes. First, it projects the compressed feature vectors into the same subspace as the latent features of the intermediate DNN layer where the features from the knowledge base are being added. Second, it matches the dimensionality of the compressed feature vector with that of the hidden layer of the DNN. The projection network is trained in conjunction with the DNN in an end-to-end manner. This way, the projection network learns the optimal projection that yields the lowest prediction errors during training. We believe that integrating features from the knowledge domain at the intermediate layers of the network is beneficial since it constrains the optimization problem without relying on the end-to-end learning scheme to extract useful features from the KDS predictions. This also avoids the trivial case where the network only learns to leverage information from one stream due to poor learning of the latent space mapping. The final output of the prediction network is based on features drawn from the data by the DNN as well as on features drawn from the KDS. Mathematically, the prediction given by the prediction network can be represented as:

$$\hat{y} = \Phi_{k+1:N}\!\left( \Phi_{1:k}(x) + P\!\left( E\!\left( y_{exp};\, W_E^{Exp*} \right);\, W_P^{*} \right);\; W_{k+1:N}^{*} \right), \tag{8}$$

where $\hat{y}$ is the forecast given by the DeepLSF prediction network, $\Phi$ is the final predictor with $N$ layers, and $k$ is the layer at which the compressed knowledge representations from the KDS predictions are embedded. $W_E^{Exp*}$ and $E$ denote the optimal weights and the activation function of the encoder already trained on expert predictions in Eq. 7, $y_{exp}$ are the expert predictions, $W_P^{*}$ and $P$ denote the final weights and activation of the projection network, and $W_{k+1:N}^{*}$ denotes the optimal weights of the DNN.
The optimal weights are computed by minimizing the mean squared error between the predicted and desired forecasts, which can be represented by the following equation:

$$W^{*} = \operatorname*{arg\,min}_{W} \left\| y - \Phi\!\left( x,\, y_{exp};\, W \right) \right\|^{2},$$

where $x$ is the input, $y$ is the desired forecast, and $\Phi$ is the final predictor with $N$ layers. As can be seen from Eq. 8, the final output of the network makes use of the information contained in both the knowledge and the data streams by combining features obtained from both modalities. The parameter $k$ is a hyperparameter that dictates the layer of the DNN at which the knowledge representations are incorporated. We study the impact of using different values of $k$ in Section 5.6. For the rest of the experiments, we always fuse the features at the first layer, i.e., $k = 1$, since it gave the best results on the datasets evaluated in this paper.
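Putting the pieces of Eq. 8 together, a toy forward pass with additive fusion at layer k = 1 might look as follows; the single dense layer per stage, the dimensionalities, and the random weights are illustrative assumptions, not the trained DeepLSF parameters:

```python
import numpy as np

def fused_forward(x, y_exp, W_dnn1, W_E, W_P, W_dnn2):
    """Sketch of Eq. 8 with additive latent-space fusion at layer k = 1.

    x      : input window for the data (DNN) stream
    y_exp  : KDS (expert) predictions
    W_E    : frozen encoder weights (feature extractor)
    W_P    : projection network weights; map the encoder output into the
             subspace and dimensionality of the DNN's layer-k activations
    """
    relu = lambda z: np.maximum(z, 0.0)
    h_data = relu(W_dnn1 @ x)            # DNN features up to layer k
    h_know = W_P @ relu(W_E @ y_exp)     # projected knowledge features
    h_fused = h_data + h_know            # latent-space fusion (additive)
    return W_dnn2 @ h_fused              # remaining layers k+1 .. N

rng = np.random.default_rng(0)
p, h, d_latent, d_hidden = 8, 3, 4, 16
x = rng.normal(size=p)
y_exp = rng.normal(size=h)                        # expert forecast
W_dnn1 = rng.normal(size=(d_hidden, p)) * 0.1
W_E = rng.normal(size=(d_latent, h)) * 0.1        # frozen after Eq. 7 training
W_P = rng.normal(size=(d_hidden, d_latent)) * 0.1
W_dnn2 = rng.normal(size=(h, d_hidden)) * 0.1
y_hat = fused_forward(x, y_exp, W_dnn1, W_E, W_P, W_dnn2)
```

In DeepLSF, `W_E` stays frozen while `W_P` and the DNN weights are trained end-to-end against the MSE objective above.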

EXPERIMENTS
In this section, we evaluate DeepLSF with extensive experimentation. For evaluation, we use three widely used real-world forecasting benchmark datasets. We first briefly discuss the characteristics of the datasets, followed by the training details for each. We then demonstrate the results achieved by DeepLSF in making short- and long-term predictions and compare them with the results of the current SOTA and other baseline methods in the literature. Next, we validate the effectiveness of the proposed knowledge fusion mechanism by comparing the results of the architecture with and without knowledge fusion. Finally, we present an ablation study in which we evaluate and analyze different configurations of DeepLSF and comment on the configuration proposed in this paper.

Datasets and Data Preprocessing
In order to determine the efficacy of DeepLSF, we employ three publicly available real-world datasets for evaluation, each belonging to a different application domain. The datasets utilized are as follows. In Table 1, we list some statistics of each of the datasets along with the train, test, and validation splits. Please note that the train, test, and validation splits used in this study are the same as those used in the studies we compare against in Section 5.4, in order to keep the comparison fair.

Training Details and Preprocessing
To get training samples from each of the time series, a rolling window approach is used, where windows of size inputWindowSize and HorizonSize are used to obtain the input and label of an input-label pair, respectively. The next training sample is obtained by shifting both windows by an index of 1, and this process continues until the whole time series is covered. As a result, each time series yields (TSLength - inputWindowSize - HorizonSize + 1) samples, where TSLength is the length of the time series. Samples for the validation and test sets are obtained using the split mentioned in Table 1. An important factor that directly impacts the performance of a DNN is preprocessing. Each input sample given to the DNN can have a different scale and distribution, which may increase the difficulty of modeling the problem since, in addition to learning the input-to-output mapping, the DNN also has to learn the scale of each input. As a result, normalization of the input variables is essential for DNN based approaches to confine the scale of the input variables within a certain range. In our experiments, we employ Min-Max scaling as our normalization technique. Each input-label pair given by the rolling window approach is normalized using Min-Max scaling, which scales the range of input values to be between 0 and 1. Normalization is done for all of the datasets except the traffic dataset, where results without normalization were far better. This can be attributed to the fact that values in this particular dataset were always within a confined range, which made it possible for the network to model the time series efficiently.
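The rolling-window sample extraction and per-sample Min-Max scaling described above can be sketched as:

```python
import numpy as np

def rolling_windows(series, input_window, horizon):
    """Split a series into (input, label) pairs by shifting both windows by 1.

    Returns len(series) - input_window - horizon + 1 samples.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - input_window - horizon + 1
    X = np.stack([series[i:i + input_window] for i in range(n)])
    Y = np.stack([series[i + input_window:i + input_window + horizon]
                  for i in range(n)])
    return X, Y

def minmax_scale(x):
    """Per-sample Min-Max scaling of one input window to [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```

For example, a series of length 10 with inputWindowSize = 4 and HorizonSize = 2 yields 10 - 4 - 2 + 1 = 5 input-label pairs.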
Once all of the time series in a given dataset are converted into input-label pairs and normalized, where applicable, the training samples from each of the time series are concatenated and DeepLSF is trained in a univariate setting, where one DeepLSF model is trained on all of the time series in the dataset. The feature extractor network is also trained in a similar fashion, where a single encoder model is trained for all the time series of a particular dataset. Here too, the inputs are normalized using Min-Max scaling. During the training of DeepLSF, the initial learning rate is set to $10^{-3}$ and is decreased by a factor of 10 every time the loss plateaus. The set of parameters that gives the lowest validation loss is saved and used for the evaluation on the test set.

Evaluation Metrics and Baseline Methods
To evaluate the performance of DeepLSF, we employ two widely used evaluation metrics, namely Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). RMSE and MAE are calculated for each of the time series in the dataset and then averaged to form one representative value for the entire dataset. This can be mathematically expressed as:

$$\mathrm{RMSE} = \frac{1}{n} \sum_{j=1}^{n} \sqrt{ \frac{1}{s} \sum_{i=1}^{s} \left( Y_{t+i}^{j} - \hat{Y}_{t+i}^{j} \right)^{2} }, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{s} \sum_{i=1}^{s} \left| Y_{t+i}^{j} - \hat{Y}_{t+i}^{j} \right|,$$

where $n$ is the number of time series in the dataset, $Y_{t+i}^{j}$ is the ground truth and $\hat{Y}_{t+i}^{j}$ is the prediction of the model for series $j$ at the future moment $t+i$, and $s$ is the number of samples in the test set. These metrics are chosen primarily because of their widespread usage in the related literature, which makes it easier to compare the performance of different techniques. For the traffic dataset, we mainly compare with Spatio-Temporal Graph Convolutional Networks (STGCN) [33], Diffusion Convolutional Recurrent Neural Network (DCRNN) [36], Graph-WaveNet [37], and Spatial-Temporal Transformer Networks (STTN) [1]. Almost all of these techniques revolve around fusing graph networks with traditional neural networks. This allows merging spatial information from all the time series in the dataset with temporal information obtained from the individual time series being predicted. For datasets relating to vehicular traffic, spatial information proves useful since roads in a particular area are connected, and hence information about traffic on adjacent or connecting roads eventually leads to better predictions. Currently, STTN [1] is SOTA for the PeMSD7(M) dataset. Along with these recent models, we also compare our results with traditional statistical and DNN methods, including Linear Support Vector Regression (LSVR), AutoRegressive Integrated Moving Average (ARIMA), Feed-Forward Neural Network (FNN), and Fully-Connected LSTM (FC-LSTM).
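The averaged metrics above can be sketched as follows, assuming the predictions are arranged as an (n_series, n_samples) array:

```python
import numpy as np

def avg_rmse_mae(y_true, y_pred):
    """RMSE and MAE per time series, averaged over the n series in a dataset.

    y_true, y_pred: arrays of shape (n_series, n_samples).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.mean(np.sqrt(np.mean(err ** 2, axis=1)))  # per-series RMSE, then mean
    mae = np.mean(np.mean(np.abs(err), axis=1))         # per-series MAE, then mean
    return float(rmse), float(mae)
```

Note that averaging per-series RMSE values differs from computing one RMSE over the pooled errors, which is why the per-series form above matters for comparability.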
For Energy and NASDAQ datasets, we compared with Auto-Encoder Convolutional and Recurrent Neural Network (AECRNN) [38], Long-and Short-term time series network (LSTNet) [39] and Multi-level Construal Neural Network (MLCNN) [40]. MLCNN [40] revolves around extracting multi-level abstract representations of the raw data by using convolutional networks. These multiple representations are extracted for both near and distant future predictions which are then fused by using the Encoder-Decoder approach and scaled by using traditional Autoregression models. Currently, MLCNN [40] is SOTA on both of these datasets. Apart from these methods we also compared with classical LSTM [41] and Multi-Task CNN (MTCNN) [42] models along with Vector AutoRegressive (VAR) method.

Results
To keep the comparison with other techniques impartial, we use the same learning settings as those used in the current best performing work in the literature: the same train, validation, and test splits along with the same input window and horizon sizes. Table 2 summarizes the results of DeepLSF and the other baseline techniques mentioned in Section 5.3 for the traffic PeMSD7(M) dataset. All the models are evaluated on predictions made for 15, 30, and 45 minutes ahead. As the sampling rate of the data is 5 minutes, this translates to horizons of 3, 6, and 9, respectively, i.e., {t+3, t+6, t+9}.
DeepLSF outperforms all of the recent methods by significant margins. In particular, DeepLSF improves over the current best-performing model STTN [1] by 11.7%, 6.3%, and 32.7% on MAE and by 12.6%, 11.4%, and 33.6% on RMSE for 15-, 30-, and 45-minute-ahead prediction, respectively. Although STTN [1] takes advantage of long-term temporal information along with spatial dependencies in the data, the results show that utilizing knowledge allows DeepLSF to model time series more effectively. Table 3 shows results on the energy dataset. Here, the models are evaluated for predictions made for horizons of 3, 6, and 12, i.e. {t+3, t+6, t+12}, corresponding to values 30, 60, and 120 minutes in the future. As with the traffic dataset, DeepLSF outperforms the current state-of-the-art and other baseline methods in every setting, demonstrating its effectiveness in predicting both near and distant future values. Specifically, DeepLSF improves on the SOTA by 3.2%, 15.1%, and 22.4% on MAE and 5.4%, 5.1%, and 6.6% on RMSE. Table 4 shows the results of all evaluated models on the NASDAQ dataset. Here, the models are evaluated for horizons of 3, 6, and 12, which correspond to 3, 6, and 12 minutes in the future since the sampling rate of this dataset is 1 minute. Here as well, DeepLSF outperforms every baseline model, including the SOTA method MLCNN [40] (see Fig. 2). This demonstrates the efficacy of DeepLSF in a multi-horizon prediction setting, where the network is evaluated on predicting both near and distant values, and highlights the robustness of DeepLSF in modeling time series belonging to different domains. Fig. 3 shows the average improvement given by DeepLSF for predictions made for near, medium, and far horizons.
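The relative improvements quoted above are plain percentage reductions in error over the baseline; a small helper makes the computation explicit (the function name and the sample per-horizon error values below are purely illustrative, not taken from the tables):

```python
def error_reduction(baseline_err, our_err):
    """Percentage reduction in error relative to a baseline,
    i.e. how much lower our model's error is than the baseline's."""
    return 100.0 * (baseline_err - our_err) / baseline_err

# Hypothetical per-horizon (baseline, ours) error pairs.
pairs = [(2.0, 1.8), (2.5, 2.2), (3.1, 2.6)]
improvements = [round(error_reduction(b, o), 1) for b, o in pairs]
```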
Producing more accurate predictions for values further in the future is one of the significant advantages of DeepLSF and will prove useful in almost every domain. This is especially true in domains such as stocks and health, where more accurate estimates of the distant future enable more efficient planning, which can potentially yield higher profits and save human lives.

Effectiveness of the Knowledge Fusion Scheme
In this subsection, we evaluate and validate the effectiveness of using the proposed fusion scheme in modeling time series, which in turn contributes towards better forecasting performance.
To demonstrate this, we first train the DNN used in DeepLSF without the knowledge fusion scheme, which we refer to as the vanilla DNN. DeepLSF achieves a lower loss on the training set than the vanilla DNN, as shown by the solid blue line in Fig. 4. Similarly, DeepLSF also achieves a lower validation loss, which shows that DeepLSF is able to model the time series more effectively, as it has a better in-sample fit without overfitting.
However, achieving a lower training loss only paints half of the picture; it is also imperative that the model performs equally well on unseen data, to make sure that it can work in a real-world setting. Moreover, it is important to demonstrate that the improvement offered by DeepLSF can actually be attributed to the knowledge fusion scheme. To show this, we compare the predictions given by DeepLSF with those given by the constituent modalities, knowledge-driven and data-driven, when evaluated in isolation without the knowledge fusion scheme. For this, we compute RMSE and MAE on predictions made by the vanilla DNN, as well as by the KDS, on the test set for each of the datasets. These performance metrics are then compared with those given by DeepLSF to validate the effectiveness of the fusion mechanism in improving the prediction efficacy of the model. Table 5 shows the results given by the vanilla DNN, the KDS, and DeepLSF, along with an additional column showing the percentage improvement achieved by DeepLSF over the vanilla DNN and the KDS.
DeepLSF consistently gives better forecasting performance on all three datasets when compared to the individual predictions given by the vanilla DNN and the KDS. On average, DeepLSF performs around 16% better in MAE and 11% better in RMSE. The results validate that the knowledge fusion mechanism employed in DeepLSF is capable of fusing information from both streams in a way where information from one stream offsets the information missing from the other, i.e. both modalities are used in a complementary manner rather than one modality trying to copy the strengths of the other. This is evident, as DeepLSF gives better overall forecasting performance than either constituent network on its own. Combining the strengths of the data-driven and knowledge-driven modalities was one of the goals we initially set toward achieving a more natural fusion mechanism.

Ablation Studies
In this section, we evaluate different architectural configurations of DeepLSF in order to justify the proposed design.
DeepLSF uses a latent space fusion network to combine information from the knowledge stream at the first layer of the architecture. We believe that fusing features in the initial layers of the DNN allows the network to focus on information missing from the expert network while extracting features from the input layer. Moreover, it gives the model enough capacity to learn more complex nonlinear relationships from the fused knowledge features in later layers of the network. However, since the DNN used in DeepLSF has three convolutional layers, it is also interesting to investigate the impact on forecasting performance when the projection network fuses information into other intermediary layers, i.e. when the value of k in Eq. 8 is greater than 1. We analyze the impact of fusing at each intermediary layer individually, as well as projecting knowledge features onto all three layers of the DNN at once. Table 6 shows the results of these experiments on the traffic dataset.
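To make the fusion-at-layer-one configuration concrete, the following is a minimal NumPy sketch of a three-layer convolutional forward pass with knowledge features projected into the first layer's latent space. All names, dimensions, and the additive fusion rule are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1-D convolution of a (length,) signal with kernel w."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(len(x) - k + 1)])

def forward_with_fusion(x, z_know, conv_kernels, W_proj):
    """Sketch of fusing knowledge features at the first layer (k = 1).

    x            : raw input window, shape (T,)
    z_know       : knowledge features from the pre-trained encoder, (d,)
    conv_kernels : list of three kernels, one per conv layer
    W_proj       : projection matrix mapping z_know into the subspace of
                   the first layer's activations.
    """
    h = np.tanh(conv1d(x, conv_kernels[0]))
    # Latent-space fusion: project knowledge features into the same
    # subspace as the first layer's activations and add them.
    h = h + (W_proj @ z_know)[: len(h)]
    for w in conv_kernels[1:]:
        h = np.tanh(conv1d(h, w))
    return h
```

Fusing at a deeper layer would simply move the projection-and-add step after the second or third convolution instead; fusing at all layers would apply a separate projection at each.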
The results verify the arguments given above, as fusing information at the first layer of the DNN gives the best overall forecasting performance. Fusing at all layers of the DNN did not improve the results, which intuitively makes sense since the network already has the information that is being fused at deeper layers. However, all of these configurations, irrespective of the layer at which knowledge is fused, were still significantly better than the vanilla DNN with no knowledge fusion. In addition, it is interesting to evaluate the importance of Encoder-Decoder pre-training. We use an encoder-decoder based feature extractor in the latent space fusion module that learns to encode KDS predictions into appropriate knowledge representations. The encoder in this Encoder-Decoder network is pre-trained separately from the end-to-end DeepLSF training, as explained in Sec. 4.3. However, when using two separate modalities, the literature often leans towards latent space alignment techniques, such as in visual question answering methods, where the networks from both modalities are trained together in an end-to-end optimization setting so that their latent spaces are aligned with each other. We evaluate both settings: in the first, the encoder is pre-trained, i.e. the setting used in DeepLSF, while in the other, the encoder is trained in an end-to-end fashion along with the rest of the DeepLSF architecture. Table 7 shows the results of both scenarios. As evident from Table 7, extracting features from the knowledge stream beforehand, as proposed in DeepLSF, produces better results on both evaluation metrics.
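As a toy illustration of the first setting, the sketch below pre-trains a linear autoencoder on KDS predictions by gradient descent and returns the encoder weights, which would then be frozen during DeepLSF training. The linear model, learning rate, and latent size are our own simplifying assumptions; the paper's actual encoder-decoder is a neural network:

```python
import numpy as np

rng = np.random.default_rng(1)

def pretrain_encoder(kds_preds, d_latent=3, lr=0.01, steps=300):
    """Gradient-descent pre-training of a linear autoencoder on KDS
    predictions; returns the (to-be-frozen) encoder and the loss curve."""
    n, d = kds_preds.shape
    We = 0.1 * rng.normal(size=(d_latent, d))   # encoder weights
    Wd = 0.1 * rng.normal(size=(d, d_latent))   # decoder weights
    losses = []
    for _ in range(steps):
        z = kds_preds @ We.T          # encode KDS predictions
        recon = z @ Wd.T              # decode back to input space
        err = recon - kds_preds       # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        # Gradient steps on the squared reconstruction loss.
        Wd -= lr * err.T @ z / n
        We -= lr * (err @ Wd).T @ kds_preds / n
    return We, losses
```

In the end-to-end alternative, `We` would instead receive gradients from the forecasting loss during DeepLSF training rather than being frozen after this reconstruction step.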
This may be because extracting representative features beforehand makes the job of the end-to-end learning scheme easier, since it does not have to learn to extract useful features from the KDS in addition to finding useful features in the raw data. This finding is also in line with the results presented by Palacio et al. [44], where pre-training an Encoder-Decoder system at the input layer produced more robust results that were less prone to adversarial attacks in image classification tasks.

CONCLUSION
This paper presents DeepLSF, a novel knowledge fusion technique for time series forecasting that provides a mechanism to utilize the strengths of the knowledge and data domains by combining features from both modalities. DeepLSF achieves this with a Latent Space Fusion network that extracts relevant knowledge representations from the KDS and projects them into a subspace equivalent to that of an intermediate layer of the DNN. The prowess of the proposed architecture is tested by evaluating its forecasting performance on three real-world datasets belonging to different application domains. DeepLSF is not only able to model the time series more effectively, achieving lower training and validation losses than the baseline DNN, but also outperforms the SOTA across the board, i.e. on every dataset. This shows the efficacy of DeepLSF in time series forecasting and demonstrates the robustness of the architecture when dealing with different types of time series data from different domains. The ability of DeepLSF to automatically extract knowledge representations makes it agnostic to the underlying knowledge base used in the KDS. Moreover, the fusion network is capable of working with any off-the-shelf DNN architecture. This flexibility will prove beneficial in areas where only a certain type of knowledge base is available or a different DNN architecture is required. Leveraging knowledge to complement the features drawn by DNNs will prove useful in many critical domains, especially where expert knowledge is extremely desirable, such as the health and financial sectors.