Sustaining the Effectiveness of IoT-Driven Intrusion Detection over Time: Defeating Concept and Data Drifts

Abstract—Although existing machine learning-based intrusion detection systems in the Internet of Things (IoT) usually perform well in static environments, they struggle to preserve their performance over time in dynamic environments. Yet, the IoT is a highly dynamic and heterogeneous environment, leading to what is known as data drift and concept drift. Data drift is a phenomenon that embodies the change that happens in the relationships among the independent features, mainly due to changes in data quality over time. Concept drift is a phenomenon that depicts the change in the relationships between the input and output data of the machine learning model over time. To address data drifts, we first propose a series of data preparation steps that help improve the quality of the data and avoid inconsistencies. To counter concept drifts, we capitalize on an online deep neural network model that relies on an ensemble of varying-depth neural networks that cooperate and compete to enable the model to steadily learn and adapt as new data arrive, thus allowing for stable and long-lasting learning. Experiments conducted on a real-world IoT-based intrusion detection dataset, designed to address concept and data drifts, suggest that our solution stabilizes the performance of the intrusion detection on both the training and testing data compared to the static deep neural network model, which is widely used for intrusion detection.


INTRODUCTION
The Internet of Things (IoT) is progressively becoming part of our modern life, owing to the wide array of benefits it brings to many critical sectors such as healthcare, industrial engineering and transportation systems. The main idea of the IoT is to equip devices (e.g., wearable devices, home gadgets, etc.) with sensing and computing capabilities, enabling them to monitor the environment, collect data and process these data to come up with effective decision-making frameworks. The IoT devices are overwhelmingly characterized by limited processing, memory, storage and networking capabilities [1]. To cope with this problem, these IoT devices often need to interact with the fog and cloud computing layers to accomplish their tasks efficiently. Cloud computing enables the IoT devices to accommodate their storage and analytical needs, taking advantage of the virtually unlimited, high-performance computing and storage capabilities it provides. Nonetheless, given that cloud servers are predominantly located far from the IoT devices, communications with the cloud entail significant response times. Fog computing extends the benefits of cloud computing by bringing the data processing capabilities closer to the IoT devices [2]. The prevalent scenario involving the IoT, fog and cloud layers is fog nodes receiving data from the IoT devices, launching IoT-enabled data analytics applications to extract insights from the data, sending back the insights to the IoT devices within milliseconds, and periodically sending data summaries to the cloud for long-term historical storage. The IoT ecosystem is likely to be confronted with non-conventional security challenges. Besides the security vulnerabilities that the IoT faces due to the heterogeneity and resource limitations of the IoT devices, the interactions among the IoT, fog and cloud layers make room for additional vulnerabilities.
Specifically, each of these layers has its own vulnerabilities that entice a large number of attackers. In addition, the IoT devices and fog nodes are heterogeneous by nature, giving attackers the freedom to alter their attack strategies according to the special characteristics of every underlying IoT device and fog node type. Besides, the communications among the IoT, fog and cloud layers offer another possibility for attackers to launch a broad set of attacks. Also, the attacks that take place in any of these layers can easily propagate to the other layers through the communication channels unless effective security measures are taken. Therefore, there is a pressing need to come up with effective intrusion detection and prevention solutions that help protect the IoT ecosystem from these increasing attack threats.

Problem Statement
Plenty of intrusion detection solutions have lately been proposed for protecting the IoT against advanced attacks [3], [4], [5], [6], [7], [8], [9], [10]. Many of these solutions rely on machine/deep learning to analyze the activities and communication traces of the IoT devices. The objective is to identify and discard suspicious activities/patterns. Although these solutions usually yield high accuracy and low false positive and false negative levels in static scenarios, the performance stability of the current solutions becomes questionable in dynamic, time-evolving settings. More specifically, the predictive power of the machine learning model that is used to recognize intrusions starts to deteriorate over time, a phenomenon that is known as model drift. In case this drift is not detected and remedied in time, it can have detrimental impacts on the performance of the machine learning-based intrusion detection solution. Technically speaking, model drift can be divided into two types, i.e., concept drift and data drift.
Concept drift is the result of the statistical properties of the target class labels changing over time. In fact, a predictive machine learning model concretely amounts to designing a mapping function f that takes, as input, data samples X to predict a class label y, i.e., y = f(X). This mapping is often designed in a static fashion, which implies that the mapping learned from historical data will remain valid on new future data and that the relationships between the input and output data will remain unchanged [11]. This might hold for some scenarios but is quite unrealistic in others, wherein the relationships between the input and output data change as time evolves, leading to changes in the underlying mapping function. This implies that the predictions made by a model trained on older data would no longer hold if the same machine learning model were trained on more recent data. Intrusion detection in the IoT is one of those scenarios wherein concept drift is likely to take place due to its heterogeneous nature, causing the performance of the underlying machine learning model to significantly drop.
To better illustrate the concept drift phenomenon in an IoT-based intrusion detection scenario, let's take a real-world example. Suppose that a machine learning model was trained to identify DoS attackers in a healthcare-driven IoT ecosystem wherein wearable devices are deployed to monitor patients' conditions to detect respiratory diseases. When the machine learning model was originally trained, there was a specific idea of what a DoS attacker is and what features were relevant to identify them. For example, an IoT device that sends twenty messages within one minute to a fog node or to another IoT device was deemed to be a DoS attacker. Then, with the outbreak of COVID-19, some wearable devices might need to send more than twenty messages per minute in the highly affected areas to report abnormal respiration rates. Under this new usage pattern, sending twenty messages per minute becomes a normal behavior rather than one exclusive to DoS attackers, leading the concept of DoS attackers to drift. Thus, unless the machine learning model is updated to reflect these changes, it would predict benign machines as DoS attackers, resulting in high false positive rates.
On the other hand, data drift takes place when some change happens in the independent variables, rather than in the target variables as is the case with concept drift. Mapping the data drift phenomenon onto our previous example, it would not be the definition of a DoS attacker that changes, but rather the values of the features that we rely on to define these attackers. Data drifts can occur either due to some intentional malicious behavior or due to natural unintentional circumstances. An example of intentional malicious behavior could be attackers deciding, after observing that the intrusion detection system is quite effective in identifying their attacks, to change their attack behavior to confuse the system (e.g., switching their attack from the network layer to the application layer). In this case, the detection system would yield high false negatives, since it is not aware of these new conditions. On the other hand, the natural unintentional circumstances that could lead to data drifts include:
• Intrusion detection sensors being replaced, which leads to modifying some units of measurement (e.g., from centimeters to inches). This phenomenon is called upstream data changes.
• Data quality problems, e.g., some broken or defective sensors that report illogical values (e.g., always reading 0).
• Covariate shift, i.e., a change in the relationships among the features or in the distribution of the data (e.g., an intrusion detection system that is trained predominantly on DoS attacks, while the real dataset also contains a large proportion of data probing attacks).
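Covariate shift of the kind described above can be monitored with a simple statistical check. The sketch below (an illustration, not part of the proposed solution) uses SciPy's two-sample Kolmogorov-Smirnov test to compare the training-time distribution of a single feature against a recent window of observations; the synthetic feature values and the 0.05 significance threshold are assumptions made for the example.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_shift(train_feature, recent_feature, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test: flag a feature whose recent
    distribution differs from its training-time distribution."""
    _, p_value = ks_2samp(train_feature, recent_feature)
    return bool(p_value < alpha)  # True -> the distributions likely differ

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)    # historical feature values
drifted = rng.normal(loc=2.0, scale=1.0, size=5000)  # recent, shifted values
print(detect_covariate_shift(train, drifted))  # True: a clear drift is flagged
```

In practice, such a test would be run periodically per feature, with the threshold tuned to keep the false-alarm rate acceptable.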

Contributions
To counter data and concept drifts in machine learning-based IoT-driven intrusion detection systems, we propose a two-phase solution. In the first phase, we propose a set of data preparation steps in order to mitigate the impacts of data drifts on the intrusion detection performance. Specifically, we employ a series of methods to detect and replace missing and inconsistent values in the data. We then use the Interquartile Rule method [12] to detect outliers, followed by a Principal Component Analysis (PCA) method [13] to mitigate the curse of dimensionality. The objective of these data preparation steps is to mitigate the impacts of data drifts caused by upstream data changes and broken/defective IoT devices that report missing, illogical and inconsistent values. Thereafter, to mitigate the impacts of concept drifts, we employ an online Deep Neural Network (DNN) model [14] that automatically adapts the capacity of the neural network to improve the online predictive capability of the intrusion detection model. The online DNN begins with a shallow network that is characterized by fast convergence and then progressively and automatically evolves into a deeper model as new data are received, to learn more complex relationships. The main idea is to modify the architecture of the traditional DNN by binding each hidden layer representation to an output classifier. Thereafter, the standard Backpropagation method is replaced by a Hedge Backpropagation [15] method that assesses the performance of each output classifier at every online round. The main idea of the Hedge algorithm is to maintain weights over the different classifiers at each time period and then update these weights based on the observed loss.
The Hedge Backpropagation method combines the classifiers of varying depths with the Hedge algorithm, allowing us to train the DNN with an adaptive capacity as new data (with concept drift) arrive, while also fostering the sharing of knowledge among the shallow and deep networks. The main contributions of this paper are:
• Proposing a two-phase, long-lasting IoT-driven intrusion detection solution that provides high and sustainable performance in the presence of data and concept drift scenarios. To the best of our knowledge, our solution is the first to address the problem of sustaining the detection performance of IoT intrusion detection systems in the presence of concept and data drifts.
• Employing an online deep neural network that relies on a collection of competing and cooperating deep neural network classifiers of varying depths to foster continuous learning and adaptation of the model as new data (with concept drift) arrive.
• Adopting a series of data preparation steps (i.e., data cleaning, outlier detection and dimensionality reduction) to mitigate the impacts of data drifts on the intrusion detection performance.
We compare the performance of the proposed solution with a traditional deep neural network using a series of experiments on the DS2OS traffic traces dataset [16], which is designed to tackle model drifts in IoT-driven intrusion detection systems.

Organization
In Section 2, we discuss relevant related approaches that propose detection systems in the IoT and highlight the originality of our solution. In Section 3, we discuss the data preparation steps that are proposed to mitigate the impacts of data drifts. In Section 4, we present the details of the online deep neural network model that is proposed to address the concept drift problem. In Section 5, we explain the environment used to conduct our experiments, and present and discuss the experimental results. Finally, we summarize the main findings of the paper in Section 6.

RELATED WORK
We discuss in the following the main machine learning-based intrusion detection systems in the IoT [17], [18].
In [3], Illy et al. propose an ensemble learning approach that is implemented in a fog-to-things environment, where the anomaly detection is done first at the level of fog nodes and then the attack classification is done at the level of the cloud. In [4], the authors propose an algorithmic hybridization of intrusion detection with a Deep Belief Network (DBN). The DBN examines and detects the active malicious behavior inside the IoT networks. In [5], the authors address the problem of accurately extracting feature information from big unlabeled data to identify intrusions. They propose a Softmax classification approach that is based on an improved deep belief network, with a pre-training technique that seeks to decrease the feature space of the original intrusion data.
In [6], the authors discuss the vulnerability of IoT networks to adversarial attacks. Adversarial attacks take place when an adversarial instance is fed as an input to a machine learning model. An adversarial instance is an input that has been intentionally perturbed with the intention of confusing the machine learning model into producing erroneous predictions. A variant of the Feedforward Neural Network (FNN) known as the Self-normalizing Neural Network (SNN) is proposed to classify the attacks. The main contribution of this work is exploring how normalizing the input features of a deep learning-based IDS improves the ability of the deep learning model to resist adversarial attacks.
In [7], the authors adopt deep learning to detect distributed attacks in the Social Internet of Things. The performance of the deep model is compared with a traditional machine learning approach. Similarly, distributed attack detection is compared with a centralized detection system. The authors of [9] put forward a hierarchical intrusion detection model, the Stacked De-noising Auto-encoder Support Vector Machine (SDAE-SVM), based on a three-layer neural network. The objective is to investigate the applicability of deep learning de-noising auto-encoders to IoT security. A layer-by-layer pre-training and fine-tuning approach is implemented for dimensionality reduction. Similarly, an intelligent intrusion detection system for IoT environments is proposed in [19]. The system capitalizes on deep learning to identify malicious traffic and is presented in the form of a Security-as-a-Service solution that boosts the interoperability between the several communication protocols that could be used.
The authors of [8] design a cloud-based distributed deep learning solution for detecting and mitigating Botnet and phishing attacks in the IoT. The solution includes two components that operate in a cooperative manner, i.e., (1) a distributed Convolutional Neural Network (CNN) model that is integrated as a micro-security add-on into the IoT devices to recognize phishing and application-layer Distributed Denial of Service attacks; and (2) a temporal Long Short-Term Memory (LSTM) model that recognizes Botnet attacks and ingests embeddings from the CNN to identify distributed phishing attacks across multiple IoT devices. In [20], the authors advance a deep learning-based solution for detecting Internet of Battlefield Things (IoBT) malware through analyzing the device's Operational Code sequence. The IoBT covers the complete realization of pervasive sensing, pervasive computing, and pervasive communication in military applications. The Operational Code sequence is converted into a vector space on which a deep Eigenspace learning approach is implemented to differentiate malicious applications from benign ones. The authors of [21] put forward a Particle Deep Framework for detecting attack behavior in IoT networks. The proposed framework includes three tasks, i.e., (1) eliciting network data traces and validating their integrity; (2) capitalizing on the Particle Swarm Optimization (PSO) method to automatically tune the parameters of the deep learning network; and (3) combining the deep neural network with PSO to identify abnormal behavior in IoT-based smart homes.
Despite the importance of these approaches, they (implicitly) rely on the unrealistic assumption that the intrusion detection environment is static. Yet, in real life, many factors such as the statistical properties of the target class labels, the distribution of the data, the relationships among features and the data quality are subject to change over time. This makes these approaches quite vulnerable to concept and data drifts and thus prone to drastic performance deterioration. To address this problem, we first propose a series of data preparation steps to improve the quality of the data and to remove redundant and unnecessary dimensions, thus mitigating the impacts of data drifts. We then employ an online deep neural network that adaptively updates its capacity to improve the online predictive power of the intrusion detection model. We argue that online deep learning is essential in IoT-driven intrusion detection scenarios since the data comes in sequential order, making it unrealistic to train on the whole dataset at once.

DATA PREPARATION: MITIGATING THE IMPACTS OF DATA DRIFTS
In this section, we explain the data preparation steps that are proposed to mitigate the data drift in the proposed intrusion detection system. These steps are classified into two major categories, i.e., data cleaning and dimensionality reduction, both of which are discussed hereafter. Before we proceed with the data preparation steps, we present in Table 1 an explanation of the features contained in the used dataset to help readers better understand the preparation steps. We also present in Table 2 a breakdown of the attacks represented in the dataset along with their explanation.

Data cleaning
Data cleaning is a process that aims to handle missing and noisy values in real-world datasets. We first start with handling missing values. Two types of missing values can be distinguished in the used dataset, i.e., standard missing values and inconsistent values. The reason for having missing/inconsistent values might be some broken or defective intrusion detection sensors that report null or illogical values. Standard missing values occur when no data value is registered for a certain feature in a certain observation or when a value of "null" or "NA" is found. This type of missing values is easily detected by the Pandas open-source data analysis and manipulation tool (https://pandas.pydata.org/). In total, we detected 2,156 missing values in the feature "value" and 148 missing values in the feature "Accessed Node Type" over a total of 357,953 samples. On the other hand, inconsistent values are encountered when the value of a certain feature contradicts the feature's type (e.g., a value of "false" in a feature "age" that is supposed to only hold numeric values). To detect inconsistent values, we iterate through the values of the feature "value", which is of numeric type, and try to convert these values to numeric. If the conversion cannot be done, the value is not numeric and is thus considered inconsistent. Moving to the "Accessed Node Type" feature, which is categorical, we try to convert the values of this feature into the numeric type. If the conversion is possible, the value is identified as inconsistent. To replace the missing and inconsistent values, we employ the k nearest neighbors algorithm [22]. The algorithm first finds the k closest neighbors to the observation with missing/inconsistent data and then takes the weighted mean over these neighbors. The weight reflects the distance to the neighbor, meaning that the closer the neighbor is to the observation with missing/inconsistent values, the more weight it carries when computing the mean. The missing/inconsistent values are finally replaced with the computed weighted mean.
The next step is to detect outliers. An outlier is a data point that lies an abnormal distance from the other data points in the same feature space. The reason for having an outlier might be a variability in the measurement across sensing devices (e.g., a device being replaced) or some experimental error. Data outliers can mislead the training process, resulting in considerably less accurate models and longer training times. To detect the outliers in the used dataset, we opt for the Interquartile Rule [12]. The main advantage of this method is that it gives us, at a glance, several pieces of information about the dispersion of the dataset such as the minimum, maximum, median and quartiles, which makes it quite useful for assessing the symmetry (or skewness) of the data distribution. A quartile is a type of quantile that divides a series of data points into a set of equal parts, known as quarters. The first quartile (Q1) is the 25th percentile of the ordered data series, the second quartile (Q2) is the median, and the third quartile (Q3) is the 75th percentile. The Interquartile Rule method is composed of the following steps [23]: (1) order the data points; (2) compute Q1; (3) compute Q3; (4) compute the interquartile range IQR = Q3 − Q1; (5) compute the lower fence Q1 − 1.5 × IQR and the upper fence Q3 + 1.5 × IQR; and (6) flag any value falling outside these fences as an outlier. In the used dataset, the Interquartile Rule is only applicable to the "value" feature, whose values are continuous. Based on our results, any value smaller than −27.23 or greater than 48.05 is an outlier. In our case, no outlier has been identified.
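The cleaning steps above can be sketched as follows with pandas and scikit-learn's KNNImputer. The toy frame, its column names and its values are made up for the illustration and do not come from the DS2OS dataset; the helper "timestamp" column is an assumption added so that the nearest-neighbor distances are well defined.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame standing in for part of the dataset (values are illustrative).
df = pd.DataFrame({
    "value": ["21.5", "false", None, "20.0", "22.1", "19.8"],  # numeric feature
    "timestamp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],               # helper feature
    "accessed_node_type": ["/sensor", "/sensor", "12",
                           "/sensor", "/sensor", "/sensor"],    # categorical
})

# Numeric feature: any entry failing conversion ("false") is inconsistent
# and becomes NaN, joining the standard missing values (None).
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Categorical feature: any entry that *succeeds* numeric conversion ("12")
# is inconsistent for this feature and is nulled out.
numeric_like = pd.to_numeric(df["accessed_node_type"], errors="coerce").notna()
df.loc[numeric_like, "accessed_node_type"] = None

# Replace missing/inconsistent numeric entries with a distance-weighted
# mean over the k nearest neighbors.
imputer = KNNImputer(n_neighbors=2, weights="distance")
df[["value", "timestamp"]] = imputer.fit_transform(df[["value", "timestamp"]])

# Interquartile Rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(len(outliers))
```

On the real dataset, the same conversion, imputation and fence computation would be applied per feature with k chosen by validation.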

Dimensionality Reduction
Dimensionality reduction is the process of decreasing the number of dimensions (aka features) in the dataset. Having a large number of dimensions in the feature space can imply that the volume of that space is very large, and in turn, that the data rows we have in that space often represent a small and non-representative sample. This can negatively influence the performance of machine learning, a problem that is often referred to as the curse of dimensionality. The curse of dimensionality happens when the density of the data samples exponentially dwindles with the increase of the dimensionality. Adding features without increasing the number of training samples in parallel causes the feature space to grow and become sparser and sparser. This increased sparsity makes it easier for the machine learning model to discover a "perfect" solution, a problem that is known as overfitting. Therefore, it is often good practice to decrease the number of input features, thus decreasing the number of dimensions of the feature space. The most common method for dimensionality reduction is Principal Component Analysis (PCA) [13], which we use in our solution. PCA enables us to convert an array of n dimensions into a reduced array of p < n dimensions. This means that, in our case, we are looking to reduce the 13 dimensions of the used dataset to a number < 13 without losing much information. This is because, usually, some dimensions do not provide much information to the machine learning process. To reduce the dimensionality, the main idea of PCA is to derive the "principal components", which tell us which combinations of dimensions have the largest variance and therefore carry most of the information. Below are the steps that PCA follows to accomplish the dimensionality reduction:
• Standardize the feature values.
• Compute the covariance matrix of the features.
• Compute the eigenvectors and eigenvalues of the covariance matrix and order the eigenvectors in descending order to identify the principal components.
• Determine k, the number of principal components to be kept for the machine learning training.
In the following, we give the details of each of these steps. It is worth noting beforehand that the non-numerical variables (except for the target variable) first need to be converted into numerical ones, since PCA only works on numerical variables.

Feature Values Standardization
The objective of this step is to standardize the ranges of the values of the initial dimensions so that they contribute equally to the analysis. In particular, PCA is highly sensitive to the variances of the variables. Thus, in case the ranges of the variables differ widely, the variables with larger ranges will dominate over those with small ranges (e.g., a variable that ranges between 0 and 500 will dominate over a variable that ranges between 0 and 1), leading to biased results. Thus, standardizing the data to comparable scales is essential to prevent this problem. Technically speaking, standardization is carried out by subtracting, from each value of each dimension, the mean and dividing by the standard deviation.
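The standardization step can be sketched in a few lines of NumPy (the toy matrix is illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score each dimension: subtract its mean, divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features with very different ranges (0-2 vs 100-500).
X = np.array([[0.0, 100.0],
              [1.0, 300.0],
              [2.0, 500.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # ~[0, 0]: both dimensions now contribute equally
print(Z.std(axis=0))   # [1, 1]
```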

Covariance Matrix
This step helps us understand how the variables vary from the mean, with respect to each other, i.e., if there is any relationship between these variables. In some cases, variables are highly correlated in the sense that they comprise redundant information. Thus, computing the covariance matrix enables us to characterize these correlations. A covariance matrix is an n × n symmetric matrix (with n being the number of dimensions) whose entries are the covariances associated with all possible pairs of the variables. Having computed the values of the covariance matrix, we mainly need to look at the signs of the covariances. If the covariance is positive, this means that the two underlying variables are correlated, i.e., they increase/decrease together. If the covariance is negative, this means that the two underlying variables are inversely correlated, i.e., one variable increases when the other decreases.
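Reading correlation signs off the covariance matrix can be illustrated with NumPy on synthetic data (the three features below are constructed for the example only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
data = np.column_stack([
    x,                                         # feature 0
    2 * x + rng.normal(scale=0.1, size=1000),  # feature 1: moves with feature 0
    -x + rng.normal(scale=0.1, size=1000),     # feature 2: moves against feature 0
])

C = np.cov(data, rowvar=False)  # 3 x 3 symmetric covariance matrix
print(C[0, 1] > 0)  # True: features 0 and 1 increase/decrease together
print(C[0, 2] < 0)  # True: feature 2 decreases when feature 0 increases
```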

Eigenvectors and Eigenvalues
The eigenvectors of the covariance matrix represent the directions of the axes where most of the variance (and thus most of the information) lies, i.e., the principal components. Eigenvalues are simply the coefficients associated with the eigenvectors, quantifying the amount of variance contained in each single principal component. Thus, eigenvectors and eigenvalues always come in pairs, and their number equals the number of dimensions in the dataset. Having computed the eigenvectors and eigenvalues, the next step is to sort the eigenvectors in descending order of their eigenvalues. This sorting step gives us the principal components in order of significance. To obtain the percentage of variance contained in each principal component, we simply divide the eigenvalue of each component by the sum of all eigenvalues.
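The eigendecomposition and sorting step can be sketched as follows; the data matrix is synthetic and for illustration only. Since the covariance matrix is symmetric, `np.linalg.eigh` applies (it returns eigenvalues in ascending order, which we flip to descending):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=500)  # a nearly redundant dimension
C = np.cov(X, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Percentage of variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(explained.round(3))
```

The near-redundant fourth dimension shows up as a principal component with an explained-variance share close to zero.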

Determining the Principal Components to Keep
Principal components are designed in such a way that the first principal component captures the largest possible variance in the dataset. We plot in Figure 1 the cumulative explained variance ratio as a function of the number of principal components. This plot quantifies how much of the total 12-dimensional variance (the dataset's dimensions except for the target variable) is contained in the first N components, with N represented on the X-axis. By looking at the figure, we notice that the first two principal components account for approximately 60% of the variance and that at least 9 principal components are needed to capture 100% of the variance. Therefore, we choose to keep 9 principal components out of the 12 components that we initially had.
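The component-selection step can be sketched with scikit-learn. The synthetic 12-column matrix below is constructed so that only 9 directions carry variance; it mirrors, but does not reproduce, the situation described above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(1000, 9))
# 12 features, 3 of which are linear combinations of the others, so the
# data actually lives in a 9-dimensional subspace.
X = np.hstack([base, base[:, :3] @ rng.normal(size=(3, 3))])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches ~100%.
k = int(np.searchsorted(cumulative, 0.999) + 1)
print(k)  # 9: the redundant dimensions carry no extra variance
```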

ONLINE DEEP NEURAL NETWORK: MITIGATING THE IMPACTS OF CONCEPT DRIFT
In this section, we provide the details of the online DNN, which is mainly inspired by the work presented in [14]. The online DNN is proposed to counter the impacts of concept drift on the intrusion detection system. This is achieved by automatically adapting the capacity of the neural network to improve its performance in making online predictions in the presence of concept drifts.
Consider an online attack classification task whose aim is to learn a function F on a collection of training samples D = {(x_1, y_1), \ldots, (x_K, y_K)} that arrive in a sequential manner, with x_k being a d-dimensional sample, y_k \in \{0, 1\}^C the class label attributed to x_k, and C the number of class labels. Let the prediction for sample x_k be denoted by \hat{y}_k and the cumulative prediction error by

E_K = \frac{1}{K} \sum_{k=1}^{K} \tau(\hat{y}_k \neq y_k),

where \tau(\cdot) is an indicator function whose value is 1 if the condition holds and 0 otherwise. In order to minimize the prediction error over the sequence of K samples, we need to formulate a loss function. Let \Gamma(F(x), y) be a cross-entropy loss function. At each online classification iteration, whenever a sample x_k comes, the online DNN makes a prediction. Then, the environment uncovers the ground truth of the class label, allowing the online DNN to update the model using an online gradient descent methodology.
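As a quick illustration, the cumulative prediction error E_K reduces to the fraction of mistaken online rounds (the toy predictions below are made up):

```python
import numpy as np

def cumulative_error(y_hat, y):
    """E_K = (1/K) * sum_k 1[y_hat_k != y_k] over K online rounds."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return float(np.mean(y_hat != y))

# 2 mistakes over 4 rounds.
print(cumulative_error([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```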
The prediction function F is a series of stacked linear transformations, each of which is followed by a nonlinear activation. Given a data sample x_k as input, the (recursive) prediction function of a deep neural network with H hidden layers (l^{(1)}, \ldots, l^{(H)}) using the Hedge Backpropagation [15] is defined by Equations (1)-(3):

F(x) = \sum_{h=0}^{H} \delta^{(h)} f^{(h)}(x),   (1)

f^{(h)}(x) = \mathrm{softmax}(l^{(h)} \Phi^{(h)}), \quad h = 0, \ldots, H,   (2)

l^{(h)} = \alpha(W^{(h)} l^{(h-1)}), \quad l^{(0)} = x,   (3)

where \alpha is an activation function (e.g., sigmoid, ReLU, etc.). Equations (2) and (3) depict the feedforward process of the deep neural network. The hidden layers l^{(h)} embody the feature representations that are learnt during the training process. \Phi^{(h)} and \delta^{(h)} are newly introduced parameters and thus are not part of the traditional deep neural network; they hence need to be learnt (later in this section). The main idea of this prediction function is that, different from the original deep neural network wherein the final prediction is based solely on the feature representation l^{(H)}, the prediction is the result of a weighted combination of H + 1 classifiers learnt using the feature representations from l^{(0)} to l^{(H)}. The weight assigned to each classifier is \delta^{(h)}, which is computed using the Hedge algorithm [24]. At the first iteration, all the weights \delta are uniformly distributed, i.e., \delta^{(h)} = \frac{1}{H+1}, h = 0, \ldots, H. Then, at each iteration, classifier f^{(h)} carries out a prediction \hat{y}_k^{(h)}. Once the ground truth is uncovered, each classifier's weight is updated according to the loss experienced by the classifier as per Equation (4):

\delta_{k+1}^{(h)} = \delta_k^{(h)} \cdot \beta^{\Gamma(f^{(h)}(x_k), y_k)},   (4)

where \beta \in (0, 1) is a discount factor and \Gamma(f^{(h)}(x), y) \in (0, 1) is the loss experienced by classifier f^{(h)}. At the end of each iteration, the weights \delta are normalized so that \sum_h \delta_k^{(h)} = 1. The adaptive loss function is given in Equation (5):

\Gamma(F(x_k), y_k) = \sum_{h=0}^{H} \delta^{(h)} \Gamma(f^{(h)}(x_k), y_k).   (5)
In order to learn the parameters \Phi^{(h)} of the different classifiers, the online gradient descent methodology [25] is used as per Equation (6):

\Phi_{k+1}^{(h)} = \Phi_k^{(h)} - \eta\, \delta^{(h)} \nabla_{\Phi^{(h)}} \Gamma(f^{(h)}(x_k), y_k),   (6)

where \eta is the learning rate. The feature representation parameters W^{(h)} can be updated by backpropagating the error derivatives from each classifier f^{(h)}.
Technically speaking, by applying the online gradient descent methodology and the adaptive loss function given in Equation (5), we can update the feature representation parameters W^{(h)} as per Equation (7):

W^{(h)}_{k+1} = W^{(h)}_k - \eta \sum_{i=h}^{H} \delta^{(i)} \nabla_{W^{(h)}} \Gamma(f^{(i)}(x_k), y_k)    (7)

where \nabla_{W^{(h)}} \Gamma(f^{(i)}, y_k) is calculated through backpropagating the error derivatives of classifier f^{(i)}.
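The Hedge weight update of Equation (4), followed by the normalization step, can be sketched as follows; the per-classifier losses below are made-up numbers used only for illustration:

```python
import numpy as np

beta = 0.99                                # discount factor in (0, 1)
delta = np.full(4, 0.25)                   # current hedge weights (H + 1 = 4)
losses = np.array([0.9, 0.5, 0.2, 0.7])    # Gamma(f^(h)(x_k), y_k), each in (0, 1)

delta = delta * beta ** losses             # delta^(h) <- delta^(h) * beta^loss
delta = delta / delta.sum()                # renormalize so sum_h delta^(h) = 1

# the best-performing classifier (lowest loss) receives the largest weight
print(delta.argmax())   # prints 2
```

Because beta < 1, a larger loss shrinks a classifier's weight more, so over the iterations the ensemble shifts its mass toward the depths that are currently predicting well.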
The logic of the online DNN is further summarized in Algorithm 1.
In all, the main advantages of the online DNN model are: (1) it adjusts the effective depth of the neural network based on the performance of the classifier at each depth, achieving online learning with multiple experts; (2) it models the online learning process as an ensemble of multi-depth networks that cooperate (through the sharing of the feature representations) and compete (using the Hedge algorithm) to improve the final predictions, thus enabling long-lasting learning by allowing the model to constantly learn and adapt as more data arrive; and (3) different from the traditional online machine learning models, which undergo slow convergence in deeper networks when concept drift occurs, the proposed online DNN quickly adapts to concept drift thanks to the Hedge backpropagation method.

EXPERIMENTAL EVALUATION
In this section, we first explain the environment used to conduct our experiments and then provide experimental results and analysis.

Experimental Setup and Datasets
We implement a 16-layer online deep neural network using the Hedge backpropagation algorithm. Each hidden layer of the network comprises 100 units. The rectifier (ReLU) [26] is used as the activation function and the learning rate is set to 0.01. A classifier is assigned to each of the 15 hidden layers (i.e., every layer except the input layer), covering depths from 2 to 16. The discount factor \beta is set to 0.99. The implementation was carried out in Python using Keras (https://keras.io/) on top of TensorFlow.
The DS2OS traffic traces dataset [16] is used. The main motivation for choosing this dataset stems from the fact that it is designed to address model drifts in IoT-based intrusion detection systems. The dataset records communication traces between a multitude of IoT devices, all of which are part of a common middleware, i.e., the Distributed Smart Space Orchestration System (DS2OS). Eight types of IoT devices are considered, i.e., light controllers, thermometers, motion sensors, washing machines, batteries, thermostats, smart doors and smartphones. Each device is characterized by an address (e.g., /agent2/lightcontrol2), a type (e.g., light controller) and a location (e.g., kitchen). The dataset consists of 357,000 records spanning 13 dimensions. In Table 1, we explain the meaning of each of these dimensions. Moreover, we give in Table 2 a breakdown of the attacks represented in the dataset along with their explanation. We compare the performance of our solution with the traditional deep neural network, which has been extensively used for intrusion detection in the IoT.
To split the dataset into training and test sets, the k-fold cross-validation approach [27] is employed, with k set to 10. This approach divides the dataset into k batches, each of which is selected in turn to be the test set while the other k − 1 batches are integrated together to form the training set. Then, the accuracy of the classifier is computed by averaging the performance over all k iterations. The main advantage of cross-validation stems from its effectiveness in minimizing the bias of the classifications toward the dataset's structure, given that every data sample is part of the test set exactly once and of the training set k − 1 times.
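The splitting strategy above can be sketched as follows; the toy dataset, labels and the stand-in "classifier" (a majority-class predictor) are illustrative assumptions used only to show the k-fold loop:

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k roughly equal batches."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

X = np.arange(50)                     # toy dataset of 50 samples
y = (X % 5 == 0).astype(int)          # toy labels (10 positives, 40 negatives)

folds = k_fold_indices(len(X), k=10)
accuracies = []
for i, test_idx in enumerate(folds):
    # the other k - 1 batches form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    majority = np.bincount(y[train_idx]).argmax()        # "train" the stand-in
    accuracies.append(np.mean(y[test_idx] == majority))  # "test" on the held-out fold

# every sample appears in exactly one test fold
assert sum(len(f) for f in folds) == len(X)
print(np.mean(accuracies))            # average accuracy over the k iterations
```

Each sample lands in the test set exactly once and in the training set k − 1 times, which is what keeps the averaged estimate unbiased with respect to the dataset's structure.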

Results and Discussion
To start our comparisons, we first provide in Tables 3 and 4 the confusion matrices of our solution and the regular DNN, respectively, over the different attack types. The confusion matrix is a matrix that is used to quantify the performance of a classification model on a test data set whose true values are known. The columns of the confusion matrix represent the predicted classes and the rows represent the actual classes. The confusion matrix is used to compute the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) of a machine learning model. A true positive is a case wherein the machine learning model predicts that a certain instance is an attack of type a_i (e.g., DoS), matching the actual ground truth. A true negative occurs when an instance of type a_j ≠ a_i is correctly not classified as being of type a_i. A false negative is a case in which the machine learning model predicts an attack of type a_i as being of another type a_j ≠ a_i. A false positive is a case in which the machine learning model predicts an instance that is not of type a_i as being of type a_i.
By looking at Table 3, we notice that our solution correctly classifies most of the attacks, with only 6 misclassifications in the case of DoS attacks (classified as normal) and 2 misclassifications in the case of the malicious control attack.

To better understand the impacts of the data and concept drifts on both compared solutions, we provide in Tables 5 and 6 several performance metrics (i.e., accuracy, precision, recall and F1 score) on the training and the testing data, respectively. Accuracy is the ratio of the number of correct predictions to the total number of input data points. It is computed as per Equation (8):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (8)
Precision is the number of correct positive results divided by the number of positive results predicted by the classifier. It is given in Equation (9):

Precision = TP / (TP + FP)    (9)

Recall (or sensitivity) is the proportion of actual positives that are identified as such. Recall tries to answer the following question: of all the instances of a given attack class, how many did our model actually identify as such? It is given in Equation (10):

Recall = TP / (TP + FN)    (10)

The F1 score is the harmonic mean of precision and recall and is computed as per Equation (11):

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (11)
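Under the one-vs-rest convention described above, all four metrics can be computed directly from a confusion matrix. The sketch below uses a small made-up 3-class matrix (rows = actual classes, columns = predicted classes), not the values of Tables 3-6:

```python
import numpy as np

# made-up 3-class confusion matrix for illustration
cm = np.array([[50,  2,  3],
               [ 4, 40,  1],
               [ 0,  5, 45]])

def per_class_metrics(cm, c):
    """One-vs-rest TP/FP/FN/TN for class c, then Equations (8)-(11)."""
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp          # predicted c, but actually another class
    fn = cm[c, :].sum() - tp          # actually c, but predicted another class
    tn = cm.sum() - tp - fp - fn
    accuracy  = (tp + tn) / cm.sum()                     # Eq. (8)
    precision = tp / (tp + fp)                           # Eq. (9)
    recall    = tp / (tp + fn)                           # Eq. (10)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (11)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = per_class_metrics(cm, 0)
print(round(prec, 3), round(rec, 3))   # prints 0.926 0.909
```

Looping this function over the classes yields the per-class breakdowns reported in Tables 5 and 6.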
We notice from Table 5 that our solution achieves percentages of accuracy, precision, recall and F1 score comparable to those of the regular DNN across all the attack classes, with a slight advantage for our solution. Yet, by comparing Tables 5 and 6, we notice that the performance of the regular DNN considerably drops for all the attack classes, especially the DoS and normal classes. For example, the accuracy of detecting DoS attacks drops from 96.1% on the training data to 86.2% on the testing data. Similarly, the accuracy of detecting normal cases drops from 94.5% on the training data to 77.1% on the testing data. The reason for this significant drop is the concept and data drifts that took place in the testing data after training. On the other hand, our solution keeps a stable performance on the testing data, showing strong resilience to concept and data drifts. For example, in the case of DoS attacks, the accuracy of our solution drops only from 98.5% to 98.4%. In the case of the normal class, the accuracy of our solution drops only from 97.7% to 96.8%. These results confirm the effectiveness of the data preparation steps and the online DNN in considerably reducing the impacts of data and concept drifts, respectively.
In Fig. 2, we study in more detail the accuracy of our solution and the regular DNN on both the training and testing data, while varying the number of data instances from 55,000 to 350,000. Several experiments are performed on different test sets and the average accuracy is reported. The first observation that can be drawn from the figure is that increasing the number of data instances has a positive impact on both approaches, on both the training and testing sets. The second observation is that the accuracy gap between the training and testing data is almost negligible in our solution, compared to a large gap in the case of the regular DNN model. This again shows the strong resilience to concept and data drifts that our solution enjoys compared to the regular deep neural network.
In Fig. 3, we study the false positive (Fig. 3a) and false negative (Fig. 3b) percentages entailed by our solution and the regular DNN, while varying the number of data instances from 55,000 to 350,000. Several experiments are performed on different test sets and the average false positive and false negative rates are reported. We notice from the figure that the false positive and false negative percentages entailed by our solution are considerably lower than those of the regular DNN. With 350,000 data instances, for instance, the false negative percentage entailed by our solution is 0.58% compared to 5.98% for the regular DNN. Another conclusion that can be drawn from this figure is that the regular DNN model is more impacted by the concept drift than by the data drift, as it entails higher percentages of false positives than false negatives.

CONCLUSION
Although the current intrusion detection systems in IoT environments achieve high accuracy and low false positive and false negative rates, maintaining and stabilizing their performance over time remains questionable in the presence of concept and data drifts. In this work, we address this challenge and first propose a series of data preparation steps (i.e., data cleaning and dimensionality reduction) to mitigate the impacts of data drifts, which affect the quality of the data and risk changing the relationships among the input features. Thereafter, we employ an online deep neural network model that represents the online learning process as an ensemble of multi-depth networks that cooperate (through sharing the feature representations) and compete (using the Hedge algorithm) to enable the model to constantly learn and adapt as new data arrive, thus achieving long-lasting learning. We conduct a series of experiments on a real-world dataset that is designed to represent model drifts in IoT-based intrusion detection systems. The results suggest that our solution significantly stabilizes the performance of the intrusion detection process not only on the training data but also on the testing data in the presence of concept and data drift scenarios, compared to the static deep neural network model that is widely used in IoT-based intrusion detection. Moreover, the results suggest that our solution reduces the false positive rate by approximately 6% and the false negative rate by approximately 4.5%, compared to the static deep neural network model.