SALAD: An Exploration of Split Active Learning based Unsupervised Network Data Stream Anomaly Detection using Autoencoders

Machine learning based intrusion detection systems monitor network data streams for cyber attacks. Challenges in this space include detection of unknown attacks, adaptation to changes in the data stream such as changes in underlying behaviour, the human cost of labeling data to retrain the machine learning model and the processing and memory constraints of a real-time data stream. Failure to manage the aforementioned factors could result in missed attacks, degraded detection performance, unnecessary expense or delayed detection times. This research evaluated autoencoders, a type of feed-forward neural network, as online anomaly detectors for network data streams. The autoencoder method was combined with an active learning strategy to further reduce labeling cost and speed up training and adaptation times, resulting in a proposed Split Active Learning Anomaly Detector (SALAD) method. The proposed method was evaluated with the NSL-KDD, KDD Cup 1999, and UNSW-NB15 data sets, using the scikit-multiflow framework. Results demonstrated that a novel Adaptive Anomaly Threshold method, combined with a split active learning strategy offered superior anomaly detection performance with a labeling budget of just 20%, significantly reducing the required human expertise to annotate the network data. Processing times of the autoencoder anomaly detector method were demonstrated to be significantly lower than traditional online learning methods, allowing for greatly improved responsiveness to attacks occurring in real time. Future research areas are applying unsupervised threshold methods, multi-label classification, sample annotation, and hybrid intrusion detection.


INTRODUCTION
Intrusion Detection Systems (IDS) monitor a computer network for cyber attacks. Traditional intrusion detection techniques rely on human subject matter experts to carefully produce signatures that can accurately detect a cyber attack at the network layer. For over a decade research has focused on improving IDS with machine learning (ML) methods in order to reduce the overall demand for human effort [1]. The majority of this research has centred around misuse detection, whereby the ML based IDS is trained using a data set in which all cyber attacks are labeled. This has two drawbacks: only the labeled attacks are known to the model, so unknown or new attacks are missed, and labeling the initial data set is a time consuming and complex task prone to human error. An alternative to misuse detection is an anomaly detector, whereby only the 'normal' network data is learned and any significant deviation is treated as an anomaly, meaning that new attacks can be detected. A challenge with this approach is the potential for false positives.
IDS capture network packet data directly from the network, requiring efficient real-time processing of each new packet as part of a continuous data stream. This network data stream is non-stationary and can change over time, a characteristic known as concept drift, which requires the ML model to adapt so that detection performance is not degraded [2]. Adaptation requires detecting a change in the posterior probability of a class label, necessitating the ground truth to be known. Active learning (AL) is an attempt to lower the labeling cost, and speed up the adaptation times, of change detection by employing uncertainty or random strategies according to a labeling budget [3]. A hypothesis that this research aims to test is that anomaly detectors monitoring non-stationary network data streams will experience increased false positives over time, which can be corrected by applying adaptation techniques to update the anomaly detector. This is expanded by a further hypothesis that active learning strategies can provide good adaptation with minimal labeling cost, and reduced learning times, for anomaly detection.
Unsupervised learning allows for a model to be trained without all the class labels being known, typically achieved by learning a representation of the underlying data structure. Common unsupervised techniques, such as clustering, are impeded by high degrees of time complexity and memory usage [4]. Models based on neural networking are gaining increased attention in the IDS field and a type of feed-forward neural network, the autoencoder, is able to learn the representation of data without class labels by encoding a latent representation of the data, which can be utilised for anomaly detection by calculating the error of the decoded output from the original, and comparing to a predetermined anomaly threshold [5]. This research aims to test the hypothesis that autoencoders provide an effective online anomaly detector for network data streams when combined with active learning methods.
The remainder of this paper is organised as follows: Section 2 introduces related work; Section 3 describes the proposed Split Active Learning Anomaly Detector (SALAD) method; Section 4 presents the evaluation results; Section 5 discusses how SALAD provides a low cost anomaly detector for network data streams; and Section 6 presents conclusions.

Neural Networking Anomaly Detection
Intrusion detection systems can be either anomaly based or misuse based, where the former learns the normal behaviour and detects deviations, allowing for detection of previously unseen, unknown attacks, and the latter learns known attack signatures resulting in high levels of detection accuracy [6]. A challenge with network data streams is that they generate large volumes of data that become increasingly expensive for a human expert to analyse and correctly label. Anomaly detectors are beneficial because they only need to learn the representation of a single 'normal' class from which anomalies can be distinguished, meaning that new, previously unseen, attacks can be detected without requiring new data labels and re-training of the model [6]. Unsupervised machine learning methods are well suited to the anomaly detection task as they can learn the representation of the underlying data to determine normal and anomaly classes [6], as well as learning useful features that better separate the classes. Buczak and Guven [1] have provided a comprehensive survey of IDS machine learning techniques, including anomaly detection; in most cases misuse and anomaly detection are combined into a hybrid system. This review briefly introduces recent studies within the unsupervised anomaly detection space, adopting neural networking methods familiar to the visual processing area, for comparison to the proposed approach.
Alrawashdeh and Purdy [7] evaluated Restricted Boltzmann Machines (RBM) arranged into a deep belief network combined with a logistic regression classifier trained using back propagation. Although the study claims to be 'anomaly' based, the model is actually trained to identify known classes and so is closer to a 'misuse' based approach. The accuracy of their model, with the 10% KDD Cup 1999 data set, is 97.91% [7]. The authors further build on their work by replacing the RBM activation function with a novel 'Adaptive Linear Function' (ALF) for intrusion detection with the aim of improving accuracy and convergence time [8]. Evaluated with the KDD Cup 1999 and NSL-KDD data sets, the accuracy was 98.59% and 96.2% respectively [8].
Roshan et al. [9] proposed a novel intrusion detection approach using a Clustering Extreme Learning Machine (CLUS-ELM) method. This method allows for both unsupervised and supervised updates to the model, using a decision maker element to perform informed change detection based on the cluster output. In this design, unsupervised refers to guessing the correct cluster for a given data sample as opposed to being told the label by a 'human expert'. The mean square error calculation used by the decision maker will still require the ground truth to be known. Results were evaluated using the NSL-KDD data set, with a detection rate for known attacks of 84% and 81% for unsupervised and supervised modes respectively, and 77% and 84% for unknown attacks, where the false positive rate was less than 3% [9]. The authors remark that the better unsupervised detection rates for known attacks compared to the supervised ones are unexpected and could be due to inaccuracies in the NSL-KDD data set [9].
Chen, Cao and Mai [10] proposed an offline anomaly detection method whereby Convolutional Neural Networks (CNN) are used to extract features which are then condensed into a spherical hyperplane by a deep Support Vector Data Description (deep-SVDD) technique. The method is trained on normal samples only so that such normal samples concentrate around the center of the sphere and attack samples concentrate on the outside as outliers allowing them to be detected as a one-class anomaly detector. Their method was evaluated with the KDD Cup 1999 data set, achieving an accuracy of 96% when all attack types are present.
Hassan et al. [11] proposed a combined CNN for feature reduction and Weight Dropped, Long Short Term Memory (WDLSTM) network for representation of dependencies among features, using the connection drop out regularisation method. The proposed supervised learning network was evaluated with the UNSW-NB15 data set, returning an F1-Score of 0.88 for abnormal samples and overall accuracy of 97.17% via offline holdout training.
The reviewed studies all demonstrate different network topologies for cyber intrusion detection, all of which have elements of supervised learning and traditional offline batch training. They do not address the problem of a truly unsupervised anomaly detector for online data streams as will be explored in this paper.

Autoencoder Anomaly Detection
An autoencoder is a type of feed-forward neural network that uses an encoding function to produce a latent code representation of the input data, and a decoding function to reconstruct the input from the code representation [12]. The mean square error between the reconstructed output and original input can be calculated using equation 1, where f is the encoding function and g is the decoding function [12], which can then be compared to an anomaly threshold to label a sample as either normal or anomalous.
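The reconstruction error check described above can be sketched as follows. This is a minimal illustration, assuming the error is the mean of the squared element-wise differences between the input x and its reconstruction g(f(x)); the names `reconstruction_error`, `is_anomaly`, `toy_encode` and `toy_decode` are hypothetical, and the toy encoder and decoder merely stand in for a trained network.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def reconstruction_error(x: Vector, encode: Callable, decode: Callable) -> float:
    """Mean squared error between x and its reconstruction decode(encode(x))."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_anomaly(x: Vector, encode: Callable, decode: Callable, threshold: float) -> bool:
    """Label a sample as anomalous when its error exceeds the anomaly threshold."""
    return reconstruction_error(x, encode, decode) > threshold


# Hypothetical stand-ins for a trained autoencoder: the latent code keeps
# only the first feature, and the decoder fills the second with zero.
def toy_encode(x: Vector) -> list:
    return [x[0]]


def toy_decode(code: Vector) -> list:
    return [code[0], 0.0]
```

A sample that the toy network reconstructs exactly scores zero, while a sample whose second feature is non-zero scores above zero and is flagged once the error exceeds the chosen threshold.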
In our previous work [12], we reviewed autoencoder based anomaly intrusion detection methods, whereby single layer denoising models [13], Long Short Term Memory (LSTM), Recurrent Neural Network [14], [15], ensembled stacked autoencoders [16], [17], and sparsely connected networks [18], [15] were demonstrated across a range of IDS data sets. Vaiyapuri and Binbusayyis [19] evaluated a number of autoencoder network architectures for anomaly detection, finding the use of a contractive penalty to regulate the network provided the best performance when evaluated offline using the NSL-KDD and UNSW-NB15 data sets.
A number of methods were proposed in the literature to determine the anomaly threshold, an important parameter in deciding whether to label a sample as a positive detection. The threshold can be set to the average RE value observed during training [19]. Naïve Anomaly Threshold (NAT) sets the threshold at the maximum observed RE during training [16]. Stochastic Anomaly Threshold (SAT) [13] sets the threshold based on the best observed accuracy when stepping through threshold values between the mean and 3 * standard deviation of the normal sample distribution. Nicolau and McDermott [13] proposed an anomaly threshold method using Kernel Density Estimation.
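The threshold rules above can be compared with a short sketch. It assumes that label 0 means normal and 1 means anomaly, and that the stepped search simply scans an evenly spaced range supplied by the caller; the function names and the `steps` parameter are illustrative and are not taken from the cited works.

```python
from statistics import mean


def mean_threshold(normal_res):
    """Threshold at the average RE observed on normal training samples."""
    return mean(normal_res)


def naive_threshold(normal_res):
    """Naive rule: threshold at the maximum RE observed during training."""
    return max(normal_res)


def accuracy(res, labels, threshold):
    """Fraction of samples whose thresholded RE matches the label (1 = anomaly)."""
    hits = sum((re > threshold) == bool(y) for re, y in zip(res, labels))
    return hits / len(labels)


def stepped_threshold(res, labels, low, high, steps=50):
    """Scan evenly spaced candidates in [low, high], keeping the most accurate."""
    best_t, best_acc = low, accuracy(res, labels, low)
    for k in range(1, steps + 1):
        t = low + (high - low) * k / steps
        acc = accuracy(res, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

The stepped search needs labels for both classes, whereas the mean and naive rules only need normal samples.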
Aiming to find an optimal network configuration, we evaluated in [12], an undercomplete autoencoder, regulated with connection dropout, with a prequential online test using the KDD Cup 1999 and UNSW-NB15 data sets. Applying a single layer autoencoder with dropout probability of 0.1, using the Stochastic Anomaly Threshold method, provided an accuracy of 98% and F1-score of 0.812, using the KDD Cup 1999 data set, with a significantly improved running time compared to traditional Naïve Bayes (NB) and Hoeffding Adaptive Tree (HAT) online methods. Evaluation on the UNSW-NB15 data set using a 3-layer network and dropout probability of 0.2 returned an accuracy of 79.1% and F1-score of 0.703. The results showed that the SAT threshold performed better than the NAT, and that more complex data sets benefit from experimenting with the number of layers and regularisation of the network.

Concept Drift Detection with Active Learning
Non-stationary network data streams may experience real concept drift [2], whereby the posterior probability of classes will change over time due to changes in network behaviors, the cause of which could be either benign or adversarial in nature. The posterior probability is defined as p(y|X), which represents the probability of class y given an observation X [2]. Autoencoders determine outliers using the RE-score, based on the hypothesis that adversarial behaviour deviates from the learned 'normal' representation, resulting in scores above the anomaly threshold. Real concept drift presents the challenge that the aforementioned hypothesis will weaken over time, with changing benign data also scoring above threshold, raising the false positive rate. Increasing the anomaly threshold is not recommended: although the false positive rate may fall, the false negative rate could increase. The hypothesis of this research was that a change in underlying 'benign' network behaviour will result in a raised false positive rate and that learning the representation of the new behaviour will remedy this effect. Note that the change in benign activity could be from an unplanned change such as a network fault, in which case the usefulness of the anomaly detector is extended to a fault detector; however, for the purposes of this research this will not be considered further.
Change detection is a set of methods that proactively monitor the data stream for concept drift [2]. Traditional methods such as adaptive windowing and statistical process control (SPC) [2] rely on fully supervised labels and are therefore not well suited to applications where data labeling is expensive, such as network data streams. Moreover, unsupervised techniques that rely solely on monitoring a change compared to a reference distribution will not always detect real concept drift [20]. Sethi and Kantardzic [21] proposed a semi-supervised Margin Density Drift Detector (MD3) to reduce labeling costs through an active learning approach. First, using an unsupervised method, samples that fall below an uncertainty threshold are added to the margin. The density of the margin is then compared to a training reference distribution to detect drift, before the drift is confirmed by testing accuracy with data labels. Sensitivity can be adjusted through a varying factor of the reference distribution's standard deviation. A fading factor is utilised to give greater importance to more recent samples within a moving average of margin density [21]. MD3 can work with ensembles, calculating if a sample should be included within the margin by comparing the distance between the mean predicted class probabilities to the margin threshold (θ), given by equation 2. A possible benefit of this approach would be that the change in density of uncertain samples that are borderline outliers could indicate a concept drift that requires further analysis, prompting further action such as re-training. As the anomaly detector only requires labeled normal data to re-train, this would be a cheaper approach than other methods that require fully labeled data. A possible drawback is that frequent drifts could demand increased human expertise.
Evaluation with the NSL-KDD data set reported an accuracy of 89.4% and 89.9% using the SVM and random subspace ensemble methods, respectively, where the first 15% of the data stream is used as a training set. The total labeling cost was 7.9%.
Shan et al. [22] also proposed an AL change detection strategy based on margin uncertainty, 'OALEnsemble'. In this approach the ensemble members are trained on different windows of the data set, with a stable classifier and a series of short window 'dynamic' classifiers that are continually replaced as new blocks of the data stream are processed, to balance the detection of both sudden and gradual concept drifts. Similar to [21], labeling is restricted to samples within the uncertainty margin, with the addition of a random labeling algorithm to randomly include samples outside of the margin where drift may also be occurring [22]. The stable classifier is incrementally trained with all new data, whilst dynamic classifiers are only trained on the most recent block and given a weight, providing importance to more recent data [22]. The incremental update of the stable classifier is restricted to models that feature local replacement such as very fast decision trees (VFDT) [2], and so would not be appropriate for autoencoder methods. The labeling rate is constrained by proactively adjusting the sensitivity threshold in order to manage the cost of the algorithm during periods of high uncertainty. Random sampling is desirable as it enables the classifier to be trained from the whole distribution, reducing bias [3]. The idea of gradually retraining the autoencoders with new 'normal' data in response to concept drift, whilst retaining the previous models for a period of time and moderating their importance with a weight scheme, could allow for the detection of both gradual and sudden changes in benign behaviour. However, the problem of global replacement must be carefully considered, as training on small data sets could degrade the autoencoder's ability to represent normal data.
Dang [23] evaluated AL for IDS, using a novel strategy with the Naïve Bayes classifier, selecting instances with the greatest distance from the population distribution of probabilities under the hypothesis that a bigger change of P(A|B) reflects a rare event that should be learned. The method was evaluated with the CICIDS 2012 data set, achieving an AUC-score of 90% compared to 85% with the uncertainty strategy with 10% of labeled data, and performance decreasing beyond this. The author argues that this indicates that good quality data is more important than larger volumes of data [23]. It may also be true that the method reduces class imbalance by proactively sampling examples with weaker performance that could reflect minority classes.
Zhang et al. [24] evaluated an Open-CNN method trained by AL labeling the 'unknown' detected attacks. Accuracy with the CTU data set was near equivalent to 100% label cost at just 1% of labeled attacks using an uncertainty strategy, demonstrating that only a low label cost is necessary to train the ML model.
Žliobaitė et al. [3] discussed three requirements for AL strategies: 1) balance the labeling budget over time, 2) detect changes anywhere within the problem space and 3) preserve the distribution for unbiased change detection. A number of strategies were evaluated against these requirements, including fixed uncertainty as demonstrated by [21], and uncertainty with randomisation, whereby the sensitivity threshold is randomly selected from a standard distribution to occasionally include samples outside of the uncertainty margin. Fixed uncertainty is only able to satisfy requirement one, and randomised uncertainty satisfies requirements one and two, but neither can preserve the probability density of labeled data compared to the original distribution, which can bias the model [3]. A further split strategy is introduced which satisfies all three requirements by splitting the data stream into two, using the uncertainty and random strategies exclusively on either stream. Both streams are used for training, but only the randomised stream is used for change detection [3]. Shan et al. [22] present a split strategy, although in this approach adaptation is blind, based on incrementally updating the ensemble members with both uncertainty and random labels. As it offers no proactive change detection, this could reduce overall adaptation speeds [2].
An objective of this research was to satisfy all three AL requirements outlined by Žliobaitė et al. [3]. MD3 [21] will be biased towards uncertain samples and will miss change occurring outside of the margin, which will affect overall detection performance. The work of Shan et al. [22] could be further improved by applying a proactive change detection method to the randomly labeled data, as suggested by Žliobaitė et al. [3], in order to reduce adaptation time. In this research random, uncertainty, variable uncertainty, split and blind strategies are compared. The proposed hypothesis is that only the split strategy with an informed change detection approach will be able to satisfy all three requirements, and that the change detection approach will offer faster adaptation times than a blind approach. The informed approach can use a well known change detector such as the Drift Detection Method (DDM) [25] to monitor the classification error of the anomaly detector.

METHODS
The aim of this research was to explore whether autoencoders can provide a low cost online anomaly detection solution when combined with AL methods. In our previous work [12] we evaluated dropout probability, NAT with decay and SAT anomaly thresholds, and single vs stacked network structure, to find optimal autoencoder parameters. Building on this work, in this paper, we further introduced a novel Adaptive Anomaly Threshold (AAT) method and also evaluated an AL based Active Stream Framework (ASF) [3] with which we compared blind, random, uncertainty, variable uncertainty and split AL strategies. The uncertainty strategy was adapted for use with autoencoders using a novel distance from RE method. All methods were evaluated using a prequential, interleaved test-then-train method [2], whereby the model is first tested on a previously unseen sample before training in a chunk wise fashion [12], after an initial period of pre-training. Results were compared against traditional Naïve Bayes (NB) and Hoeffding Adaptive Tree (HAT) online learning methods using the KDD Cup 1999 10% [26] and UNSW-NB15 [27] data sets.
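The interleaved test-then-train procedure can be sketched as below. This is a generic illustration rather than the scikit-multiflow evaluator used in the study; `prequential_accuracy` and the toy `MajorityClass` model are hypothetical names, and the stream is assumed to be a list of (x, y) pairs.

```python
def prequential_accuracy(stream, model, pretrain_size, chunk_size):
    """Interleaved test-then-train: score each chunk before learning from it."""
    pretrain, rest = stream[:pretrain_size], stream[pretrain_size:]
    model.partial_fit(pretrain)  # initial period of pre-training
    correct = total = 0
    for start in range(0, len(rest), chunk_size):
        chunk = rest[start:start + chunk_size]
        for x, y in chunk:  # test on samples the model has not yet seen
            correct += int(model.predict(x) == y)
            total += 1
        model.partial_fit(chunk)  # then train on the same chunk
    return correct / total if total else 0.0


class MajorityClass:
    """Toy model (hypothetical) that predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def partial_fit(self, samples):
        for _, y in samples:
            self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
```

Because every sample is scored before it is used for training, the reported accuracy reflects performance on unseen data throughout the stream.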
Observed metrics during evaluation included: accuracy, F1-score, kappa and total running time. For prequential evaluation the scikit-multiflow default of updating evaluation metrics every 200 samples was used.

Adaptive Anomaly Threshold
From evaluating the make up of the data stream and the performance achieved with both the NAT and SAT threshold methods [12], a proposed hypothesis was that chunks of the data stream that contain only normal samples benefit from a naïve approach whereby the maximum RE is used; all samples then fall below this value, giving an accuracy of 100%. For anomaly samples the second hypothesis was that between the maximum value and the mean observed RE a threshold can be found that best splits normal and anomaly samples, similar to the stochastic approach. A third hypothesis was that the mean RE will change over time due to concept drift, and so will become less sensitive to more recent samples when taken over a long stream.
To address the above three hypotheses an 'Adaptive Anomaly Threshold' (AAT) method was proposed that combines the NAT, SAT and Fading Factor [30] methods. The proposed method is given in algorithm 1. Normal samples were used to update the fading average RE-score over the stream, using a fading factor α [30] in order to give more importance to more recent sample values, satisfying hypothesis 3 above. The maximum RE of normal samples over the data stream is also recorded and used to find the first value of the anomaly threshold φ. If the initial maximum value of φ achieved an accuracy of 1.0 or 100%, then this fulfilled the first hypothesis that all samples are normal and no further action was required. Otherwise hypothesis 2 was assumed and a stochastic approach was used to step through potential threshold values until the highest accuracy was found.

Algorithm 1: Adaptive Anomaly Threshold
Input: autoencoder m, X, y, threshold φ, step size v > 0, fading factor α
Output: φ
/* Initialise fading sum, fading increment, and max RE variables */
S_0 ← 0; N_0 ← 0; RE_max ← 0;
/* Find the fading mean RE of normal samples */
X_{y←0} ⊆ X;

The proposed autoencoder anomaly detector is depicted in figure 1. The sample X is inputted to the autoencoder network from which a Reconstruction Error (RE) is produced based on the loss value between the approximate output and the original input. The RE is compared to an anomaly threshold value with samples scoring above threshold being labeled as an 'anomaly' and those below being 'normal' or benign. If a label Y is provided then the anomaly threshold is updated using a novel adaptive anomaly threshold method, which also maintains a memory of the population mean RE throughout the data stream by using a fading factor [2] memory mechanism to prioritise more recent samples for faster adaptation. The adaptive anomaly threshold is demonstrated to be superior to fixed and other threshold determination methods from the literature. Note that the use of labels to find the anomaly threshold results in a semi-supervised method.
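The adaptive threshold update can be sketched as follows. This is one possible reading of the prose description only: it assumes the fading mean is a fading sum divided by a fading count, that candidate thresholds step down from the maximum normal RE towards the fading mean in steps of v, and that label 0 means normal. The class and attribute names are hypothetical.

```python
class AdaptiveAnomalyThreshold:
    """Sketch of an adaptive threshold; an interpretation, not the authors' code."""

    def __init__(self, step=0.01, fading=0.99):
        if step <= 0:
            raise ValueError("step size must be positive")
        self.step = step
        self.fading = fading
        self.fading_sum = 0.0    # fading sum of normal RE values
        self.fading_count = 0.0  # fading increment (effective sample count)
        self.re_max = 0.0        # maximum RE seen on normal samples
        self.threshold = 0.0

    @staticmethod
    def accuracy(res, labels, threshold):
        hits = sum((re > threshold) == bool(y) for re, y in zip(res, labels))
        return hits / len(labels)

    def fading_mean(self):
        return self.fading_sum / self.fading_count if self.fading_count else 0.0

    def update(self, res, labels):
        """Update the threshold from one labeled chunk of reconstruction errors."""
        for re, y in zip(res, labels):
            if y == 0:  # only normal samples update the running statistics
                self.fading_sum = re + self.fading * self.fading_sum
                self.fading_count = 1.0 + self.fading * self.fading_count
                self.re_max = max(self.re_max, re)
        # Naive first guess: the maximum normal RE.
        best_t = self.re_max
        best_acc = self.accuracy(res, labels, best_t)
        k = 1
        # If the naive guess is imperfect, step down towards the fading mean.
        while best_acc < 1.0:
            t = self.re_max - k * self.step
            if t < self.fading_mean():
                break
            acc = self.accuracy(res, labels, t)
            if acc > best_acc:
                best_t, best_acc = t, acc
            k += 1
        self.threshold = best_t
        return best_t
```

For a chunk that contains only normal samples the search stops immediately at the maximum RE, while a mixed chunk triggers the stepped search between the maximum and the fading mean.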

Active Stream Framework
The proposed autoencoder anomaly detector is a semi-supervised method requiring class labels to be known. Class annotation is also important to detect changes in the data stream that require learning to occur in order for the model to adapt. Given the infinite nature of a data stream, labeling all samples is infeasibly expensive, therefore AL methods were explored to minimise the labeling cost for both updating the model and threshold, whilst identifying and adapting to changes in the data stream.
Žliobaitė et al. [3] proposed an active stream framework, which combines change detection with a labeling strategy and a fixed budget B. Algorithm 2 gives the active stream framework evaluated in this research. The active learning strategy is an important part of the framework as it determines whether or not the current data sample (X_i, y_i) should be labeled. Blind, random, uncertainty, variable uncertainty and split strategies were evaluated in this research [3], [21], [22]. The framework maintains a running estimate of label usage û_i over a fading window, calculated by equation 3, where w is the size of the fading window and label_i is the labeling decision, either 0 or 1, at time i. The spending estimate b̂ is then calculated from û_i over w, given in equation 4 [3]. During this evaluation, w was set to 1000.
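The budget tracking can be sketched as below. It assumes the recursive form commonly given for this framework in [3], where the usage estimate is decayed by (w − 1)/w before the current labeling decision is added, and the spending estimate is the usage divided by w. The class name `BudgetTracker` and its methods are hypothetical.

```python
class BudgetTracker:
    """Fading-window estimate of label spending (assumed form, after [3])."""

    def __init__(self, budget, window=1000):
        self.budget = budget  # labeling budget B, as a fraction of the stream
        self.window = window  # fading window size w
        self.usage = 0.0      # running usage estimate

    def spent(self):
        """Spending estimate: usage normalised by the window size."""
        return self.usage / self.window

    def within_budget(self):
        return self.spent() < self.budget

    def record(self, labeled):
        """Decay the old usage and add the current labeling decision (0 or 1)."""
        decay = (self.window - 1) / self.window
        self.usage = self.usage * decay + (1.0 if labeled else 0.0)
```

A strategy would only be consulted while `within_budget()` holds, and `record()` would be called once per sample with the final labeling decision.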
The labeled samples are then used to train the model and perform change detection. If a warning signal is received then a new autoencoder (AE_L) is trained with the most recent examples, and when a change is signaled, the current model is replaced with AE_L, completing adaptation to the new concept. For this evaluation the Drift Detection Method (DDM) [25] change detector was used.
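The warning and change handling can be sketched as follows. `SimpleDDM` is a simplified, self-contained stand-in written for this illustration in the spirit of DDM, not the scikit-multiflow implementation; `adaptive_stream` and `CountingModel` are hypothetical, and the per-sample errors are supplied directly rather than computed from model predictions.

```python
import math


class SimpleDDM:
    """Simplified drift detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, change_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.change_level = change_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0  # running error rate
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(self.p * (1.0 - self.p), 0.0) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s  # remember the best level seen
        if self.p + s > self.p_min + self.change_level * self.s_min:
            self.reset()
            return "change"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


class CountingModel:
    """Toy model (hypothetical) that only counts the samples it trained on."""

    def __init__(self):
        self.seen = 0

    def partial_fit(self, xs):
        self.seen += len(xs)


def adaptive_stream(samples, model, new_model, detector):
    """Train a background model from the warning onwards; swap it in on change."""
    background, replaced = None, 0
    for x, error in samples:
        state = detector.update(error)
        model.partial_fit([x])
        if state == "stable":
            background = None  # warning subsided, discard the candidate
        elif background is None:
            background = new_model()
        if background is not None:
            background.partial_fit([x])
        if state == "change":
            model, background = background, None
            replaced += 1
    return model, replaced
```

The replacement model has only seen samples since the warning, which matches the idea that it represents the new concept rather than the old one.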

Active Learning Strategies
The following section outlines the active learning strategies evaluated in this research. Žliobaitė et al. [3] outlined three objectives of active learning strategies, which will need to be met by any proposed strategies: 1) balance the labeling budget B over infinite time; 2) detect changes anywhere in the instance space; 3) preserve the distribution of incoming data for detecting changes.
A random active learning strategy randomly selects a sample to label based on Bernoulli probability with a given budget B. The random strategy satisfies all three objectives of [3].
The uncertainty strategy labels a sample based on the level of uncertainty from the classifier compared to a threshold, and attempts to label the samples where there is the least confidence [3]. A common approach is to use the classifier's predicted probability for class c compared to the threshold θ: P(y_c|X) ≤ θ [3], [21], [22].
Autoencoders do not provide a direct class probability, instead they provide a reconstruction error from which a normal or anomaly classification decision can be made. This research proposed a novel method whereby the RE squared difference from the anomaly threshold φ is used as a measure of uncertainty, equation 5, assuming the hypothesis that the lower the difference compared to the average of the population, then the greater the uncertainty for the sample. The difference is squared to make all values positive.
In order to accommodate changes in the data stream and avoid a scenario where the strategy stops learning due to high variance, a fading factor α was used to produce a fading average of differences d_avg, calculated using equation 6. This allowed for the more recent samples to have a greater bearing on the strategy outcome.
Using d_avg, the fading standard deviation d_std of the stream was calculated using equation 7.
Finally, the strategy returned a labeling decision of 1 where d_i < d_avg − θ·d_std (equation 8), requiring a sample to be below the average by θ standard deviations, where θ was the confidence threshold. θ = 2 should capture samples where the difference is in the lowest 5% of all samples. The uncertainty strategy is given in algorithm 3, whereby the autoencoder model AE is used to predict the RE for sample X_i, and the fading average and standard deviation of the difference from the anomaly threshold φ over the stream are used to provide a label output of 0 or 1 based on equation 8. On its own, an uncertainty strategy cannot satisfy all three active learning objectives: the number of labeled samples depends on the amount of uncertainty within the data stream and could rise above the intended budget (this is instead limited by line 2 of algorithm 2); only samples within the uncertainty margin are labeled, so changes occurring outside of the margin will be missed; and change detection will be based on the distribution of the uncertain samples that are trained on [3]. However, the strategy should reflect regions where real concept drift is occurring, as higher uncertainty could reflect a change, resulting in faster adaptation times [21], [22].
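The distance-from-threshold uncertainty rule can be sketched as below. It assumes the fading average and fading standard deviation are derived from fading sums of the squared difference and of its square, which is one plausible form for the fading statistics; `UncertaintyStrategy` and `should_label` are hypothetical names.

```python
import math


class UncertaintyStrategy:
    """Label samples whose RE is unusually close to the anomaly threshold."""

    def __init__(self, confidence=2.0, fading=0.99):
        self.confidence = confidence  # theta: number of standard deviations
        self.fading = fading          # alpha: fading factor
        self.sum_d = 0.0              # fading sum of squared differences
        self.sum_d2 = 0.0             # fading sum of their squares
        self.count = 0.0              # fading count

    def should_label(self, re, threshold):
        d = (re - threshold) ** 2  # squared distance of RE from the threshold
        self.sum_d = d + self.fading * self.sum_d
        self.sum_d2 = d * d + self.fading * self.sum_d2
        self.count = 1.0 + self.fading * self.count
        d_avg = self.sum_d / self.count
        variance = max(self.sum_d2 / self.count - d_avg ** 2, 0.0)
        d_std = math.sqrt(variance)
        # Request a label only when the sample sits well below the average distance.
        return d < d_avg - self.confidence * d_std
```

A sample whose RE lands almost exactly on the threshold, after a run of samples far from it, is the case that triggers a label request.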
Algorithm 3: Uncertainty Strategy
Input: Confidence θ, Fading Factor α, X, autoencoder AE, Threshold φ
Output: label

Variable uncertainty is based on the uncertainty strategy, but instead of using a fixed confidence θ, the confidence is varied depending on the amount of labeling being requested by the strategy, so that more labels will increase the confidence and fewer will decrease it, attenuating the labeling to better manage the budget [3]. This approach also has the benefit that it is not limited to a fixed labeling ceiling and can better utilise higher budgets to accurately identify concept drift [22]. Similar to the uncertainty strategy, it does not satisfy all three requirements [3].
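A variable confidence can be sketched as follows. The multiplicative update by a step s is an assumption borrowed from the general form of variable uncertainty in [3]: here each label request raises θ by a factor (1 + s) and each non-request lowers it by (1 − s). The fading statistics are the same assumed form as for the fixed strategy, and all names are hypothetical.

```python
import math


class VariableUncertainty:
    """Uncertainty strategy whose confidence adapts to the labeling rate."""

    def __init__(self, confidence=2.0, fading=0.99, step=0.01):
        self.confidence = confidence
        self.fading = fading
        self.step = step
        self.sum_d = 0.0
        self.sum_d2 = 0.0
        self.count = 0.0

    def should_label(self, re, threshold):
        d = (re - threshold) ** 2
        self.sum_d = d + self.fading * self.sum_d
        self.sum_d2 = d * d + self.fading * self.sum_d2
        self.count = 1.0 + self.fading * self.count
        d_avg = self.sum_d / self.count
        d_std = math.sqrt(max(self.sum_d2 / self.count - d_avg ** 2, 0.0))
        label = d < d_avg - self.confidence * d_std
        # More labels tighten the rule; fewer labels relax it.
        self.confidence *= (1.0 + self.step) if label else (1.0 - self.step)
        return label
```

During a long run without label requests the confidence decays, so the strategy becomes progressively more willing to ask for a label.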
The split strategy, given in algorithm 4, combines the random and variable uncertainty strategies to benefit from their respective strengths of accessing the entire stream distribution for change detection, and adapting to potential change in higher regions of uncertainty. Due to the incorporation of the random strategy, this also meets all three requirements of [3].

Algorithm 4: Split Strategy
Input: Label Budget B, Confidence θ, Fading Factor α, X, autoencoder AE, Threshold φ, Step s
Output: label
label ← 0;
if randomStrategy(B) = True then
    label ← 1;
else if varUncertaintyStrategy(θ, α, X_i, AE, φ, s) = True then
    label ← 1;

The proposed Split Active Learning Anomaly Detector (SALAD) method is depicted in figure 2. This method reduces the labeling cost of the data stream to a fixed budget by adopting an active learning strategy to determine which labels should be updated, satisfying the requirements of Žliobaitė et al. [3]. Labeled samples are used to train the anomaly detector and the predictions input to a change detector which monitors for real concept drift occurring in the data stream [2]. Where real concept drift occurs, the current anomaly detector is replaced with a new one that has been trained on samples since a warning signal was produced. The result of this method is faster training of the anomaly detector and the ability to quickly adapt to changes occurring in the data stream.
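The control flow of Algorithm 4 can be sketched as below. The uncertainty check is passed in as a callable so that the example stays self-contained, and the second return value, which marks whether the label came from the random branch and is therefore suitable for change detection, is an assumption based on the split strategy description in [3]. The function names are hypothetical.

```python
import random


def random_strategy(budget, rng):
    """Bernoulli draw: request a label with probability equal to the budget B."""
    return rng.random() < budget


def split_strategy(budget, is_uncertain, rng):
    """Return (label, from_random_branch) following the two-branch split rule."""
    if random_strategy(budget, rng):
        return 1, True   # randomly chosen: unbiased view of the stream
    if is_uncertain():
        return 1, False  # chosen because the sample is near the threshold
    return 0, False
```

For example, `split_strategy(0.2, lambda: False, random.Random(0))` requests a label on roughly one sample in five when repeated over a long stream, all of them from the random branch.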

Adaptive Anomaly Threshold
The accuracy and F1-score of the Adaptive Anomaly Threshold (AAT) method were compared to the Stochastic Anomaly Threshold with memory (SAT FF), HAT and NB algorithms. SAT FF is a novel modification of the SAT algorithm that updates the threshold based on a fading average [30] of previous thresholds, providing memory when processing a data stream. The parameter values for the autoencoder methods are given in table 1, where p represents the dropout probability; l is the number of hidden layers; h is the ratio of hidden units to visible units; opt is the optimiser used to train the network with α learning rate; β is the threshold sensitivity; α is the fading factor; and v is the step size. The NB and HAT algorithms used the scikit-multiflow default parameters [29].

[...] close to HAT in terms of mean performance, with better kappa and F1 metrics when taken as an average across all batches, as shown in table 2. SAT FF and AAT were also significantly faster with a total running time (RT) of 14.04s and 19.18s, compared to 510.93s and 794.76s with NB and SAT, respectively. Note that running time will vary based on the underlying system performance and frameworks used; however, the time of SAT FF is an order of magnitude better compared to both NB and HAT algorithms. Overall, AAT returned the best mean accuracy and kappa results, kappa being an important metric for data stream learning.

As demonstrated in our previous work [31], the UNSW-NB15 data set proved to be more challenging for on- [...] Table 3 gives average accuracy of the SAT and SAT FF algorithms as 70.39% and 62.96%, respectively, which is considerably lower than that of NB and HAT. AAT returned the highest overall accuracy of the anomaly threshold methods, at 86.31% with 3 layers and dropout probability of 0.2, although kappa was lower, demonstrating reduced confidence in the anomaly decision for all methods.
The results show that AAT is able to provide near equivalent performance to NB and HAT methods with a significantly lower running time.
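The memory mechanism described for SAT FF relies on a fading average of previous thresholds. One common formulation of a fading average keeps a discounted sum and a discounted count. The Python sketch below follows that formulation as an assumption, since the exact update used in this work is not reproduced here, and the class name is hypothetical.

```python
class FadingAverage:
    """Fading (exponentially discounted) average of a stream of values.

    Assumed update rule:
        S_i = x_i + alpha * S_{i-1}
        N_i = 1   + alpha * N_{i-1}
        M_i = S_i / N_i
    With alpha = 1 this reduces to the ordinary running mean.
    """

    def __init__(self, fading_factor: float):
        if not 0.0 < fading_factor <= 1.0:
            raise ValueError("fading_factor must be in (0, 1]")
        self.fading_factor = fading_factor
        self._sum = 0.0
        self._count = 0.0

    def update(self, value: float) -> float:
        """Add a new value (e.g. the latest chunk threshold) and return the average."""
        self._sum = value + self.fading_factor * self._sum
        self._count = 1.0 + self.fading_factor * self._count
        return self._sum / self._count


# Usage: smooth per-chunk thresholds so that a single unusual chunk
# does not move the anomaly threshold too abruptly.
smoothed = FadingAverage(fading_factor=0.9)
for chunk_threshold in [0.10, 0.12, 0.50, 0.11]:
    current_threshold = smoothed.update(chunk_threshold)
```

A smaller fading factor forgets old thresholds faster, trading stability for quicker adaptation to change in the stream.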

Labeling Budget
The effects of the labeling budget were evaluated with the random strategy as this is the only strategy to maintain [...] 5 and mean accuracy plotted against the blind adaptation AAT approach for comparison in figure 5. The greater the labeling budget, typically the higher the accuracy, kappa and F1 scores, the exception being UNSW-NB15 where B = 0.5 has a slightly higher accuracy and kappa. The difference in accuracy between 20% and 100% labels is 0.76% (KDD'99) and 2.69% (UNSW-NB15), demonstrating a small loss in performance for an 80% saving in labeling cost and an approximate running time reduction of 54-62%. This reflects the results of Žliobaitė et al. [3], where a small loss of accuracy was observed between a B of 100% and 10% when tested with a number of non-cyber data sets.
Compared to the blind adaptation of previous experiments, whereby no active learning is used, a labeling budget of 0.5 achieved a higher accuracy and F1 for half the labeling cost on both data sets. ASF RAND 1.0 is equivalent to the blind approach with full labels, but with the addition of change detection, with average accuracy and F1 improved across both data sets, although lower towards the end of the UNSW-NB15 stream as shown in figure 5b. Note the lower running time of the blind approach due to use of a chunk size of 100 vs 10, which influences the number of gradient updates and hence the training time of the network.

Active Learning Strategies
The results of each active learning strategy with a budget of 0.2 (20%) are given in table 5, with accuracy and F1-score for both data sets plotted in figure 6. Each strategy was executed 5 times, with the average and standard deviation presented. The worst performing strategy was the fixed uncertainty strategy, reflecting the results of Žliobaitė et al. [3]. This was expected, as the algorithm is biased towards uncertain samples only and cannot vary the number of samples labeled, meaning that change occurring outside the fixed margin will be missed. It is also possible that the RE value of normal samples outside the margin increases as the AE is trained more on uncertain samples, leading to more false positives and a lower F1-score. The split strategy, which combines the random and variable uncertainty strategies, returned the best results across both data sets. Note that its total running time is between that of the random and variable uncertainty strategies, indicating time complexity savings where uncertain samples were first selected by the random strategy. The kappa of the split strategy was 0.717 (table 5) for the UNSW-NB15 data set, much higher than that of the blind AAT, NB, HAT and other AL strategies, indicating a higher level of confidence in the anomaly decisions.
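To make the role of the kappa statistic concrete, the sketch below computes Cohen's kappa from a binary confusion matrix. It assumes the kappa reported here is Cohen's kappa over normal/anomaly decisions. The function name and the example counts are illustrative and are not taken from the paper's results.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a binary confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    (accuracy) and p_e is the agreement expected by chance.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    observed = (tp + tn) / total
    # Chance agreement: product of marginal rates, summed over both classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total * total)

    if expected == 1.0:
        # Degenerate case (one class only); returning 0.0 is a choice of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# A detector that always predicts "normal" on a stream with 10% attacks
# reaches 90% accuracy but carries no information beyond chance.
always_normal = cohen_kappa(tp=0, fp=0, fn=10, tn=90)
```

This is why kappa is a useful companion to accuracy on imbalanced streams: it discounts agreement that a trivial majority-class predictor would also achieve.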

DISCUSSION
This research evaluated online anomaly detection using a prequential evaluation method, whereby the model is first tested on the next sample or chunk in the stream before being trained on it. The anomaly threshold is a key parameter for anomaly detection and finding an optimal threshold for a data stream is non-trivial. A number of methods for finding the threshold were compared, including fixed, naïve, stochastic and adaptive techniques. The adaptive anomaly threshold (AAT) was introduced as a novel hybrid of the naïve and stochastic methods in order to better adapt to chunks of normal or anomaly samples based on initial observed accuracy. Overall, AAT outperformed the other methods and is a contribution of this research that is recommended for further exploration.
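The prequential (test-then-train) procedure can be sketched as a short loop. The example below is a generic illustration and not the evaluation code used in this research. The `MajorityClass` model is a hypothetical stand-in for the anomaly detector, and the sketch processes single samples rather than chunks.

```python
from typing import Any, Iterable, Tuple


class MajorityClass:
    """Toy stream model: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts = {0: 0, 1: 0}

    def predict(self, x: Any) -> int:
        return 1 if self.counts[1] > self.counts[0] else 0

    def learn(self, x: Any, y: int) -> None:
        self.counts[y] += 1


def prequential_accuracy(model: MajorityClass,
                         stream: Iterable[Tuple[Any, int]]) -> float:
    """Test each sample before training on it and return the overall accuracy."""
    correct = 0
    seen = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first ...
        model.learn(x, y)                      # ... then train on the same sample
        seen += 1
    return correct / seen if seen else 0.0


# Usage: every sample is used for both testing and training, in that order,
# so no separate hold-out set is needed.
stream = [(None, 0), (None, 0), (None, 1), (None, 1), (None, 1)]
accuracy = prequential_accuracy(MajorityClass(), stream)
```

Because each prediction is made before the corresponding label is seen, the resulting score reflects performance on unseen data throughout the stream.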
The results observed with the KDD'99 data set and AAT threshold method provide strong evidence that the hypothesis of effective anomaly detection for network data streams can be supported by the autoencoder method with both strong detection and run time performance compared to traditional methods. UNSW-NB15 results could be strengthened by further design choices.
The AAT method makes use of blind adaptation, whereby the model is trained on all labeled samples. This has the drawback of high cost due to full labels and slow adaptation times to change occurring in the data stream. The research further explored change detection and active learning strategies, as outlined by Žliobaitė et al. [3], to further improve performance for a lower overall cost.
An ASF framework was implemented along with the random, uncertainty, variable uncertainty and split active learning strategies. With the uncertainty strategy, a new method for AE was proposed, whereby the average RE difference from the threshold is used as a baseline to detect samples with high uncertainty, defined as being in the proportion of the population with the smallest difference, tuned by a confidence parameter.
The use of ASF demonstrated that better accuracy, kappa and F1 scores can be achieved, compared to blind adaptation, with just 20% of the labeling cost, enabled by active learning of the most important samples to accelerate the learning process [3]. The results align with those presented by Žliobaitė et al. [3], with a split strategy being recommended as it fulfills all three active learning requirements: maintaining a fixed budget, accessing all samples within the stream and preserving the distribution of incoming data for detecting changes. Unlike Žliobaitė et al. [3], this research recommends inclusion of the uncertain samples with the change detection to improve per class performance.

CONCLUSION
The aim of this research was to explore semi-supervised online autoencoder methods for the task of anomaly intrusion detection on non-stationary network data streams, adapting to concept drift over time, with minimal labeling cost, by adopting an active learning change detection strategy. A unique contribution of this research was to compare a selection of anomaly threshold methods, proposing memory adaptations for data streams and a hybrid Adaptive Anomaly Threshold method which demonstrated superior performance. One of the more striking findings of the research is that the processing time of the autoencoder anomaly detector method is significantly lower than that of traditional online learning techniques, making it well suited to high speed online network data streams: it detected an equivalent number of cyber attacks to traditional online learning methods in a significantly reduced time frame. An area of future research would be to explore alternative threshold methods, such as clustering, which may allow for better identification of classes that overlap with normal samples, and multi-label classification.
A further contribution of this research was to evaluate the autoencoder method with an Active Stream Framework, allowing the labeling cost of the data stream to be significantly reduced to a budget of 20%. A novel variable uncertainty strategy was proposed for autoencoders where the posterior probability is not available, instead tracking the distribution of sample RE distances from the anomaly threshold to determine uncertainty. An area of future research should be how to efficiently annotate samples, possibly by unsupervised clustering methods such as those demonstrated by [32].
Overall, this research has demonstrated that the proposed Split Active Learning Anomaly Detector (SALAD) method can achieve high levels of performance with network data streams while significantly reducing the labeling cost. The results are not perfect, however, and it is recommended to combine the method in a hybrid intrusion detection model, whereby misuse detection is used before or after the anomaly detector to further identify classes, reduce false positives and better identify minority classes. Multi-label classification would be a further research area to expand on this work and provide additional context to detections.