Auto-Threshold Deep SVDD for Anomaly-based Web Application Firewall



I. INTRODUCTION
The number of attacks on applications that provide services on the Internet has dramatically increased. Websites and web applications use the hyper-text transfer protocol (HTTP) to deploy internet-based communications and are subject to a number of attacks such as XSS and SQL injection. A web application firewall (WAF) is a special type of application firewall applied to web applications for protecting, monitoring, analyzing, and filtering HTTP traffic. WAF plays an important role in detecting and preventing attacks on those applications.
As a solution, a WAF applies a set of previously defined rules, known as signatures, which allows the detection of vulnerabilities and malicious HTTP requests. Due to the changing nature of attacks, the signatures must be updated whenever new attacks are discovered; until such an update, a newly discovered attack may have adverse effects on the applications. This gap is known as the zero-day attack problem. To address it, researchers have suggested technical approaches based on anomaly detection that determine malicious patterns appearing in HTTP requests: a request is rejected provided that its anomaly score exceeds a predefined threshold. The foremost problem is that malicious patterns in data do not conform to a well-defined notion of normal behavior. Anomaly detection, which is able to detect zero-day attacks, is one solution to this problem.
To overcome this problem, several machine learning techniques have been presented for anomaly detection in a variety of applications. Usually, the data patterns do not conform to the assumptions of common statistical analysis; therefore, anomaly detection approaches such as machine learning methods must be appropriately trained in these domains. Recently, neural networks have been used by researchers and have shown some key advantages in dealing with high-dimensional data [1], [2].

Manuscript received December 1, 2012; revised August 26, 2015. Corresponding author: M. Teshnehlab (email: teshnehlab@eetd.kntu.ac.ir).
Several anomaly detection domains, such as detecting malicious URLs, often involve high-dimensional or even ultrahigh-dimensional data [3]. Thus, we must apply feature engineering that can model a suitable representation preserving the relevant features [4]. Recently, deep learning techniques have shown impressive performance in capturing complex structures in high-dimensional data when applied to anomaly detection [5]. As for HTTP, we face complex hypermedia data: textual information encoded in ASCII format, which can span multiple lines and includes a start line, HTTP header fields, a blank line, and an HTTP body. To analyze such non-stationary and non-structured data, we have to use text mining and natural language processing (NLP) techniques to extract features, but the resulting representations can be too high dimensional to be learned by traditional machine learning methods [6]. Hence, we employ a deep model as a powerful feature extractor to relieve this problem.
Briefly, we introduce the main contributions of the proposed algorithm in this paper as the following consecutive tasks: 1) Using end-to-end Deep support vector data description (Deep SVDD) to obtain high-level normal feature vectors. This approach is a deep anomaly detection method inspired by kernel-based one-class classification and minimum volume estimation [7]. 2) Obtaining the threshold during the Deep SVDD learning process. For this purpose, we introduce a novel cost function that gives the advantage of discriminating anomalous from normal points. 3) Creating a feature set using n-gram and one-hot methods; in addition, extracting a one-hot feature set from HTTP requests using a convolutional neural network (CNN). The remainder of the paper is organized as follows. In Section II, we briefly describe several related works, including one-class classifiers for anomaly detection. Then we review deep learning models for anomaly detection and highlight related works on WAFs based on machine learning, especially deep neural networks. The proposed method is presented in Section III, followed by Section IV showing the evaluation and experimental results. The discussion comparing different methods is presented in Section V. Finally, Section VI draws conclusions and suggests future work.

II. RELATED WORKS
Many articles have reviewed anomaly detection in several applications [2]. Also, the survey [5] studied deep anomaly detection techniques, and particularly intrusion detection systems [8]. We first briefly discuss one-class classifiers, which play a major role in anomaly detection. Next, considering the importance of deep learning in artificial intelligence, we review works on neural networks and deep anomaly detection. Then, we present solutions for WAFs based on machine learning and especially deep learning.

A. One-class classifiers

1) Reconstruction error (RE)
Auto-encoder (AE) is an unsupervised neural network whose objective is to learn to reproduce the input vector x as the output x̂ so as to minimize the reconstruction error, which is defined in (1) for a D-dimensional entry [9]:

RE(x) = \sum_{i=1}^{D} (x_i - \hat{x}_i)^2    (1)
One of the important applications of the AE is anomaly detection [10]. The AE represents features in a new space, estimating the regular data features in a small number of neurons in a hidden layer. As the AE parameters are fitted and adjusted based on the normal data, normal instances have lower reconstruction errors than abnormal instances. Therefore, the higher the reconstruction error, the higher the possibility of that data point being an anomaly.
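For concreteness, the scoring rule above can be sketched in a few lines of NumPy. This is an illustrative sketch only: `reconstruct` stands in for a trained auto-encoder (it simply projects onto the dominant axis of the toy data), and the mean-plus-three-standard-deviations threshold is one common heuristic, not the method proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained auto-encoder: "normal" data lies near the first
# axis, and reconstruction keeps only that dominant component.
def reconstruct(X):
    X_hat = np.zeros_like(X)
    X_hat[:, 0] = X[:, 0]
    return X_hat

def reconstruction_error(X):
    # Eq. (1): squared differences summed over the D dimensions
    return np.sum((X - reconstruct(X)) ** 2, axis=1)

X_train = np.column_stack([rng.normal(0.0, 1.0, 500),
                           rng.normal(0.0, 0.1, 500)])
x_anomaly = np.array([[0.0, 5.0]])   # large component off the normal axis

train_err = reconstruction_error(X_train)
threshold = train_err.mean() + 3.0 * train_err.std()
is_anomaly = reconstruction_error(x_anomaly)[0] > threshold
```

The anomalous point reconstructs poorly because its energy lies off the retained axis, so its error far exceeds the training-error threshold.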
2) Density estimation

Density-based techniques perform anomaly detection by estimating the density of the training data. One density estimation approach that can be used for constructing an anomaly detection model is kernel density estimation (KDE) [11]. In statistics, KDE is a non-parametric way to estimate the probability density function of a random variable. Let {x_1, x_2, ..., x_n} be an independent and identically distributed sample drawn from some distribution with an unknown density f. We are interested in estimating the shape of this function f. Its kernel density estimator is given by (2):

\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)    (2)
where K is the kernel, a non-negative function, and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and is defined as K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right).
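A minimal NumPy sketch of the estimator in (2), assuming a Gaussian kernel; the bandwidth h = 0.3 and the sample sizes are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h):
    # Eq. (2): f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h)
    u = (x - sample[:, None]) / h
    return gaussian_kernel(u).sum(axis=0) / (len(sample) * h)

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, 2000)   # i.i.d. draws; density "unknown" to us

# A low estimated density at a query point can serve as an anomaly signal:
# the first query sits in the bulk of the data, the second in the far tail.
dens = kde(np.array([0.0, 4.0]), sample, h=0.3)
```

The estimated density near the mode is three orders of magnitude larger than in the tail, which is exactly the contrast an anomaly detector thresholds on.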

3) One-class support vector machine (OCSVM)
The SVDD method takes a spherical, rather than planar, approach [12]. The OCSVM classifier takes another approach: it separates the data points from the origin of the feature space and maximizes the distance from the separating hyperplane to the origin, according to the quadratic programming minimization problem in (3):

\min_{\omega, \xi, \rho} \; \frac{1}{2}\|\omega\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \quad \text{s.t.} \quad \langle \omega, \phi(x_i) \rangle \geq \rho - \xi_i, \; \xi_i \geq 0    (3)

where (ω, ρ) are a weight vector and an offset parametrizing a hyperplane in the feature space associated with the kernel.

4) Ensemble isolation forest (IF)
This method partitions the data using random trees until all instances are isolated from each other. First, the algorithm builds isolation trees using sub-samples of the training set; later, it calculates the isolation path length of an observation through the isolation trees to obtain the anomaly score for each instance. The anomaly score is defined as (4):

s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}    (4)

where h(x) is the path length of instance x, c(n) is the average path length of an unsuccessful search in the tree, E(h(x)) is the average path length of the instance across different trees, and n is the number of data points in a chosen sub-sample.
To build a tree, the algorithm selects a feature randomly, and then selects a split value in the value range of the selected feature in order to isolate observations [13].
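The score in (4) can be evaluated directly once E(h(x)) is known; the sketch below uses the standard harmonic-number approximation for c(n), an assumption carried over from the isolation forest literature rather than stated in this paper:

```python
import math

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    # Average path length of an unsuccessful BST search over n points,
    # using the harmonic-number approximation H(i) ~ ln(i) + gamma.
    return 2.0 * (math.log(n - 1) + GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(E_h, n):
    # Eq. (4): s(x, n) = 2^(-E[h(x)] / c(n))
    return 2.0 ** (-E_h / c(n))

n = 256                                       # sub-sample size
short_path = anomaly_score(E_h=3.0, n=n)      # isolated quickly -> suspicious
average_path = anomaly_score(E_h=c(n), n=n)   # typical depth -> score 0.5
```

A point isolated after only a few splits scores close to 1, while a point with an average path length scores exactly 0.5, matching the usual interpretation of the isolation forest score.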

5) Elliptic envelope
This method tries to find the shape of the data and marks as anomalous the data points that are far from that shape. Thus, it fits the smallest-volume ellipsoid to the data using a robust covariance estimate, by modifying the free parameter contamination [14]. The main task of the covariance elliptic envelope is finding the subset of observations whose covariance matrix has the lowest determinant.
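The distance criterion behind the elliptic envelope can be illustrated with a plain Mahalanobis distance under the empirical covariance; a real elliptic envelope would use a robust (minimum covariance determinant) estimate instead, so this is only a sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2-D "normal" data: an elliptical, not spherical, cloud.
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_sq(p):
    # Squared Mahalanobis distance from the fitted ellipsoid's center
    d = p - mu
    return float(d @ cov_inv @ d)

# A point lying against the correlation is flagged even though its plain
# Euclidean distance from the mean is comparable to an inlier's.
inlier = mahalanobis_sq(np.array([1.0, 1.0]))
outlier = mahalanobis_sq(np.array([1.5, -1.5]))
```

Both test points are roughly equally far from the mean in Euclidean terms, yet the point that violates the data's correlation structure gets a far larger distance, which is why the elliptical shape matters.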

B. Neural network and deep anomaly detection algorithms

a) Hybrid deep neural networks (DNNs) and one-class classifiers: Since the problem known as the 'curse of dimensionality' is an obstacle for many anomaly detection methods, this technique uses DNNs as feature extractors. The compressed features extracted from the hidden representations of auto-encoders or restricted Boltzmann machines (RBMs) are input to traditional anomaly detection algorithms to detect outliers [15]. For example, according to Figure 1(a), auto-encoders reduce the features and then one-class classifiers such as the one-class support vector machine (OC-SVM) [16] or kernel density estimation (KDE) [17] detect outliers; alternatively, RBMs are applied to extract features and then OCSVM detects anomalies [16].

b) Self-representation: As we mentioned, for anomaly detection the AE is trained on the normal data [18]. AEs play a fundamental role in unsupervised learning and in deep architectures for transfer learning and other tasks. Figure 1(b) shows that the reconstruction error must be compared with a threshold value in the test phase. When the layers become deeper, the network is able to extract more complex features than traditional shallow networks. In the stacked auto-encoder (SAE) technique, where each layer is trained one after another, AEs are stacked into hidden layers by an unsupervised layer-wise learning algorithm.
c) Variational auto-encoder (VAE): VAE is a generative machine learning technique that deals with models of distributions of data points in some potentially high-dimensional space [19]. The objective function of a VAE is the variational lower bound of the marginal likelihood of the data, since the marginal likelihood itself is intractable. The reconstruction probability from the VAE can be considered an anomaly score [20]. A shrink regularization can be added to the loss function of the auto-encoder, designed to penalize normal data points whose vectors in the latent space have a large magnitude; that is, it restricts the normal data to lie close to the origin [21]. The VAE then attempts to encode the data so that it is distributed as a standard Gaussian in the latent space (Figure 1(c)).

d) Generative adversarial networks (GANs): Here the model is trained adversarially, with the ability to encode and reconstruct instances, according to Figure 1(d). One part, the generator, outputs random samples from a random latent vector, and the other, the discriminator, classifies normal and outlier samples [22], [23], [24].
e) SVDD and Deep SVDD: This method obtains a spherically shaped boundary around normal data sets. The aim is to minimize the volume of the hyper-sphere by minimizing the squared radius, with the constraint that the sphere contains all training samples [25]. Furthermore, Deep SVDD learns a neural network transformation that attempts to map most of the data representations into a hypersphere characterized by the SVDD conditions (Figure 1(e)) [7]. In another study, the one-class neural network (OC-NN) is a neural architecture using a loss function based on the OCSVM [26].

C. WAF based on anomaly detection
The first study of intrusion detection for web application security was introduced by Kruegel and Vigna [27]. They analyzed the statistical characteristics of the queries and parameters of HTTP requests, and extended their work with a multi-model approach in 2005 [28]. Many types of research apply machine learning to investigate this issue; thus, in this section we introduce a number of related studies.
a) WAF based on traditional machine learning: Affinity propagation (AP), an unsupervised method, has been used on real HTTP traffic streams [29], [30], with the character distribution of each source in HTTP requests as the attributes. Due to the importance of minimizing false positives, probabilistic automata such as Markov models have been used in many studies to detect web attacks [31], [32], [33]. First, HTTP requests are tokenized, then the tokens are fed to a deterministic finite automaton (DFA) algorithm [31], while papers [32] and [33] use Markov chain models based on features constructed with the n-gram method. Furthermore, the extreme learning machine (ELM), one of the neural network models [34], and binomial logistic regression [35] have classified normal and anomalous web traffic on tokenized requests. In addition, a probability distribution model, a Hidden Markov Model (HMM), and a one-class support vector machine (OCSVM) model have detected web attacks on attribute values [36].
b) WAF based on deep learning: Recurrent neural networks (RNNs), SAEs, DBNs, and CNNs are various categories of deep neural networks used for web application firewalls. In [37], a character-level CNN is used for web attack detection; the authors used supervised learning for anomaly detection, with training data labelled as normal or attack. Many researchers have applied SAEs [38], [39], [40] and denoising SAEs [41] to implement WAFs for web attack detection, which is applicable for extraction and dimensionality reduction of feature vectors. In addition to using SAEs, we used a DBN model for feature extraction in [16]; we then used three one-class classifiers, namely OC-SVM, ensemble isolation forest, and covariance elliptic envelope, for the detection of attacks, and compared the different methods with each other. Deep RNNs, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were used for a WAF in [42]; these can be efficient for sequential and time-dependent data such as HTTP requests. This model takes uniform resource locators (URLs) in HTTP GET requests as the input. Subsequently, the URLs are tokenized following a particular strategy: first, two RNNs learn the normal request patterns and output their familiarity with the given URLs; then a trained neural network decides whether given requests are anomalous based on the output of the RNNs. We also proposed an auto-encoder LSTM (AE-LSTM) architecture for feature extraction from HTTP data to detect attacks in a WAF [43]. Instead of the fully connected layers of an SAE, the LSTM layers can model sequential data; thus, the one-hot model of HTTP is also used in the AE-LSTM model, and an ensemble isolation forest is applied to the final extracted features to detect attacks. Also, C-LSTM, which consists of CNN and LSTM layers, has been presented to effectively model the spatiotemporal information contained in traffic data as a one-dimensional time series signal [44].
III. PROPOSED METHOD

One-class classification algorithms are widely used for anomaly detection tasks; for this ability, a one-class classifier is used to identify unknown attacks in a WAF. This is the key reason that we use anomaly-based techniques for WAFs. In addition, HTTP data is semi-structured data with non-stationary sequences; therefore, feature engineering techniques such as NLP approaches, unlike an HTTP tokenizer, are necessary to preserve data generality. Moreover, as recent research on WAFs based on anomaly detection has worked on layer-by-layer feature extraction, the proposed algorithm uses an end-to-end technique for extracting features. In the proposed method we use the Deep SVDD model to learn normal requests in the training phase; the trained model then predicts anomalous requests, which have never been seen in the training step. The anomaly scores, which are derived from the values of the cost function, can be used to detect anomalies when attacks occur within communications. Additionally, we modify the cost function to generate a threshold automatically that defines the boundary between normal and attack requests.
An abstract view of the main steps of the proposed model, including feature engineering, a one-class classifier, and automatic threshold generation, is shown in Figures 2 and 3.

A. Feature engineering
The representation of the data can have an impressive effect on machine learning models. Feature engineering includes the efficient construction and extraction of information from the data. In terms of data type, HTTP data can be considered text, as a sequence of characters, words, and so forth. Therefore, the textual data must be mapped to real-valued vectors, transforming tokens into features.
One of the simplest techniques to numerically represent text is counting the occurrences of particular tokens in an HTTP text. However, this scenario loses the item order. The items can be characters, words, etc. For this reason, discrete text mining techniques such as the n-gram and one-hot methods, which consider the order of items, are utilized.

1) Bigram
In the first scenario, a bag of words can be used. In this method, items are counted separately throughout the text. The items depend on the counting strategy; for example, we can count alphabet characters or words. In this scenario, because of the sequential structure of text, counting items alone may not capture valuable information. The n-gram with n ≥ 2 is a strategy that counts items while considering the sequential nature of texts. The parameter n simply refers to the number of items, and the n-gram is used to capture statistical information from the data set by producing lists of sequences of items. The bigram model, i.e. n = 2, approximates the probability of a token given all the previous tokens by using only the conditional probability of one preceding token. Thus, when we use a bigram model to predict the conditional probability of the next item, the probability of an item depends only on the previous item, which is known as the Markov assumption [45]. Furthermore, we can generalize the bigram model to the trigram model (n = 3) and so forth. As n increases, the amount of data required for modelling increases. In addition, the main drawback of this method is that the feature space grows exponentially with n; choosing a large value of n leads to ill-posed problems.

Fig. 2. ATDSVDD-bigram: The scheme of the proposed anomaly detection method for WAF based on bigram.
Fig. 3. ATDSVDD-onehot: The scheme of the proposed anomaly detection method for WAF based on one-hot.
Using two benchmark corpora with n-gram character-level and word-level strategies provides evidence that character-level information discriminates better between spam and legitimate language [46]. The most important property of the character-level n-gram approach is that it avoids language-dependent tools such as tokenizers, and it also reduces the need for a large amount of storage space. For this purpose we define a matrix v with m × n dimensions, m, n ∈ N, where v_ij is equal to the frequency of occurrence of the j-th bigram item in the i-th request, with i ∈ {1, 2, ..., m} and j ∈ {1, 2, ..., n}.
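A possible construction of the matrix v, assuming the 96-character ASCII alphabet used later in the implementation setup (codes 32 to 127); the function and variable names are illustrative:

```python
import numpy as np

# 96 ASCII codes (32..127), as in the implementation setup of Section IV
ALPHABET = [chr(i) for i in range(32, 128)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
K = len(ALPHABET)                 # 96 -> 96 * 96 = 9216 bigram features

def bigram_vector(request):
    v = np.zeros(K * K)
    for a, b in zip(request, request[1:]):   # consecutive character pairs
        if a in INDEX and b in INDEX:
            v[INDEX[a] * K + INDEX[b]] += 1.0
    return v

def bigram_matrix(requests):
    # m requests -> an m x 9216 frequency matrix v, v_ij = count of bigram j
    return np.stack([bigram_vector(r) for r in requests])

V = bigram_matrix(["GET /index.html HTTP/1.1", "POST /login.php HTTP/1.1"])
```

Each row sums to the number of character bigrams in the corresponding request (length minus one, when all characters are in the alphabet).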

2) One-hot
A one-hot encoding is a representation of categorical variables. In natural language processing, a character-based one-hot vector is a 1 × k vector used to distinguish each character of the alphabet: all elements of the vector are 0 except for the one cell that indicates the specific character. For this scenario, we can stack these one-hot encodings to preserve the ordering of items. Due to the high-cardinality categorical variable problem, we consider the common model called character-level ConvNet [47] instead of a word-level one-hot representation. Thus, in the present work, we train a convolutional neural network (CNN) with multiple convolution layers on top of the one-hot vectors. In addition to parameter sharing, CNNs utilize layers with convolving filters that are applied to local features [48].
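The stacked character-level one-hot representation can be sketched as follows; the alphabet of 96 ASCII codes matches the implementation setup described later, while MAX_LEN is shortened here from the paper's 2500 for illustration:

```python
import numpy as np

ALPHABET_SIZE = 96    # ASCII codes 32..127, matching the bigram alphabet
MAX_LEN = 16          # shortened here; the paper uses a maximum of 2500

def one_hot(request):
    # Stack per-character 1 x k one-hot rows into a MAX_LEN x k matrix;
    # shorter requests are zero-padded, longer ones truncated.
    M = np.zeros((MAX_LEN, ALPHABET_SIZE))
    for pos, ch in enumerate(request[:MAX_LEN]):
        idx = ord(ch) - 32
        if 0 <= idx < ALPHABET_SIZE:
            M[pos, idx] = 1.0
    return M

X = one_hot("GET /a HTTP/1.1")    # 15 characters -> 15 one-hot rows
```

Unlike the bigram vector, this matrix depends on the request length, which is exactly the property discussed in Section V; a CNN then convolves over the row (position) axis.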

B. Deep SVDD as one-class classifier
The authors of [7] introduced a novel anomaly detection method, Deep SVDD, which is trained on normal examples by minimizing the distances of the feature maps to a center c. Minimizing the volume of the hyper-sphere forces the network to extract features, as the network must map the data points closely to the center of the sphere. Deep SVDD utilizes an end-to-end unsupervised approach, which maintains low intra-class variance in the feature space for the normal class.
The Deep SVDD objective function is as follows:

\min_{W} \frac{1}{n} \sum_{i=1}^{n} \|\phi(x_i; W) - c\|^2 + \frac{\lambda}{2} \sum_{l=1}^{L} \|W^l\|_F^2

where X ⊆ R^p and F ⊆ R^d are the input and output spaces, respectively, and c ∈ F is the center of the output space. Also, W = {W^1, W^2, ..., W^L} denotes the weights of the neural network with L ∈ N layers. Considering φ as the network, let φ(·; W) ∈ F be the output of the neural network; the objective function tries to minimize the distance between φ(·; W) and the center c, which enforces the network to extract features mapped into a vicinity of c, thereby finding a data-enclosing hypersphere of smallest size. The parameter c can be initialized as the average of φ(x; W) over the normal training inputs x.
For a given test point x ∈ X, an anomaly score function s(x) can be defined as follows:

s(x) = \|\phi(x; W^*) - c\|^2

where W^* are the network parameters of the trained model. As discriminating anomalous from normal data requires a threshold R, we can define a function to discriminate attack from non-attack requests as below:

f(x) = \begin{cases} \text{normal}, & s(x) \leq R \\ \text{anomalous}, & s(x) > R \end{cases}

Normal examples of the data are mapped close to the center c, whereas anomalous examples are mapped further away from the center. Thus, R determines the hyper-sphere boundary between normal and abnormal data.
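The scoring and decision rule can be sketched in NumPy as below. This is a toy stand-in, not the paper's model: the network φ is replaced by a fixed linear map, and the 0.997-quantile of training scores is merely one plausible choice of R, since the paper's point is precisely that choosing R is nontrivial.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a trained network phi(.; W*): a fixed linear map R^5 -> R^2.
W = np.array([[1.0,  0.0],
              [0.0,  1.0],
              [1.0,  1.0],
              [0.0,  0.0],
              [1.0, -1.0]])

def phi(X):
    return X @ W

X_train = rng.normal(0.0, 0.1, size=(200, 5))   # toy "normal" requests
c = phi(X_train).mean(axis=0)                   # center = mean of outputs

def score(X):
    # s(x) = || phi(x; W*) - c ||^2
    return np.sum((phi(X) - c) ** 2, axis=1)

# One plausible threshold heuristic: the 0.997-quantile of training scores.
R = np.quantile(score(X_train), 0.997)

x_test = np.full((1, 5), 10.0)                  # far from the training data
labels = np.where(score(x_test) > R, "anomaly", "normal")
```

The far-away test point maps well outside the training hypersphere and is labelled anomalous regardless of the exact quantile used.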

C. Threshold
The center c and radius R are the main characteristics that determine the locus of any hyper-sphere. Although c can be initialized by calculating the mean of the network outputs φ(x; W) [7], attaining an appropriate R as a threshold is a challenging problem.
As mentioned in Section III-B, to predict with a trained network on test data we have to calculate a threshold from the training data. However, because we have no information about the scattering distribution of the outputs φ(·; W) in the hyper-sphere, there is no standard way to calculate an appropriate R (denoted R*). For example, we could assume that R* is the average of all scores on the training data, or, in another scenario, that R* is the maximum score. Such a threshold provides no confidence of suitable results before being tested.
To deal with the threshold problem, we try to change the scattering distribution of the data in the hypersphere. For this purpose we present a novel cost function that forces the network to extract features that both minimize the distance between φ(·; W) and c and keep the data scattering in a Gaussian distribution with center c and standard deviation σ; σ is also updated in every optimization step (Figures 2 and 3). This enforces the mapped features to fall within the σ area around c. Therefore, this property enables us to define a reliable threshold by statistically analyzing the output cost. Accordingly, we propose the threshold R* = c + 3σ, which assures that 99.7% of the training data, with the least scatter around c, fall inside the R* area. As a result, in this case we have two parameters, c and σ, that lead to obtaining R*, whereas with the Euclidean distance measure we have only c and hence cannot calculate an appropriate R.
According to the mentioned strategies, we present the cost function as follows:

L(W) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - e^{-\frac{\|\phi(x_i; W) - c\|^2}{2\sigma^2}} \right)    (8)

where σ changes in every optimization step. This cost function tries to minimize the hyper-sphere by mapping the feature space to a Gaussian space. The optimization algorithm tries to maximize the similarity between φ(x; W) and c considering the current σ in each step. After that, we update σ according to the new feature space that was optimized in the last step, which leads to a decrease of σ. This is repeated until the minimum σ into which the feature space can fall is determined.
Another challenge that we tackle is the anomaly score problem. Because Euclidean distances define a set of numbers that is not bounded, anomaly scores cannot be defined properly: we do not know from its size alone whether a value indicates a small or large distance. Since the cost function (8) is bounded between 0 and 1, whereas the Euclidean distance has no obvious maximum bound, we can define a suitable anomaly score function that determines the probability of an anomaly. For this purpose the anomaly score can be defined as:

score(x) = 1 - e^{-\frac{\|\phi(x; W^*) - c\|^2}{2\sigma^2}}

One of the problems in threshold handling is determining whether a chosen threshold is suitable for a given application. Choosing an appropriate threshold depends on the application, which we can handle through the coefficient of the σ parameter in our model. In a Gaussian distribution, σ shows how the distribution stretches/scales: about 68% of values drawn from a normal distribution are within 1σ of the mean, about 95% are within 2σ, and about 99.7% are within 3σ; in general the bound can be written as ασ. As we increase α we can expect both TPR and FPR to decrease, where a low FPR is more desirable for our application at a reasonable TPR. Conversely, when we decrease α we expect TPR and FPR to increase.
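A small numeric sketch of the bounded score shows how the α·σ rule translates into a fixed score threshold; here the squared distances to the center c are passed in directly, which is an assumption made for illustration:

```python
import numpy as np

def exp_score(dist_sq, sigma):
    # Bounded anomaly score in [0, 1): 1 - exp(-||phi(x)-c||^2 / (2 sigma^2))
    return 1.0 - np.exp(-dist_sq / (2.0 * sigma ** 2))

sigma = 1.0
# Squared distances corresponding to 1, 2, and 3 sigma from the center
scores = exp_score(np.array([1.0, 4.0, 9.0]), sigma)

# Under the alpha*sigma rule with alpha = 3, the score threshold is fixed:
threshold = exp_score(np.array([(3.0 * sigma) ** 2]), sigma)[0]

inside = exp_score(np.array([0.25]), sigma)[0] <= threshold   # 0.5 sigma away
outside = exp_score(np.array([16.0]), sigma)[0] > threshold   # 4 sigma away
```

Because the score is monotone in the distance, thresholding the bounded score at the α = 3 value is equivalent to thresholding the distance at 3σ, but the score itself is always interpretable on a 0-to-1 scale.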

D. Proof
Deep SVDD [7] claims that a hypersphere with optimum radius R* exists in which as many of the points as possible fall. The important point is that the optimization algorithm does not consider any distribution for the data in the hyper-sphere; hence, after training, we have to apply some statistical computation such as max, min, or a quantile to obtain a threshold for the test phase. If there is skew in the distribution of the points, we have difficulty obtaining a threshold (for example, Figure 4(a)). To tackle this problem, we first change the optimization objective to force the model to extract features such that the distribution of points in the hypersphere follows a Gaussian distribution (as in Figure 4(b)). For a constant σ, we know that 1 - e^{-\|\phi(x; W) - c\|^2 / (2\sigma^2)} is minimized when \|\phi(x; W) - c\|^2 is minimized, as in Deep SVDD, which is optimal.
Second, we consider σ as a control parameter to obtain a suitable threshold. For this purpose we gradually update σ toward σ*, the minimum σ into which the points can fall. We design an iterative stochastic algorithm [49], demonstrated in Figure 5, as follows:
1) Consider σ as a constant value.
2) Minimize 1 - e^{-\|\phi(x; W) - c\|^2 / (2\sigma^2)} over W.
3) Update σ_i using the proportion of σ_{i+1}, where m is the batch size and n is the number of data points.
4) If σ_i < ε (ε is a small number), go to step 1; else terminate.
If the stated conditions are met, we can expect that σ converges to σ*.

IV. EVALUATION AND EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed framework, we compare it with state-of-the-art methods for WAF systems based on anomaly detection criteria, since stacked auto-encoders and one-class classifiers can be used for feature representation and anomaly detection, respectively. We first summarize the other WAF systems based on anomaly detection, and then we present the datasets, evaluation criteria, and experimental results.

A. Datasets
We evaluate our experiments with two public datasets: the CSIC 2010 dataset [50] and the ECML/PKDD 2007 dataset [51]. The datasets contain only HTTP traffic, and the anomaly detection models are trained on the normal traffic set.
The training sets of CSIC and ECML contain 36000 and 20000 normal requests, respectively; the test sets contain 36000 and more than 15000 normal requests, plus more than 25000 and 15000 abnormal requests, respectively.
The HTTP requests, in addition to normal traffic, include several attacks such as SQL injection, cross-site scripting (XSS), buffer overflow, file disclosure, XPATH injection, LDAP injection, information gathering, parameter tampering, and so on.

B. Evaluation criteria
Good generalization is necessary for anomaly detection, especially in WAF systems; generalization is a trade-off between the false positive and false negative rates [31]. Therefore, in addition to accuracy, we also use several other measures to evaluate the performance of the methods.

Fig. 4. (a) spherical boundaries; (b) exponential boundaries. The final projection of the data points in (a) has a skew that makes it difficult to obtain a threshold in a spherical manner: the threshold either does not cover all normal points in its space (when we choose a small threshold) or cannot fit them tightly (when we choose a larger threshold). Conversely, with the Gaussian cost function the final projection of the data is mapped into the Gaussian space, because the model is forced to extract features fitting this cost function, which allows us to choose the threshold appropriately.

C. Implementation setup
Experiments were carried out on a system with the following specifications: x86-64 GNU/Linux, Intel® Core™ i7-7700K CPU @ 4.20 GHz, GeForce GTX 1080 Ti GPU, and 47 GB RAM. Additionally, the algorithms were implemented in the Python 3.6 programming language with the TensorFlow and Keras libraries. These libraries, developed by Google [52], [53], are useful tools to design, build, and train deep learning models.
As mentioned in Sections III-A1 and III-A2, we choose two input features (bigram and one-hot) for network training. For this purpose we use 96 of the 256 ASCII codes, which are sufficient for examining individual HTTP requests [54]. Therefore, the bigram feature vector is represented by 96 × 96 = 9216 dimensions. Also, the one-hot vector dimension is 96 multiplied by the length of the request (total characters) in an individual HTTP request; the maximum length that we consider is 2500.
For the bigram and one-hot feature spaces we use two types of architecture. For the one-hot features we use the character-level ConvNet model [47] described in Table I, and the bigram feature space uses a deep MLP model with 4000, 1000, 400, and 100 neurons in its respective layers. The activation function used in each layer of our architectures is LeakyReLU [55]. Furthermore, the batch size for all methods is 64 and the number of epochs is 30.

D. Analysis of feature construction methods
In this section, the evaluations are analyzed for each feature method (bigram, one-hot), and the methods are then compared with each other on the CSIC and ECML datasets. We set the Deep SVDD learning rates to 10^-7 and 10^-5 for the bigram and one-hot feature models, respectively.
Table II reveals that the bigram method has better performance than one-hot, considering the evaluation criteria mentioned in Section IV-B. Figures 6 and 7 show the ROC curves for each method on CSIC and ECML, respectively; they also show that the bigram method obtains higher values than the one-hot method. Thus, our preferred method, chosen for further investigation, is the bigram method.

E. Analysis of methods comparison

1) Methods compared
In all approaches, we use the bigram method to construct features from HTTP requests. In addition, since the bigram feature size is large, in the following approaches we use the stacked auto-encoder (SAE) to extract efficient features for anomaly detection.
a) SAE-RE: In this method, the reconstruction error of the AE is calculated [40]. The anomaly score is the reconstruction error of the last layer, which is the accumulation of all errors. The paper [40] computed the threshold based on the average and standard deviation of the reconstruction error.
b) SAE-KDE: In this method, density estimation is applied to the last hidden layer of the SAE [17]. A Gaussian kernel with bandwidth equal to half the hidden-layer size is used for the KDE. In this method, as in the previous one, the appropriate threshold is a crucial parameter.
c) SAE-OCSVM: The OCSVM estimates the support of the distribution on the last layer of the SAE [16]. This method requires the choice of a kernel and a scalar parameter to define a frontier. The parameter ν, also known as the margin of the OCSVM, corresponds to the maximum proportion of outliers in the training data, and this value is set by the user.
d) SAE-IF: The IF measures normality based on a decision function over the last-layer features of the SAE [38], [16].
e) SAE-Elliptic: The elliptic envelope estimates the Gaussian distribution on the last-layer data of the SAE [16].

2) Experimental results
The results of our models against the other methods are listed in Table III for the CSIC and ECML data. The ROC curves of the different methods can also be compared in Figures 8 and 9 for the CSIC and ECML datasets, respectively. For a fair comparison, the dimension of all methods is reduced to 100 features, and the learning rate of the AEs and SAEs is adapted per layer [56].
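The three scikit-learn one-class baselines above (OCSVM, Isolation Forest, and the elliptic envelope) could be instantiated as sketched below, fitted on stand-in SAE features reduced to 100 dimensions; the hyper-parameter values are illustrative, not those of the paper.

```python
# One-class baselines on (stand-in) 100-dimensional SAE features.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 100))  # stand-in for reduced SAE features

models = {
    "SAE-OCSVM": OneClassSVM(kernel="rbf", nu=0.05),  # nu bounds the outlier fraction
    "SAE-IF": IsolationForest(random_state=0),
    "SAE-Elliptic": EllipticEnvelope(random_state=0),
}

preds = {}
for name, model in models.items():
    model.fit(features)
    preds[name] = model.predict(features)  # +1 = inlier, -1 = outlier
```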

F. Analysis of threshold strategy
In this section, we compare the proposed exponential cost function with the Euclidean cost function for the SVDD classifier on the WAF datasets. As mentioned, the Euclidean cost function provides no information about how the outputs are distributed inside the hyper-sphere, so no principled threshold can be derived from it. Thus, thresholds must be set roughly, based on the average, maximum, or a quantile of the anomaly scores. To obtain a reliable threshold, we design the exponential cost function, which forces the output data distribution toward a Gaussian with parameters c and σ. A threshold can then be estimated as a linear combination of the average and standard deviation of the cost function. Table IV compares the performance of the exponential cost function with that of the Euclidean cost function under various thresholds. For this purpose, we use R_exp = c + 3σ as the threshold for the proposed cost function, and define three thresholds for the Euclidean cost function: R_mean = mean{scores}, R_max = max{scores}, and R_q = q-quantile{scores}. We set q = 0.997 so that, as with R_exp, 99.7 percent of the normal data is used to calculate the threshold.
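The threshold variants compared above can be sketched on a synthetic set of anomaly scores; here the sample mean and standard deviation of the scores stand in for the c and σ parameters of the (assumed Gaussian) score distribution.

```python
# Threshold variants: R_mean, R_max, the 0.997-quantile R_q, and the
# proposed R_exp = c + 3*sigma analogue.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=1.0, scale=0.2, size=10_000)  # stand-in anomaly scores

r_mean = scores.mean()
r_max = scores.max()
r_q = np.quantile(scores, 0.997)
r_exp = scores.mean() + 3 * scores.std()  # c + 3*sigma analogue

# For Gaussian scores, R_q and R_exp both keep roughly 99.7% of the
# normal data below the threshold, while R_mean rejects about half of it.
```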
V. DISCUSSION

In this section, we interpret the experiments and explore their significance, focusing on the generalization of the different methods on the CSIC and ECML datasets.
To construct features from HTTP data for attack detection, we use the bigram and one-hot techniques, which model token sequences through the ordering of characters in a request. The main difference between the two is that the one-hot model depends on the length of the request. A specific deep model based on a convolutional neural network is used to extract rich features from the one-hot representation. Moreover, deep approaches are in practice among the best options for large, high-dimensional data.
Generalization is a crucial metric for anomaly detection methods such as intrusion detection systems [31]. To limit the impact of attack-like benign requests in the normal data, appropriate generalization is required. Figure 10 (a) shows a desirable trade-off between false positive and false negative rates. To achieve such generalization, the threshold value plays a key role, since it determines the boundary separating normal and anomalous data. When there is no information about how the outputs are distributed inside the hyper-sphere, we cannot set a threshold that yields a proper trade-off between false positive and false negative rates. For example, if the distribution is estimated in a skewed and torn shape, as in Fig. 10 (b), the threshold cannot produce a well-fitting model boundary.
To compare with other models, we contrast ATDSVDD with the stacked auto-encoder and one-class classifiers in terms of accuracy and generalization. For this purpose we examine F1, which combines precision and recall (detection rate), and AUC, which shows the performance of a model in terms of sensitivity (detection rate) against the false positive (1 - specificity) rate. Figure 11 compares the generalization of the methods by the F1 and AUC criteria, and also examines all models in terms of accuracy.
As we see, F1 and AUC show that the proposed model outperforms the others on all datasets, while its accuracy remains above 80%, higher than that of the other models. The results also reveal that SAE-RE achieved significant results on all data, while SAE-Elliptic performed well only on the ECML data.
As a result, the models based on ATDSVDD not only achieved significant results in terms of generalization and accuracy, but can also act as an end-to-end feature extractor, unlike the layer-by-layer feature extraction of the SAE.

VI. CONCLUSION
In this paper, we presented a comparison of unsupervised deep neural networks based on Deep SVDD and SAE-OC within the WAF framework. The proposed model was evaluated on two benchmark corpora, CSIC 2010 and ECML/PKDD 2007, with several evaluation measures that allow a comparison between different anomaly detection methods.
The results show that ATDSVDD achieves higher performance with respect to accuracy and generalization. We used character-level bigram and one-hot methods to construct the feature vector, and a CNN as a feature extractor for the one-hot method. Then, to learn the feature space from normal data, Deep SVDD classification extracts an appropriate representation. Finally, based on the exponential form of the normal distribution, we proposed a novel cost function that maps the feature space to a Gaussian space. Therefore, a suitable threshold around the normal data can be set as a linear combination of the average and standard deviation of the anomaly scores of HTTP requests.
In this work we dealt with two challenges: feature engineering and anomaly detection. For future work, feature extraction could be extended to capture semantic features of the HTTP request domain, or other methods, especially GAN and VAE models, could be applied to the detection of HTTP attacks. Incremental learning can also be considered to handle data streams. Nowadays, adversarial attacks on models are a great concern in the AI-based security field; therefore, model robustness to this type of attack should also be considered in future work.
d) Adversarially learned one-class classifier: In this strategy, during inference, the trained model is expected to accept in-class examples and reject out-of-class examples, whereas in other scenarios the trained model tries to accept only in-class examples.

Algorithm 1: ATDSVDD optimization algorithm
input: σ_0, c, W_0, number of epochs = epochs, batch size = m, size of training data = n, e = 1, i = 0
output: σ*, W*
Initialize α ← m/n
while e ≤ epochs do
    while (i + 1)m + i ≤ n do
        x_i ← i-th batch of training data
        Apply x_i on W_i:

Fig. 4. Distribution of the output data under different cost functions in two-dimensional space: (a) Euclidean cost function and (b) Gaussian cost function. The final projection of the data points in (a) is skewed, which makes it difficult to obtain a threshold in a spherical manner: a small threshold does not cover all normal points, while a larger one fails to fit them in its space. Conversely, with the Gaussian cost function the final projection of the data is mapped into a Gaussian space (because the model is forced to extract features that fit this cost function), which allows the threshold to be chosen appropriately.

Fig. 5. This figure shows the updating steps of the iterative stochastic algorithm.

accuracy = (true detected) / (number of all instances, N)
recall (DR) = (number of relevant attacks detected) / (number of relevant attacks)
precision (PR) = (number of relevant attacks detected) / (number of attacks detected)
specificity (Spec) = (number of relevant normals detected) / (number of relevant normals)
f1-measure = (2 × DR × PR) / (DR + PR)    (10)

Also, we use ROC curves, which provide further insight into the false positive and true positive rates at different score thresholds.
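The evaluation measures in Eq. (10) can be transcribed directly as a small helper over confusion-matrix counts (the counts in the usage line are invented for illustration):

```python
# Evaluation measures of Eq. (10), computed from confusion-matrix counts.
def evaluation_metrics(tp, fp, tn, fn):
    """tp/fp/tn/fn: true/false positives and negatives (attacks = positive)."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)        # detection rate (DR)
    precision = tp / (tp + fp)     # PR
    specificity = tn / (tn + fp)   # Spec
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, specificity, f1

acc, dr, pr, spec, f1 = evaluation_metrics(tp=90, fp=10, tn=880, fn=20)
```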

Fig. 6. The ROC curves of the CSIC dataset for the bigram and one-hot methods

Fig. 7. The ROC curves of the ECML dataset for the bigram and one-hot methods

Fig. 8. The ROC curves of the CSIC dataset for every method

Fig. 10. Generalization strategies for all possible HTTP requests. Normal (green region) illustrates the set of normal requests; cross points indicate different attacks. The yellow regions illustrate the estimated model for anomaly detection. (a) is the desired system, (b) is the over-generalized model, and (c) is the under-generalized model.

Fig. 11. Comparison of the accuracy and generalization of the methods

TABLE II
THE RESULTS OF THE PROPOSED METHOD BASED ON BIGRAM AND ONE-HOT

TABLE III
EVALUATION OF CSIC DATA WITH VARIOUS METHODS