Behavioral Malware Detection Using Deep Graph Convolutional Neural Networks

Introduction
According to a report published by AV-TEST [1], 9.74 million new malware specimens were released in September 2019 alone, bringing the total to 948 million known specimens in the wild. Dealing with the rapid increase of malware in number, complexity, and variability requires the research and development of new intelligent, automatic malware detection methods capable of scaling accordingly. There are two main approaches to detecting malware: static malware analysis and dynamic malware analysis [2]. On the one hand, static malware analysis can be conducted quickly by comparing a set of handcrafted features of the incoming file to previously observed malware features or signatures; this reliance on known signatures makes static analysis vulnerable to code obfuscation techniques employed by polymorphic and metamorphic malware [3], as well as to completely new malware specimens, or zero-days. Traditional signature-based malware detection methods are the cornerstone of the majority of commercial endpoint protection systems, since they are relatively fast and do not depend on any additional infrastructure to collect and analyze data; however, they require expert knowledge to reverse engineer malware instances and produce the features used for detection. Evidently, this approach does not scale as fast as malware production. On the other hand, dynamic malware analysis, or behavioral analysis, is based on behavioral data such as API or system calls, which are harder to obfuscate [4].
In order to collect dynamic analysis data, it is often necessary to run the program in a sandbox environment [5]. A sandbox provides a controlled and isolated environment for the guest program to run in while its activities are monitored and tracked. Once the data is collected and preprocessed, it can be used to feed behavioral detection algorithms [6]. Deep Learning algorithms have shown unprecedented success in various domains such as image classification, natural language processing, and speech recognition [7]. Following this trend, Deep Learning algorithms have also been applied to malware detection and classification tasks using static and dynamic analysis data, exploiting its temporal [8,9,10], spatial [11,12], or spatio-temporal [13,14] structure. In this work, we propose a novel behavioral malware detection method that exploits yet another structure of the dynamic analysis data: the graph structure of the API call sequences. To accomplish this task, our method builds on a state-of-the-art Deep Learning architecture designed for graph classification, the Deep Graph Convolutional Neural Network (DGCNN) [15]. Due to their capability of learning from non-Euclidean data such as graphs, Graph Neural Networks (GNNs) [16,17] can be applied to problems in a vast range of domains, from protein classification [18] to materials science [19]. By defining a graph structure to represent the API call sequence of a program, we combine both the spatial and temporal information from its behavior. Then, we introduce a simplified version of the DGCNN to learn high-level representations that can be used by a classifier to detect whether the program is malware. Experimental results show that the proposed method achieves AUC-ROC [20] and F1-Score similar to specialized Deep Learning architectures for sequence learning, such as LSTM networks [21], which are widely used as the base architecture for behavioral malware detection methods [22]. In particular, our models achieve higher AUC-ROC, F1-Score, Precision, Recall, and Accuracy than LSTM networks when trained and tested on a balanced dataset. To the best of our knowledge, this is the first paper to investigate the applicability of DGCNNs to behavioral malware detection using API call sequences. The rest of the paper is organized as follows. Related work is reviewed in Section 2. A brief background on DGCNNs is introduced in Section 3. The proposed method is detailed in Section 4. Performance evaluation is described in Section 5. Results, discussion, limitations, and future work are presented in Section 6. Finally, conclusions are drawn in Section 7.

Related Work
Classical Deep Learning algorithms such as Convolutional Neural Networks (CNNs) [23] and LSTM networks have been successfully applied to malware detection and classification problems using both static and dynamic analysis data [22]; however, Deep Learning on graphs has mainly been applied to data extracted using static analysis methods. [24] proposed a malware classification method (MAGIC) based on a modified version of the DGCNN to learn directly from attributed control flow graphs (ACFGs) extracted from disassembled binaries, in which each vertex summarizes code characteristics as numerical values. [25] introduced a malware detection approach using graph embedding to map the control flow graphs (CFGs) extracted from disassembled binaries to low-dimensional vectors, which serve as inputs to two stacked denoising autoencoders (SDAs) responsible for representation learning. [26] presented a method for Android malware detection that creates CFGs using data extracted from decompiled applications' source code and information from their manifest files. Then, a Graph Convolutional Network (GCN) [27] was used to learn high-level representations that could be used in detection tasks. [28] studied the effectiveness of DGCNNs in processing large-scale graphs with hundreds of thousands of nodes by conducting experiments on malware detection and software defect prediction. Our work follows an approach similar to [24] and shares the same theoretical basis on applying DGCNNs for classification tasks [15]. However, we take the standpoint of dynamic analysis by extracting behavioral graphs from the API call sequences and using both the API call sequences and the behavioral graphs as inputs to a modified version of the DGCNN. In addition, the standard LSTM network was chosen as a benchmark since it has been successfully applied as the base architecture for several behavioral malware detection and classification methods using API call sequence data [22].

Background on Deep Graph Convolutional Neural Networks
DGCNN is a state-of-the-art neural network architecture that can directly accept graphs of arbitrary structure to learn a graph classification function [15]. In other words, DGCNNs deal with the task of graph classification as opposed to node classification [27]. Let G be a directed graph of order n ∈ ℕ and A ∈ ℤ^(n×n) its associated adjacency matrix. Now, let us define the following: the augmented adjacency matrix of G, Ã = A + I_n, which ensures that the convolution operation takes into account the features of each node as well as its neighbors' features; the augmented diagonal degree matrix of G, D̃, with D̃_(i,i) = Σ_j Ã_(i,j), for row-wise normalization; the node feature matrix X ∈ ℝ^(n×c), c ∈ ℕ, where each row of X is a node "feature descriptor" and each column of X is a node "feature channel"; and the matrix of learning parameters W ∈ ℝ^(c×c′), where c′ ∈ ℕ is the number of output feature channels, with the non-linear activation function f : ℝ^(n×c′) → ℝ^(n×c′). Then, the graph convolution operation can be written as follows [15]:

Z = f(D̃⁻¹ÃXW).    (1)

The graph convolution operation defined by Equation 1 aggregates local substructure information by considering the nodes' immediate neighborhoods. In order to extract multi-scale substructure features, Equation 1 can be stacked to form a deep network, using the following recurrence relation [15]:

Z^(t+1) = f(D̃⁻¹ÃZ^(t)W^(t)),    (2)

where Z^(0) = X and W^(t) ∈ ℝ^(c_t×c_(t+1)). In summary, standard DGCNNs have four sequential steps [15]: 1) Graph convolutional layers generalize the convolution operation from Euclidean domains, or grid-like structures such as image data, to non-Euclidean domains such as graph data, by generating node representations as the aggregation of their own feature descriptors and their neighbors' feature descriptors. 2) The unordered graph data from each convolutional layer are concatenated along their feature channels (or columns), resulting in the matrix [Z^(1), ..., Z^(T)]. 3) A SortPooling layer sorts the unordered graph data according to the nodes' feature descriptors, which encode their structural roles. This step guarantees that the nodes of different graphs will be placed in similar positions, according to their weighted feature descriptors. 4) The ordered graph data is flattened and passed to a standard 1-dimensional CNN layer followed by a fully connected layer to learn a classification function. For a more comprehensive review, please refer to [15].
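The convolution of Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation; the activation function and the matrix dimensions below are arbitrary choices.

```python
import numpy as np

def graph_conv(A, X, W, f=np.tanh):
    """One graph convolution: Z = f(D~^-1 A~ X W), with A~ = A + I (Eq. 1)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))  # row-wise normalization
    return f(D_inv @ A_tilde @ X @ W)

# Tiny example: 3-node directed path graph, 2 input channels, 2 output channels.
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])
X = np.random.rand(3, 2)
W = np.random.rand(2, 2)
Z = graph_conv(A, X, W)
print(Z.shape)  # (3, 2): n nodes x c' output channels
```

Note that adding the identity guarantees every row of Ã sums to at least 1, so the row-wise normalization never divides by zero.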

Proposed Method
As illustrated in Figure 1, our method has eight sequential steps from data gathering to detection. First, Portable Executable (PE) files (1) are fed to a Cuckoo Sandbox [29] environment (2), which in turn runs the PE files and generates raw JSON reports containing dynamic analysis data such as API call sequences, generated traffic, and dropped files (3). Next, the API call sequences are extracted from the reports and post-processed in order to identify the API calls and convert them into ordinal categorical values (4). At this point, we have tracked the temporal behavioral information from the PE files and the ordered set of all possible API calls. Behavioral graphs are then generated based on both the API call sequences and the set of API calls (5), and both are passed to a graph convolutional layer to learn high-level representations of the spatial and temporal relations among the API calls (6). If multiple graph convolutional layers are stacked together to form a deep network, it is necessary to concatenate their results in order to consider multi-scale substructure features. Finally, the learned representations are passed to a fully connected layer (7), followed by a sigmoid layer (8) for binary classification. In the next sections, a more in-depth description of the method is presented.

Data Collection and Post-Processing
We introduced a new public domain dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences [30]. Our motivation was twofold. On the one hand, we were motivated by the lack of a public domain PE dynamic malware analysis dataset for training and evaluating our models. On the other hand, we were motivated by the desire to provide an open dataset that the research community could further utilize and extend. Malware samples were collected from VirusShare [31], and goodware samples were collected from both portableapps.com [32] and a 32-bit Windows 7 Ultimate directory. Both online-download and local goodware were included to increase the variability of the dataset and decrease its imbalance. In order to gather the API call sequences from each sample, we chose Cuckoo Sandbox, a widely used, open-source automated malware analysis system capable of monitoring process behavior while running in an isolated environment. Once the data was collected, three additional post-processing steps were performed.
1) Similar to [13], only the first 100 non-consecutive repeated API calls were considered, in order to avoid tracking loops.
2) Since in malware detection tasks it is important to recognize malicious patterns as early as possible, the sequences were extracted from the parent process only.
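Post-processing step 1 might be implemented along these lines. This is a sketch under the assumption that "non-consecutive repeated" means collapsing consecutive duplicates before truncating; the function name is illustrative.

```python
def postprocess(seq, max_len=100):
    """Collapse consecutive repeated API calls, then keep the first max_len."""
    collapsed = [seq[0]] if seq else []
    for call in seq[1:]:
        if call != collapsed[-1]:   # drop immediate repeats (loop tracking)
            collapsed.append(call)
    return collapsed[:max_len]

print(postprocess(["open", "open", "read", "read", "read", "close"]))
# ['open', 'read', 'close']
```

Truncating after collapsing keeps the earliest 100 distinct transitions, which matches the goal of recognizing malicious patterns as early as possible.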

API Call Sequences and Behavioral Graphs Generation
On the one hand, API call sequences represent the most important part of the program behavior through time [13]. On the other hand, graph structures encode spatial relations, such as adjacency and connectivity, between API calls. Our method leverages both temporal and spatial information for malware detection.
In order to accomplish that, it is necessary to extract the graph structure from the API call sequences to generate their associated behavioral graphs. A behavioral graph is a graph representation of a program run and, in particular, of its API call sequences [33]. More formally, let us define a behavioral graph as a 2-tuple directed graph G = (N, E), where N ⊆ ℕ is the ordered set of nodes representing the ordered set of API calls, E ⊆ N × N is the set of edges, and each edge e = (e_i, e_j) ∈ E corresponds to the temporal relation between the two consecutive API calls e_i and e_j, such that the order of execution is e_i followed by e_j. Let x = (x_0, x_1, ..., x_(L−1)), x_i ∈ N, be an API call sequence of length L ∈ ℕ. Then, the adjacency matrix A of the behavioral graph G associated to x is given by:

A_(i,j) = 1 if there exists k ∈ {0, ..., L−2} such that x_k = i and x_(k+1) = j, and A_(i,j) = 0 otherwise.    (3)

The above definition can be implemented as shown in Algorithm 1. As an example, let N = (0, 1, 2, 3) be the ordered set of API calls. Figure 2, step I, shows the behavioral graph G resulting from the adjacency matrix generated by Equation 3 applied to the API call sequence x = (0, 1, 2, 0, 2, 3).
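Algorithm 1 translates directly into NumPy; the sketch below reuses the example from the text, x = (0, 1, 2, 0, 2, 3) over N = (0, 1, 2, 3).

```python
import numpy as np

def seq2bgraph(x, n):
    """Algorithm 1 (Seq2BGraph): adjacency matrix of the behavioral graph of x."""
    A = np.zeros((n, n))
    for i in range(len(x) - 1):
        A[x[i], x[i + 1]] = 1.0   # edge from call x_i to its successor x_{i+1}
    return A

A = seq2bgraph([0, 1, 2, 0, 2, 3], 4)
print(A)
# [[0. 1. 1. 0.]
#  [0. 0. 1. 0.]
#  [1. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

The resulting edges are (0,1), (1,2), (2,0), (0,2), and (2,3), i.e., exactly the consecutive pairs of the sequence.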

Deep Graph Convolutional Neural Networks and Graph Convolutional Layers
In order to take advantage of the DGCNN architecture, let us define the node feature matrix X ∈ {0, 1}^(|N|×L) of G as the result of one-hot encoding each x_i in the API call sequence x. As an example of how the graph convolution operation acts on X, let us again consider the ordered set of API calls N = (0, 1, 2, 3), the API call sequence x = (0, 1, 2, 0, 2, 3), and the adjacency matrix A, generated by Algorithm 1, associated to the behavioral graph G of x, depicted in Figure 2. For the sake of clarity, let us take the product AX. As we can see, the resulting product AX is a matrix representing the API call sequence and, for each API call, its immediate indegree neighbors. Also, notice that the rows of AX represent ordered nodes, and the columns of X represent the behavior of the program in time, given by the API call sequence x. Moreover, since the nodes of G are already sorted by their natural order, our model does not require the SortPooling layer introduced in [15], thus reducing its execution time. Finally, the term D̃⁻¹ÃX is multiplied by the weight matrix W, allowing the model to learn higher-level representations. For example, let us consider only the product AXW with W ∈ ℝ^(6×2); then:
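The one-hot encoding of x and the products AX and D̃⁻¹ÃXW can be reproduced numerically. The sketch below uses a randomly initialized W, since the learned weights are not part of the text.

```python
import numpy as np

# One-hot encode x over |N| = 4 API calls: X in {0,1}^(|N| x L).
x = [0, 1, 2, 0, 2, 3]
n, L = 4, len(x)
X = np.zeros((n, L))
for t, call in enumerate(x):
    X[call, t] = 1.0              # row = API call id, column = time step

# Adjacency matrix of the behavioral graph (Algorithm 1, inlined).
A = np.zeros((n, n))
for i in range(L - 1):
    A[x[i], x[i + 1]] = 1.0

AX = A @ X                         # row v aggregates the time steps of v's neighbors
print(AX.shape)                    # (4, 6)

# Full convolution term D~^-1 A~ X W with a random W in R^(6x2).
A_tilde = A + np.eye(n)
D_inv = np.diag(1.0 / A_tilde.sum(axis=1))
W = np.random.rand(L, 2)
Z = np.tanh(D_inv @ A_tilde @ X @ W)
print(Z.shape)                     # (4, 2)
```

Each column of X sums to exactly 1 (one API call per time step), and Z has one row per node, so no SortPooling is needed: the rows already follow the natural order of N.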

Fully Connected Layers and Sigmoid Layer
The last graph convolutional layer outputs a matrix Z^(T) ∈ ℝ^(|N|×c_T), T ∈ ℕ. Finally, in order to train a fully connected layer on it, it is first necessary to concatenate Z^(1), ..., Z^(T) along their columns and then reshape the result into a row vector z ∈ ℝ^(|N| Σ_(t=1)^(T) c_t). As an example, let us consider Equation 5. Notice that, after reshaping, the high-level features are grouped by the natural order of the nodes and can be passed to a fully connected layer followed by a sigmoid layer for binary classification. Figure 2, step III, depicts the high-level features grouped by the natural order of the nodes.

The Method
In summary, without considering the data collection and post-processing steps, our method can be implemented using Algorithm 2. According to the principles of Deep Learning [7], Algorithm 2 can be extended by stacking graph convolutional layers or fully connected layers, followed by a sigmoid layer for binary classification or a softmax layer for multi-class classification. Furthermore, we included a Dropout [34] layer after each graph convolutional layer in order to prevent overfitting, and used ReLU [35] as the activation function to perform non-linear transformations while preventing the vanishing gradient problem.

Algorithm 3: LSTM
input: API call sequence x
output: the probability that x was generated by a malware
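A 1-layer variant of the method (graph convolution with ReLU and dropout, then a fully connected layer and a sigmoid) might be sketched as follows in PyTorch. The class name, layer sizes, and dropout rate are illustrative assumptions, not the paper's tuned configuration.

```python
import torch
import torch.nn as nn

class BehavioralDGCNN(nn.Module):
    """Sketch of Algorithm 2: one graph conv layer + ReLU + dropout,
    then a fully connected layer and a sigmoid for binary classification."""
    def __init__(self, n_api, seq_len, hidden=8, p_drop=0.5):
        super().__init__()
        self.W = nn.Linear(seq_len, hidden, bias=False)   # graph conv weights
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_api * hidden, 1)

    def forward(self, A, X):
        A_tilde = A + torch.eye(A.size(0))                # add self-loops
        D_inv = torch.diag(1.0 / A_tilde.sum(dim=1))      # row-wise normalization
        Z = torch.relu(self.W(D_inv @ A_tilde @ X))       # Eq. 1 with ReLU
        Z = self.drop(Z)
        z = Z.flatten()                                   # row vector, natural node order
        return torch.sigmoid(self.fc(z))

# Toy forward pass: |N| = 4 API calls, sequence length 6 (example from the text).
A = torch.zeros(4, 4)
A[0, 1] = A[1, 2] = A[2, 0] = A[0, 2] = A[2, 3] = 1.0
X = torch.zeros(4, 6)
for t, c in enumerate([0, 1, 2, 0, 2, 3]):
    X[c, t] = 1.0
model = BehavioralDGCNN(n_api=4, seq_len=6)
model.eval()                                              # disable dropout
p = model(A, X)
print(float(p))                                           # probability of malware
```

A deeper model would stack several such convolutions and concatenate their outputs along the feature channels before flattening, as described in Section 3.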

Performance Evaluation
The goal of our experiments was twofold. First, to measure the performance of our method in detecting malware considering both a balanced dataset and the original imbalanced dataset of API call sequences. Second, to establish a fair performance comparison between our models and LSTM networks on the same task. In order to accomplish that, let us consider the dataset introduced in Section 4.1 and the following models: Model-1, a 1-layer DGCNN, according to Algorithm 2; Model-2, a 2-layer DGCNN, according to Algorithm 2; and LSTM, an LSTM network followed by a fully connected layer that receives the last LSTM hidden state vector and outputs the result to a sigmoid layer for binary classification, according to Algorithm 3. Two experiments were performed for model selection, training, and evaluation: Experiment 5.1 and Experiment 5.2. In total, 1,296 models were defined, trained, and evaluated, resulting in 6 optimized models for malware detection using API call sequences. The total running time of the hyperparameter optimization was about 60 hours on an Intel Xeon D-1540 (8 cores, 16 threads, 2.6 GHz), 64 GB RAM, a 2 TB SSD, and an NVIDIA RTX 2080 Ti GPU. We used Jupyter Notebook [36], PyTorch [37], Pandas [38], scikit-learn [39], imbalanced-learn [40], skorch [41], and Matplotlib [42] for the implementations, which are available at [43].

Experiment 1
In the first experiment, a random undersampling of the majority class (malware) was first performed to obtain a balanced dataset containing 1,079 goodware samples and 1,079 malware samples, followed by a stratified train-test split where the size of the test set was 0.3 of the total, resulting in a training set of 738 goodware samples and 772 malware samples, and a test set of 341 goodware samples and 307 malware samples. Next, to perform model selection (i.e., hyperparameter optimization), an exhaustive grid search was executed on the training set using stratified 5-fold cross-validation with AUC-ROC as the validation metric. In an exhaustive grid search, the model is trained and evaluated with all combinations of hyperparameters. Stratified k-fold cross-validation ensures that each training split contains a similar proportion of positive and negative samples. The model is trained with k − 1 folds, and its performance is then evaluated using the fold that was left out of the training process. This process is repeated k times, each time leaving out a different fold. The average of the evaluation performances is an estimate of the model's performance on unseen data. The number of folds was set to 5, considering the computational resources needed to perform the grid search and the degree of bias and variance of the resulting metric [44]. In addition, AUC-ROC was chosen as the validation metric because it provides a summary of performance across all possible classification thresholds and can be used with balanced and imbalanced datasets [45].

Table 1: Hyperparameter optimization search space. W^(0) and W^(1) are the sizes of the output channels of each layer for Model-1 or Model-2. H is the size of the hidden layer for the LSTM network.

Taking into account that we desired to compare conceptually different models, the following hyperparameters were considered for the grid search: number of parameters per layer, dropout rate, mini-batch size, and number of epochs. In order to define the sets of parameters per layer for the search, we first defined 4 possible values for the size of the LSTM hidden layer; then, considering Algorithm 2, it is possible to show that the total number of parameters of a 2-layer DGCNN with one fully connected layer followed by a sigmoid layer is given by Equation 7, where P_i ∈ ℕ is the number of parameters of layer i. By using Equation 7, it was possible to build sets of numbers of parameters per layer where the maximum number of parameters of each DGCNN model is less than the maximum number of parameters of the LSTM network, thus ensuring that all models are limited to a maximum complexity regarding the number of parameters while having the same degrees of freedom for tuning. Finally, the Binary Cross-Entropy function was set as the loss function, and the Adam optimizer [46] was used with a learning rate of 0.001, beta1 of 0.9, and beta2 of 0.999. Table 1 summarizes the hyperparameter optimization search space. Once the best sets of hyperparameters were found, the models were trained on the entire training set using those hyperparameters and tested against the test set.
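The model-selection procedure (exhaustive grid search, stratified 5-fold cross-validation, AUC-ROC scoring) can be illustrated with scikit-learn. The estimator, data, and parameter grid below are stand-ins, since the paper tunes DGCNN and LSTM models through skorch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Stand-in data: 400 synthetic samples in place of the API-call dataset.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Stratified 70/30 train-test split, as in Experiment 5.1.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),     # stand-in estimator
    param_grid={"C": [0.01, 0.1, 1.0]},    # exhaustive grid over hyperparameters
    scoring="roc_auc",                     # AUC-ROC as the validation metric
    cv=StratifiedKFold(n_splits=5),        # stratified 5-fold cross-validation
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.best_score_, 3))
```

After the search, the best estimator is refit on the entire training set (GridSearchCV does this by default) and can then be evaluated on the held-out test set.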

Experiment 2
In the second experiment, the original imbalanced dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences was considered, without undersampling. Then, the same procedures of Experiment 5.1 were followed.
Results, Discussion, Limitations, and Future Work

Our results show that our method is indeed applicable to the problem of behavioral malware detection using API call sequences, and that its performance is similar to that of LSTM networks, which are widely used as the base architecture for behavioral malware detection methods, thus indicating that DGCNNs can also be useful as the base architecture for more complex and deeper models for the same purpose. Figure 3 depicts the ROC curves of Model-1, Model-2, and the LSTM network for the balanced and imbalanced cases.

Balanced Dataset
As we can see in Table 2, our models achieve the highest AUC-ROC, F1-score, precision, recall, and accuracy.
A particularly important performance metric when evaluating malware detectors is recall. In balanced datasets, high recall implies a low number of false negatives, which can be interpreted as a high malware detection rate. High precision implies a low number of false positives, which is less critical but still desirable for malware detectors. Ideally, both recall and precision should be high, implying a high F1-score. Finally, high accuracy implies a high number of correct overall predictions.
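All of these metrics can be computed with scikit-learn. The labels and scores below are illustrative, not results from the paper; 1 denotes malware and predictions use a 0.5 threshold.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true   = [1, 1, 1, 0, 0, 0, 1, 0]                     # 1 = malware
y_scores = [0.9, 0.8, 0.4, 0.2, 0.3, 0.6, 0.7, 0.1]     # illustrative probabilities
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]     # threshold at 0.5

print("recall   :", recall_score(y_true, y_pred))       # detection rate
print("precision:", precision_score(y_true, y_pred))    # low false-positive tendency
print("f1       :", f1_score(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("auc-roc  :", roc_auc_score(y_true, y_scores))    # threshold-independent
```

Note that AUC-ROC is computed from the raw scores, not the thresholded predictions, which is why it summarizes performance across all possible thresholds.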

Imbalanced Dataset
As we can see in Table 3, LSTM networks achieve the best results, followed by Model-2 and Model-1, respectively; however, notice that our models are capable of learning a classification function using considerably fewer parameters and epochs. AUC-ROC is the most reliable metric in this scenario [45], since even the Dummy detector achieves a relatively high F1-score and, consequently, high recall and precision.
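The point about the Dummy detector is easy to reproduce: on an imbalanced set, predicting everything as malware yields a high F1-score but only chance-level AUC-ROC. The label counts below are synthetic, chosen only to mimic the direction of the imbalance.

```python
from sklearn.metrics import f1_score, roc_auc_score

# Far more malware (1) than goodware (0), as in the original dataset.
y_true  = [1] * 95 + [0] * 5
y_dummy = [1] * 100                 # Dummy detector: everything is malware

print("dummy F1     :", round(f1_score(y_true, y_dummy), 3))  # misleadingly high
print("dummy AUC-ROC:", roc_auc_score(y_true, [1.0] * 100))   # chance level: 0.5
```

This is why F1-score alone can overstate a detector's quality on imbalanced data, while AUC-ROC exposes the lack of discriminative power.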

General Considerations
In general, our models achieved performance similar to LSTM networks on the proposed task. As Tables 4 and 5 show, the dropout rates selected for Model-1 and Model-2 are the highest, in contrast to their numbers of parameters.
In fact, our models overfitted the training set after just ten epochs on average, indicating that additional dropout layers or L2 regularization [47], as well as the addition of more examples, could further improve their performance. We plan to investigate deeper architectures in future work. The addition of one graph convolutional layer did not greatly improve the models' performance. Further improvement could be possible by introducing a larger dataset containing more goodware samples to reduce the imbalance. Although in this work we considered only the task of malware detection, or binary classification, DGCNNs can also be applied to the problem of multiclass classification [15], suggesting that our method can be extended to multiclass behavioral malware classification. Future work will investigate this possibility. In addition, notice that our work only took into account one kind of execution trace, the API call sequences. In future work, we plan to augment our datasets to include other dynamic traces such as API call parameters, return values, and execution status.

Visualization
In an attempt to visualize the inner workings of the models, we applied Principal Component Analysis (PCA) [46] to the sets of activations in the hidden layer preceding the fully connected layer during the evaluation phases. Deeper layers should contain high-level features that are separable into classes by the fully connected layer.
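The projection step might look as follows with scikit-learn. The activation matrix here is random stand-in data, not the models' actual hidden-layer activations.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in hidden-layer activations (rows = samples, cols = hidden units).
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 32))

pca = PCA(n_components=2)
proj = pca.fit_transform(acts)   # 2-D projection suitable for a scatter plot
print(proj.shape)                # (200, 2)
```

Coloring each projected point by its true label (malware or goodware) then produces plots like Figure 4.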

Conclusion
In this paper, we propose a novel behavioral malware detection method based on DGCNNs to learn directly from API call sequences. In order to train, evaluate, and test the models, we introduced a new public domain dynamic analysis dataset of more than 40k API call sequences of malware and goodware. By extracting behavioral graphs from the API call sequences and using both as inputs for a modified version of the DGCNN, our method achieves performance similar to LSTM networks, which are largely applied as the base architecture for malware detection and classification methods using dynamic analysis data, including API call sequences, thus indicating that our method can also be useful as the base architecture for more complex and deeper models for the same purpose. Even though DGCNNs are memory-less networks, as opposed to LSTM networks, our results show that the graph structure of the API call sequences plays an essential role in the problem of detecting whether a program is malware. Future work will explore deeper architectures as well as the problem of multiclass malware classification using API call sequences and their associated behavioral graphs.

Figure 1 :
Figure 1: High-level flow of the proposed method.

Algorithm 1 :
Seq2BGraph
input: API call sequence x of length L ∈ ℕ; cardinality of N, |N| ∈ ℕ
output: adjacency matrix A of the behavioral graph G associated to x

Let A be a new zero-initialized |N| × |N| matrix
for i ← 0 to L − 2 do
    A[x_i, x_(i+1)] ← 1
end

Figure 2 :
Figure 2: Behavioral graphs generation and graph convolution operations. 0*, ..., 3* represent the high-level features grouped by the natural order of the nodes.

Algorithm 2 :
The Model
input: API call sequence x; cardinality of N, |N|
output: the probability that x was generated by a malware

Ã ← Seq2BGraph(x, |N|)

Figure 4(a) shows the result of PCA applied to the test set, Figure 4(b) shows the result of PCA applied to the LSTM network, and Figures 4(c) and (d) show the PCA visualizations for Model-1 and Model-2, respectively. Intuitively, LSTM networks separated the classes into two large groups, while Model-1 and Model-2 separated the classes into several smaller groups. Taking that into account, it is interesting to consider how a DGCNN-based behavioral malware classification method would behave in a multiclass classification problem. We plan to investigate that phenomenon in future work.

Figure 4 :
Figure 4: Visualization using Principal Component Analysis. The red dots represent malware instances, and the green dots represent goodware instances.
3) We built the list of unique API calls, considering all the samples, and then converted each API call name into a unique integer identifier equal to the index of the API call name in the list. As a result, 307 distinct API calls were identified. We produced a dataset where the first column contains the MD5 hash of the sample, the next 100 columns contain ordinal categorical values between 0 and 306 representing the API call sequence of the sample, and the last column contains the label of the sample: 0 for goodware and 1 for malware. The total running time to collect the data was about 3,000 hours, resulting in approximately 50,000 Cuckoo JSON report files and 1.5 TB of raw data. Our Cuckoo sandbox environment was based on an Intel Xeon D-1540 (8 cores, 16 threads, 2.6 GHz), 64 GB RAM, and a 2 TB SSD running Ubuntu Server 16.04 as the Cuckoo host, with 8 32-bit Windows 7 Ultimate VirtualBox virtual machines running in parallel as Cuckoo analysis guests.
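Post-processing step 3 might be implemented along these lines; the API call names below are illustrative examples, not the dataset's actual vocabulary.

```python
def build_vocab(samples):
    """Map each API call name to an ordinal integer id, in first-seen order.
    samples: list of API call sequences (lists of call names)."""
    vocab = {}
    for seq in samples:
        for name in seq:
            if name not in vocab:
                vocab[name] = len(vocab)   # id = index in the list of unique calls
    return vocab

samples = [["NtOpenFile", "NtReadFile", "NtClose"],
           ["NtOpenFile", "NtWriteFile"]]
vocab = build_vocab(samples)
encoded = [[vocab[c] for c in seq] for seq in samples]
print(vocab)    # {'NtOpenFile': 0, 'NtReadFile': 1, 'NtClose': 2, 'NtWriteFile': 3}
print(encoded)  # [[0, 1, 2], [0, 3]]
```

Applied over all samples, such a mapping yields the 307 distinct identifiers (0 to 306) stored in the dataset's sequence columns.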

Table 2 :
Performance metrics on the balanced dataset. The Dummy detector predicts all the samples as malware.

Table 3 :
Performance metrics on the original (imbalanced) dataset. The Dummy detector predicts all the samples as malware.

Table 4 :
Model selection results on the balanced dataset

Table 5 :
Model selection results on the original (imbalanced) dataset