CHIMERA: AN ANDROID MALWARE DETECTION METHOD BASED ON MULTIMODAL DEEP LEARNING AND HYBRID ANALYSIS

detection Accuracy, Precision, Recall, and ROC AUC outperform classical ML algorithms, state-of-the-art Ensemble and Voting Ensemble ML methods, as well as unimodal DL methods using CNNs, DNNs, TNs, and Long Short-Term Memory (LSTM) networks. To the best of our knowledge, this is the first work that successfully applies multimodal DL to combine those three different modalities of data using DNNs, CNNs, and TNs to learn a shared representation that can be used in Android malware detection tasks.


Introduction
Malware, or malicious software, is any software intentionally designed to cause harm to a computer, user, or network [1]. Some malware examples include, but are not limited to, adware, backdoors, ransomware, rootkits, trojan horses, viruses, and worms. Each kind (or family) of malware shares common properties regarding the exploitation techniques used in the victim's environment. In general, Android malware can be characterized by its malicious payload's main targets: financial charges, personal information theft, privilege escalation, and remote control [2]. Malware poses a significant threat to both end-users and corporations. While end-users might have their personal and financial data stolen or encrypted by ransomware, corporations might have their security perimeters breached by backdoors and rootkits. The development of defensive mechanisms against malware depends on knowing how malware works. In order to understand malware internals and behavior, one can make use of Malware Analysis techniques.
Malware Analysis is a set of techniques used to dissect malware and understand how it works in order to identify and defeat it [1]. It is based on a subset of techniques known as Static Analysis, Dynamic Analysis, and Hybrid Analysis [3]. Static Analysis provides a set of tools and techniques to understand how malware works without executing it [3]. The main advantage of Static Analysis is that there is no need to execute the malicious file to analyze it. Specialized tools such as decompilers and disassemblers make the information extraction process fast and straightforward. One disadvantage of Static Analysis is related to malware code obfuscation. When the malicious code is obfuscated or encrypted, direct analysis of the disassembled code is practically impossible. In this scenario, it is helpful to make use of Dynamic Analysis. Dynamic Analysis provides a set of tools and techniques to understand how malware works by executing it in a controlled, isolated environment known as a sandbox [3]. By collecting information from the malware execution traces, such as system calls, API calls, network activity, and file system activity, it is possible to draw a profile of the application based on its actual behavior. The main advantage of Dynamic Analysis is that it is immune to code obfuscation. Its disadvantages include low code coverage and high time consumption. Hybrid Analysis combines Static and Dynamic Analysis, taking advantage of the resources of both to understand malware more effectively.
The information gathered using Malware Analysis can be leveraged for malware detection and classification tasks [4]. The most common, fastest, and simplest way of detecting malware is using signature-based methods [4]. Signature-based methods rely on the extraction of malicious patterns from known malware using Malware Analysis techniques. Once the malicious patterns are collected, their presence or absence can be quickly verified in suspicious files. Some disadvantages of signature-based methods are their inefficiency in detecting polymorphic, metamorphic [5], and zero-day malware, and their high rate of false positives. Moreover, since polymorphic and metamorphic malware can change their code implementations at runtime, the malicious patterns known by signature-based methods can also be changed in the process, increasing the number of false negatives.
In order to increase the accuracy of malware detection and classification methods, several ML and DL methods have been proposed. For a comprehensive review, please refer to [6,7]. The main advantage of ML malware detection methods is their capability of learning malicious patterns from data (e.g., real-world malware samples); however, ML techniques frequently require manual feature engineering in order to achieve higher accuracy, which usually demands a highly specialized workforce and is time-consuming. DL malware detection methods leverage specialized architectures designed for image processing, speech recognition, sequence learning, and so on [8]. The main advantage of DL methods is their capability of performing automatic feature learning from structured and non-structured data of different domains, thus decreasing the work associated with manual feature engineering [8]. In fact, DL methods have achieved state-of-the-art results in image recognition, speech recognition, and natural language processing tasks [8]. The main disadvantage of DL methods is the requirement of large volumes of data and processing power to achieve higher accuracy.
More recently, Android malware detection methods using multimodal DL have also been proposed [9,10,11,12,13]. Multimodal DL uses independent, specialized DL subnetworks to extract high-level feature representations from different data modalities and combines the resulting embeddings into a shared representation that can be used for classification and regression tasks [14]. For example, in multimodal DL, it is possible to combine both audio and video data for a classification task and achieve better accuracy than using audio and video independently in a unimodal architecture for the same task [14]. In this work, we propose Chimera, a new Android malware detection method based on multimodal DL and Hybrid Analysis. As we can see in Figure 1, Chimera is composed of 3 independent DL subnetworks: (1) Chimera-Static (Chimera-S), a DNN [8] to learn high-level feature representations from Android Intents & Permissions using an early fusion layer; (2) Chimera-Raw (Chimera-R), a CNN [15] to learn high-level feature representations from raw data transformed into DEX grayscale images; and (3) Chimera-Dynamic (Chimera-D), a TN encoder [16] to learn high-level feature representations from system call sequences. Finally, the intermediate fusion network is responsible for implementing shared high-level feature representations using each subnetwork's internal representations and learning correlations that can be leveraged for Android malware detection. Our experiments' results indicate that Chimera's detection Accuracy, Precision, Recall, and ROC AUC outperform classical ML algorithms, state-of-the-art Ensemble and Voting Ensemble ML methods, as well as unimodal DL methods using CNNs, DNNs, TNs, and LSTMs.
Figure 1: Chimera Android malware detection method architecture.

To the best of our knowledge, this is the first work that uses a combination of raw data, static analysis data, and dynamic analysis data inputs for a multimodal DL method based on CNNs, DNNs, and TNs to learn shared representations that can be applied to Android malware detection. The rest of this paper is organized as follows: Section 2 details the proposed method and the methodology adopted in the research. Section 3 provides the evaluation methods,
metrics, results, and performance comparisons between our method and different DL, ML, Ensemble ML, and Voting Ensemble ML methods, as well as a discussion of the obtained results, limitations, and future work. Section 4 outlines the related work. Lastly, Section 5 summarizes the results and impact of our research.

Proposed Method and Methodology
This work follows the Knowledge Discovery in Databases (KDD) process [17] and Supervised ML methodology [18] for model selection, training, and evaluation. Supervised ML performs the task of learning a function that maps an input to an output using input-output pairs to adjust the model's parameters. Therefore, in Supervised ML, it is essential to define a labeled training set containing a representative number of instances of each class to promote learning, i.e., low generalization error on the evaluation/test set following the same training set distribution [18]. KDD is an iterative, interactive, and non-trivial process composed of several stages for extracting (useful) patterns from large databases. Figure 2 depicts the KDD process implementation for Chimera. Each KDD stage, i.e., Selection, Preprocessing, Transformation, Data Mining, and Interpretation, is represented by a blue box containing the implementation steps (white boxes) performed in that particular stage. Notice that the KDD stages Selection, Preprocessing, and Transformation contain implementation steps related to data preparation, and the Data Mining and Interpretation stages contain implementation steps related to ML methodology such as model selection, training, and  evaluation. Since Chimera is a multimodal method, each DL subnetwork (Chimera-S, Chimera-R, and Chimera-D) is responsible for feature extraction from a different data source. Therefore, the Selection, Preprocessing, Transformation, and Data Mining stages are performed independently for each DL subnetwork. Moreover, an additional Data Mining implementation step is executed over the intermediate fusion layer built using high-level feature representations learned by each DL subnetwork. Finally, the Interpretation stage is performed by the last Chimera's DNN classifier layer, resulting in a probability distribution used for binary classification or, more concretely, for Android malware detection.
In the context of the KDD process, the knowledge produced by our method can be summarized by its generalization performance, i.e., the detection accuracy resulting from 10-fold cross-validation. The following sections detail the implementation of the KDD process and ML methodology for Chimera.

Selection, Preprocessing and Transformation
In this work, we used the Android benchmark dataset Omnidroid, introduced by [19]. Omnidroid is a balanced dataset composed of pre-static, static, and dynamic analysis information extracted from 22,000 real malware and benign Android applications. After applying several data preparation and consolidation techniques [19], it presents the information in a structured format (Comma-Separated Values (CSV) and JavaScript Object Notation (JSON) files). However, it is relevant to note that the Omnidroid dataset does not include the APK (Android Application Package) files used to extract the information due to legal restrictions. Since Chimera-R and Chimera use data from the DEX files, which are part of the APK files, we also downloaded the APK files from the online malware repositories Androzoo, using its free account access, and Koodous, using its premium account access. As we can see in Figure 2, in the KDD Selection stage, we used the SHA256 hashes present in Omnidroid to download the associated APKs from the online repositories.

Chimera-S
Inspired by the work presented in [20,21,22], Chimera-S and Chimera implement an early fusion layer to combine both Android Intents and Android Permissions for malware detection. Android Intents and Android Permissions play an essential role in the Android security architecture by controlling the actions that applications can perform on the OS and the communication between applications. Moreover, as shown by [20], Android Intents and Android Permissions present high discriminative power and low correlation. Thus, features designed using Android Intents and Permissions have high predictive quality. Omnidroid includes the set of Android Intents and Android Permissions for each application. We extracted the top-100 Android Intents and the top-100 Android Permissions from Omnidroid's JSON files, concatenated them into a 200-dimensional feature vector, and saved the result into a CSV file using binary encoding to indicate the presence or absence of a particular Android Intent or Android Permission for each instance. Finally, the Transformation stage was performed by applying a Standardization procedure [18] to set the mean value of each feature to 0 and its standard deviation to 1. This procedure is used to speed up training convergence. Figure 2 summarizes the implementation steps of the KDD process stages Selection, Preprocessing, and Transformation for Chimera-S and Chimera.

Figure 4: System call sequences of two benign applications in the first column and two Trojan malware instances in the second column, including the SHA256 hash of each instance. The x-axis represents the time step; the y-axis represents the system call number.
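As an illustration, the binary encoding and Standardization steps described above can be sketched in Python. This is a minimal sketch: `encode_static_features` and `standardize_column` are hypothetical helper names, and the toy Intent/Permission lists stand in for Omnidroid's top-100 lists.

```python
def encode_static_features(app_intents, app_permissions, top_intents, top_permissions):
    """Early fusion: concatenate binary presence vectors for the top Intents and
    top Permissions into one feature vector (200-dim in the paper: 100 + 100)."""
    intent_vec = [1.0 if intent in app_intents else 0.0 for intent in top_intents]
    perm_vec = [1.0 if perm in app_permissions else 0.0 for perm in top_permissions]
    return intent_vec + perm_vec


def standardize_column(values):
    """Standardization procedure: zero mean and unit standard deviation per feature."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in values]
```

In the real pipeline each of the 200 columns of the CSV would be standardized independently across all instances.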

Chimera-R
DEX files contain the executable code in Dalvik format. Similar to the Java Virtual Machine (JVM), the Dalvik Virtual Machine translates DEX opcodes (bytecodes) into native CPU instructions. The DEX format provides a compact and optimized executable module [23]. The methods proposed by [24,25] make use of DEX bytecodes as images for malware detection and classification tasks. Figure 3 depicts grayscale images representing two benign Android applications and two Android malware instances. Motivated by their work, Chimera-R and Chimera also use data from the DEX files for Android malware detection to perform automatic feature extraction from DEX grayscale images using CNNs. Since DEX files contain non-structured data, we extracted the DEX files from each APK file and saved their content into a NoSQL database. Finally, the Transformation stage was performed by resampling the data to image representations of 1x128x128 pixels (channel, width, height) using the Lanczos resampling algorithm [26], and by applying a Scaling procedure [18] in order to set the values of the features to the same scale (between 0 and 1), and a Standardization procedure [18] in order to set the mean value of the grayscale channel to 0 and its standard deviation to 1. Scaling and Standardization procedures are used to speed up training convergence. We chose the image dimensions to be 1x128x128 since preliminary experiments indicated that smaller images promoted underfitting, while larger images consumed four times more resources (RAM, GPU memory, and processing power) without presenting significant performance improvements. The Lanczos resampling algorithm was chosen because it is considered to present the best compromise for image resampling tasks [26]. Figure 2 summarizes the implementation steps of the KDD process stages Selection, Preprocessing, and Transformation for Chimera-R and Chimera.
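A minimal sketch of the DEX-to-image transformation described above, assuming NumPy and Pillow are available. The `dex_to_image` helper name and the square-reshape strategy are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from PIL import Image


def dex_to_image(dex_bytes, size=128):
    """Interpret raw DEX bytes as a square grayscale image, Lanczos-resample it
    to size x size, scale values to [0, 1], then standardize the channel."""
    side = int(len(dex_bytes) ** 0.5)  # largest square that fits the byte stream
    arr = np.frombuffer(dex_bytes[: side * side], dtype=np.uint8).reshape(side, side)
    img = Image.fromarray(arr, mode="L").resize((size, size), Image.LANCZOS)
    scaled = np.asarray(img, dtype=np.float32) / 255.0        # Scaling procedure
    return (scaled - scaled.mean()) / (scaled.std() + 1e-8)   # Standardization procedure
```

The resulting 1x128x128 array (grayscale channel implicit) can then be fed to the CNN subnetwork.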

Chimera-D
Based on the work introduced by [27], Chimera-D and Chimera also depend on data collected using Dynamic Analysis, particularly system call sequences. System call sequences represent the application behavior through time. More specifically, system call sequences represent the application's interaction with the hardware by calling low-level functions exposed by the OS. Figure 4 presents the system call sequences of two benign Android applications and two Android malware instances, each containing 100 time steps. Omnidroid includes the system call sequences logged by the strace tool and saved into CSV files. To reduce noise and avoid loops, we removed all the consecutive repeating system calls. Then, we trimmed the sequences to 400 time steps since the smallest resulting sequence after removing the consecutive repeating system calls contained 415 time steps. The information was extracted from Omnidroid's CSV files and saved into a CSV file using an integer encoding where each number is associated with a system call. In total, 124 unique system calls were identified. Finally, the Transformation stage is performed by converting the integer-encoded feature to its one-hot encoding representation [18] during the training and evaluation processes. That is a necessary step since each system call should be treated as a categorical feature, converted to its one-hot encoding representation, and then used as input to Chimera-D, which is based on a TN encoder. Figure 2 summarizes the implementation steps of the KDD process stages Selection, Preprocessing, and Transformation for Chimera-D and Chimera.

Figure 5: Chimera's Android malware detection method DL subnetworks for Static Analysis data (Chimera-S), DEX grayscale images (Chimera-R), and Dynamic Analysis data (Chimera-D).
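The preprocessing steps described above (consecutive-repeat removal, trimming, integer encoding, and one-hot conversion) can be sketched as follows. `preprocess_syscalls` is a hypothetical helper, and the toy vocabulary stands in for the 124 real system calls.

```python
def preprocess_syscalls(sequence, vocab, max_len=400):
    """Remove consecutive repeating system calls, trim to max_len time steps,
    integer-encode using vocab, and return the one-hot representation."""
    deduped = [s for i, s in enumerate(sequence) if i == 0 or s != sequence[i - 1]]
    encoded = [vocab[s] for s in deduped[:max_len]]
    n = len(vocab)  # 124 unique system calls in the paper
    return [[1 if j == idx else 0 for j in range(n)] for idx in encoded]
```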

Data Mining and Interpretation
Supervised DL is composed of several techniques for data mining of structured and non-structured data and can be used for classification and regression tasks [28]. DL also provides several techniques to mitigate underfitting and overfitting. Overfitting is a common problem found when building and evaluating DL models since, by definition, DL models are composed of multiple layers and a large number of parameters. Overfitting takes place when a model loses its generalization power. More precisely, the training error reaches a low value, meaning that the network was able to fit the training set accurately; however, at the same time, the evaluation error reaches a high value, meaning that the network was not able to generalize or learn the required patterns from the training set. Commonly used approaches to mitigating overfitting are increasing the dataset's size and variability, decreasing the model's complexity (layers and the total number of parameters), or using specialized DL techniques designed for data standardization in the hidden layers and reduction of the model's complexity. The most common techniques used for mitigating overfitting are Dropout [29] and Batch Normalization [30]. Dropout is used as a regularization mechanism for reducing overfitting by randomly zeroing out the activations' values to prevent complex co-adaptations on the training data, resulting in the thinning of the model's weights by the Backpropagation algorithm [29]. Batch Normalization is used to mitigate internal covariate shift, i.e., the change in the distribution of the network's activation values during training, which causes training instability and slow convergence. Our experiments indicated that both Batch Normalization and Dropout layers play an important role in mitigating overfitting and improving training stability in the fully connected layers of Chimera-S, Chimera-R, Chimera-D, and Chimera, and that Batch Normalization plays a similar role for the convolutional layers of Chimera-R and Chimera.
Finally, we chose to apply by default the Rectified Linear Unit (ReLU) activation function [31] to introduce non-linearity, help mitigate the vanishing gradient problem, and speed up training convergence.
Taking into account that Chimera-S, Chimera-R, Chimera-D, and Chimera are binary classifiers that can be extended to multiclass classifiers in future work, we included the Softmax activation function after the output layer to encode the high-level feature representations into a probability distribution that can be used for binary classification [28]. Consequently, the Cross-Entropy loss function [28] was introduced to quantify the training and evaluation errors during the learning process. Finally, we chose the Adaptive Moment Estimation (Adam) optimizer [32] to train the models. Adam is a state-of-the-art adaptive learning rate optimization algorithm designed for training DL networks. Adam leverages both momentum and learning rate adaptation to accelerate convergence and avoid local minima and plateaus.
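A minimal numerical sketch of the Softmax and Cross-Entropy computations described above (DL frameworks implement fused, numerically stable versions of these, so this is only for illustration):

```python
import math


def softmax(logits):
    """Encode raw network outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Quantify the error as the negative log-probability of the true class."""
    return -math.log(softmax(logits)[true_class])
```

For a binary classifier such as Chimera, `logits` has two entries, and the loss is low when the probability assigned to the correct class (malware or goodware) approaches 1.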
In order to choose the best architectures for Chimera-S, Chimera-R, Chimera-D, and Chimera, we performed model selection (or hyperparameter tuning) using grid search cross-validation with 10-fold cross-validation to estimate the generalization error [33]. Grid search cross-validation exhaustively generates candidate architectures using a supplied grid of hyperparameter values and applies a 10-fold cross-validation procedure to estimate the model's generalization error based on a selected performance metric. In 10-fold cross-validation, the dataset is initially shuffled and split into ten parts, each containing approximately the same number of instances and the same proportion of malware and goodware instances. Next, the selected model is trained on nine parts and evaluated on the remaining part. The process repeats until all the parts have been selected for evaluation. Finally, the estimation of the model's performance is calculated by averaging the results of each evaluation. In this work, we chose the Accuracy metric (see Equation 1) to guide the model selection process since the Omnidroid dataset is balanced; thus, the Accuracy metric represents the percentage of correct predictions on the evaluation sets.
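The grid search and k-fold splitting logic described above can be sketched in pure Python. These are illustrative helpers; stratification by class (keeping the malware/goodware proportion per fold) and library implementations such as scikit-learn's GridSearchCV are omitted for brevity.

```python
import random
from itertools import product


def param_candidates(grid):
    """Exhaustively generate candidate hyperparameter combinations from a grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def kfold_splits(n_instances, k=10, seed=0):
    """Shuffle indices and yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each candidate from `param_candidates` would be trained and evaluated on every split, and the candidate with the best average Accuracy selected.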

Chimera-S
As depicted in Figure 5, Chimera-S introduces a DNN architecture [28] with one input layer containing 200 neurons (100 neurons for Android Intents features and 100 neurons for Android Permissions features), two hidden layers containing 256 and 128 neurons, respectively, and one output layer containing two neurons followed by a Softmax layer. Each fully connected layer is followed by a ReLU activation function. We included Dropout and Batch Normalization layers between the fully connected layers and between the output layer and the Softmax layer to mitigate overfitting. In addition, the best results were found when using the step decay schedule as the learning rate decay strategy. In the step decay schedule, the learning rate is reduced by a factor every predefined number of epochs, which might result in faster training convergence.
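A sketch of this architecture in PyTorch, assuming the layer sizes stated above. The Dropout rate, the exact ordering of the Dropout/Batch Normalization layers, and returning logits (with Softmax fused into the Cross-Entropy loss during training) are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ChimeraS(nn.Module):
    """Static subnetwork sketch: 200 inputs (100 Intents + 100 Permissions),
    hidden layers of 256 and 128 neurons, 2 output neurons."""

    def __init__(self, p_drop=0.5):  # p_drop is an assumed value
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(200, 256), nn.ReLU(), nn.BatchNorm1d(256), nn.Dropout(p_drop),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(p_drop),
            nn.Linear(128, 2),  # Softmax applied by the loss during training
        )

    def forward(self, x):
        return self.net(x)
```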
Several hyperparameters were considered for model selection. Please see Figure 5 for the optimized Chimera-S architecture and Figure 1 for the optimized Chimera architecture. Please refer to Section 3 for a detailed performance evaluation and discussion of the trained models Chimera-S and Chimera using the optimal hyperparameters.

Chimera-R
As we can see in Figure 3, the 2nd row depicts two malware instances from the same family (Trojan). It is easy to see that both instances share common spatial (visual) patterns. The same holds for the benign instances in the 1st row. If the spatial patterns across the instances of a dataset have enough discriminative power to identify the instance's class, then it is possible to use ML or DL techniques to leverage the information contained in the spatial patterns for detection and classification tasks. In fact, [24] proposed a CNN architecture for Android malware classification using DEX grayscale images, and [25] introduced a CNN architecture for Android malware detection using DEX opcodes translated to RGB images. CNNs are a class of DL networks commonly applied to computer vision problems [15] and were inspired by the animal visual cortex. CNNs are shift-invariant and based on shared weights. These properties allow CNNs to learn spatial patterns from images and reuse them to recognize those patterns independently of their positions. Moreover, shared weights reduce overfitting and training/inference time. As an example, a CNN can learn a filter (or kernel) able to recognize a high-level feature such as an eye and another filter able to recognize a nose, and, by using multiple convolutional layers, combine both into higher-level features that can be used for face recognition.
Our work follows an approach similar to that proposed by [24,25] and introduces a new CNN architecture inspired by the Residual Networks (ResNet) architecture [34]. As we can see in Figure 5, Chimera-R is composed of 4 convolutional layers used for feature extraction and a final DNN used for Android malware detection. The 5-tuple that defines each convolutional layer comprises the number of input channels, the number of output channels, the filter (or kernel) size, the stride of the filter, and the padding [28]. Notice that the number of output channels doubles in the second and third layers, and the number of output channels is multiplied by 4 in the last 1x1 convolutional layer [35]. Also, notice that the stride used in the first, second, and third layers is equal to 2. The effect of those hyperparameters on the architecture is as follows: each of the first three convolutional layers reduces the input's dimensions (width and height) by a factor of 2, while the second and third layers increase the number of extracted feature maps (depth) by a factor of 2, thus increasing the number of feature maps containing high-level feature representations while reducing their dimensionality. Finally, the Global Average Pooling operation [28] is applied after the 1x1 convolution to collapse the resulting tensor of feature maps into a tensor of real numbers that summarize each feature map. See Figure 6 for a simplified, out-of-scale representation of the Chimera-R CNN architecture. From this point on, the information is passed to a DNN with one hidden layer containing 128 neurons and one output layer containing two neurons, followed by a Softmax activation function. Each convolutional layer and fully connected layer is followed by a ReLU activation function to introduce non-linearity and prevent the vanishing gradient problem.
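A sketch of the described convolutional stack in PyTorch. The base channel count, kernel sizes, and padding are assumptions; the stride-2 pattern in the first three layers, the channel doubling, the 4x channel expansion in the final 1x1 convolution, and the Global Average Pooling follow the text.

```python
import torch
import torch.nn as nn


class ChimeraR(nn.Module):
    """CNN subnetwork sketch: 4 conv layers (stride 2 in the first three,
    channels doubling twice, a final 1x1 conv multiplying channels by 4),
    global average pooling, then a small DNN classifier."""

    def __init__(self, base=16, p_drop=0.5):  # base and p_drop are assumed values
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, base, 3, stride=2, padding=1), nn.BatchNorm2d(base), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.BatchNorm2d(base * 2), nn.ReLU(),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.BatchNorm2d(base * 4), nn.ReLU(),
            nn.Conv2d(base * 4, base * 16, 1), nn.BatchNorm2d(base * 16), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # Global Average Pooling
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(base * 16, 128), nn.ReLU(), nn.BatchNorm1d(128), nn.Dropout(p_drop),
            nn.Linear(128, 2),
        )

    def forward(self, x):  # x: (batch, 1, 128, 128) grayscale DEX images
        return self.classifier(self.features(x))
```

With 128x128 inputs, the spatial dimensions shrink 128 -> 64 -> 32 -> 16 across the first three layers while the depth grows, exactly as described above.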
To mitigate overfitting, similar to Chimera-S, we included Dropout and Batch Normalization layers between the fully connected layers and Batch Normalization between the convolutional layers. Contrary to what was verified for Chimera-S, our experiments indicated that the use of both Dropout and Batch Normalization layers between the convolutional layers led to overfitting and training instability. In addition, similar to Chimera-S (2.2.1), best results were found when using the step decay schedule for the learning rate decay strategy.
Several hyperparameters were considered for model selection. Please see Figure 5 for the optimized Chimera-R architecture and Figure 1 for the optimized Chimera architecture. Please refer to Section 3 for a detailed performance evaluation and discussion of the trained models Chimera-R and Chimera using the optimal hyperparameters.

Chimera-D
As we can see in Figure 4, the 2nd column depicts two malware instances' system call sequences overlapped. Also, notice that those two malware instances belong to the same family (Trojan). It is easy to see that both instances share common temporal patterns. The same holds for the benign instances in the 1st column. Suppose the temporal patterns across the instances of a dataset have enough discriminative power to identify the instance's class. In that case, it is possible to use ML or DL techniques to leverage the information contained in the temporal patterns for detection and classification tasks. In fact, [27] proposed an LSTM architecture to implement a neural probabilistic language model for Android malware detection using system call sequences. LSTM is a class of DL networks based on Recurrent Neural Networks (RNN) and capable of learning long-term dependencies on temporal/sequential data [36]. Our work is based on a different architecture for sequence learning: the Transformer Networks (TN) [16]. TN is a state-of-the-art encoder-decoder DL architecture designed to handle sequential data, such as natural language. Unlike RNNs, Transformers do not need to process sequential data in order. Due to this feature, Transformers facilitate parallelization during training time. Also, Transformers implement the Attention mechanism. Attention is used to let the network access any previous states and weights and learn which ones are more relevant for the task at hand. As we can see in Figure 5, Chimera-D is composed of a positional encoder that is used to add positional information to the inputs represented as 124-dimensional one-hot encoding vectors (2.1.3). The result is passed to the TN encoder layer for sequence learning and temporal feature extraction. Finally, a DNN is used for Android malware detection. Notice that Chimera-D and Chimera only make use of the encoder part of the TN. 
The TN encoder comprises an input layer of 124 neurons, a feedforward layer of 512 neurons, and four attention heads. The DNN contains three layers. The first layer has 400 * 124 neurons representing the high-level features extracted by the TN encoder. The second layer is composed of 128 neurons, and the output layer contains two neurons, followed by a Softmax activation function. Similar to Chimera-S and Chimera-R, we included Dropout and Batch Normalization layers after the TN encoder and the fully connected layers to mitigate overfitting and increase training stability. We used the ReLU activation function in the TN encoder and in Chimera-D to introduce non-linearity. To train Chimera-D, we used a different learning rate scheduling strategy, the learning rate warm-up, to increase the learning rate after every epoch by a constant factor. Learning rate warm-up mitigates premature convergence, and it is an essential technique for training TNs [16].
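A sketch of Chimera-D in PyTorch. The sinusoidal positional encoding, single encoder layer, and Dropout rate are assumptions; the 124-dimensional inputs, 4 attention heads, 512-neuron feedforward layer, ReLU activation, and the 400 x 124 flattened features follow the text.

```python
import math
import torch
import torch.nn as nn


class ChimeraD(nn.Module):
    """Dynamic subnetwork sketch: positional encoding over 124-dim one-hot
    system-call vectors, a TN encoder (4 heads, 512-neuron feedforward),
    then a DNN classifier over the flattened 400 x 124 features."""

    def __init__(self, n_calls=124, seq_len=400, p_drop=0.5):
        super().__init__()
        # Fixed sinusoidal positional encoding (seq_len x n_calls).
        pos = torch.arange(seq_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, n_calls, 2).float() * (-math.log(10000.0) / n_calls))
        pe = torch.zeros(seq_len, n_calls)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model=n_calls, nhead=4,
                                           dim_feedforward=512, activation="relu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(seq_len * n_calls, 128), nn.ReLU(),
            nn.BatchNorm1d(128), nn.Dropout(p_drop),
            nn.Linear(128, 2),
        )

    def forward(self, x):  # x: (batch, 400, 124) one-hot system-call sequences
        return self.classifier(self.encoder(x + self.pe))
```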
Several hyperparameters were considered for model selection. Please see Figure 5 for the optimized Chimera-D architecture and Figure 1 for the optimized Chimera architecture. Please refer to Section 3 for detailed performance results of the trained models Chimera-D and Chimera using the optimal hyperparameters.

Chimera
As we can see in Figure 1, once the subnetworks Chimera-S, Chimera-R, and Chimera-D have forward propagated their inputs, a shared representation layer is implemented by concatenating (feature-wise) their results, which are then passed to Chimera's last DNN classifier for Android malware detection. Similar to what was verified by [9,37], we found that training Chimera as a single model resulted in underfitting one subnetwork and overfitting the others. Taking that into account, we trained Chimera-S, Chimera-R, and Chimera-D separately using the optimized hyperparameters presented in Sections 2.2.1, 2.2.2, and 2.2.3, respectively, and used Transfer Learning [38] to combine them into the final Chimera architecture. Transfer Learning works by training each model separately and saving their weights for later use and integration with other models. During training time, the weights of the pre-trained models are frozen, and the weights of the new model are trained, taking advantage of what was previously learned by the pre-trained models. Notice that Chimera's subnetworks do not include the last fully connected layers from their counterparts since Chimera itself needs to be optimized and trained to classify the high-level feature representations from the intermediate fusion layer. Finally, as we can see in Figure 1, Chimera's DNN classifier is composed of one input layer containing 384 neurons, one hidden layer containing 512 neurons, and one output layer containing two neurons followed by a Softmax activation function. A ReLU activation function follows each fully connected layer to introduce non-linearity. We included Dropout and Batch Normalization layers between the fully connected layers to mitigate overfitting. Moreover, similar to Chimera-S (2.2.1) and Chimera-R (2.2.2), the best results were found when using the step decay schedule as the learning rate decay strategy.
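The fusion and freezing logic described above can be sketched in PyTorch. The assumption that each subnetwork contributes a 128-dimensional embedding (3 x 128 = 384) is consistent with the 384-neuron input layer stated above, but the per-subnetwork split is an assumption.

```python
import torch
import torch.nn as nn


class Chimera(nn.Module):
    """Fusion sketch: pre-trained subnetworks (with their final classifier
    layers removed) are frozen via transfer learning; their embeddings are
    concatenated feature-wise into a 384-dim shared representation classified
    by a new DNN (384 -> 512 -> 2)."""

    def __init__(self, chimera_s, chimera_r, chimera_d, p_drop=0.5):
        super().__init__()
        self.subnets = nn.ModuleList([chimera_s, chimera_r, chimera_d])
        for sub in self.subnets:  # transfer learning: freeze pre-trained weights
            for param in sub.parameters():
                param.requires_grad = False
        self.classifier = nn.Sequential(
            nn.Linear(384, 512), nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(p_drop),
            nn.Linear(512, 2),
        )

    def forward(self, x_static, x_image, x_syscalls):
        embeddings = [self.subnets[0](x_static),
                      self.subnets[1](x_image),
                      self.subnets[2](x_syscalls)]
        return self.classifier(torch.cat(embeddings, dim=1))  # feature-wise fusion
```

During training, only `classifier` receives gradient updates; the subnetworks act as fixed feature extractors.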
Several hyperparameters were considered for model selection. Please see Figure 1 for the optimized Chimera architecture. Please refer to Section 3 for detailed performance results of the trained Chimera model using the optimal hyperparameters.

Performance Evaluation and Discussion
To evaluate our method, we performed 10-fold cross-validation using the preprocessed/transformed Omnidroid dataset (see 2.1) on a set of baseline ML/DL algorithms/methods. Notice that we set the number of CPU cores to four for all the ML and Ensemble ML methods, set the number of estimators to 100 for the Ensemble ML methods, and left all the other hyperparameters at the default values used in the scikit-learn library [40]. The LSTM network architecture was defined as follows: an LSTM hidden layer of 256 neurons and a fully connected layer of 256 neurons, followed by a Softmax activation function. Also, we applied Dropout and Batch Normalization to mitigate overfitting. Finally, we compared Chimera's performance results to the state-of-the-art Voting Ensemble ML method results presented in [19].
The following performance metrics were chosen for results evaluation and comparisons:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Precision = TP / (TP + FP) (2)

Recall = TP / (TP + FN) (3)

where TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.
In the context of this work, the Accuracy metric (see Equation 1) represents the total number of correct detections over the total number of instances. Since the Omnidroid dataset is a balanced dataset [19], the Accuracy metric directly represents the percentage of correct detections. Notice that if the number of false positives and the number of false negatives are equal to zero, the method achieved the highest possible Accuracy. Therefore, the higher the Accuracy, the better the method's overall performance. It is important to notice that, in the context of malware detection methods, false negatives pose a much more significant threat to users than false positives. On the one hand, a false positive means that goodware was detected as malware, which usually does not cause any harm; on the other hand, a false negative means that malware was detected as goodware, thus bypassing the detection method. Also, we used the Area Under the Receiver Operating Characteristic Curve (AUC ROC) metric [41] and the Fit Time for performance comparisons. The AUC ROC metric summarizes a binary classifier's performance as its discrimination threshold is varied. Fit Time refers to the amount of time (in seconds) necessary to train each method.
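The metrics above reduce to a few lines of code; the confusion counts below are invented purely for illustration. Note how the recall term is the one penalized by false negatives, the costlier error in malware detection.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, and Recall from the four confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # hurt by false alarms on goodware
    recall = tp / (tp + fn)      # hurt by missed malware (false negatives)
    return accuracy, precision, recall


# Toy counts: 90 malware caught, 85 goodware passed, 15 false alarms,
# 10 malware samples missed.
acc, prec, rec = detection_metrics(tp=90, tn=85, fp=15, fn=10)
print(acc, prec, rec)  # -> 0.875 0.857142857142857... 0.9
```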
As we can see in Tables 1, 2, and 3, the results of 10-fold cross-validation using Static Analysis data (Android Intents & Permissions), Static Analysis data (DEX grayscale images), and Dynamic Analysis data (system call sequences) show that Chimera achieved the best performance for all the considered metrics except the Fit Time. Chimera uses a shared representation layer for Android malware detection; thus, it takes advantage of multiple data modalities. Moreover, it combines automatic feature engineering, through its DL architectures, with manual feature engineering applied to (1) the early fusion layer of Chimera-S, (2) the system call sequences of Chimera-D, and (3) the DEX image resampling of Chimera-R. Regarding the Fit Time metric, it is possible to accelerate training/inference by using a faster Graphics Processing Unit (GPU). It is important to notice that Chimera achieved higher performance than its subnetworks evaluated independently, as we can see in Tables 1, 2, and 3. The reason is that Chimera learned to correlate information from multiple modalities of data, increasing true positives and true negatives and decreasing false positives and false negatives, thus increasing its Accuracy, Precision, Recall, and AUC ROC.
As presented in Table 1, Chimera-S achieved fourth place, showing better performance than all the classical ML algorithms and some of the Ensemble ML algorithms. Notice that Chimera-S achieved the second-best Precision. Although Chimera-S implements a DNN, its architecture is relatively shallow; future work will investigate deeper architectures for Chimera-S. As presented in Table 2, Chimera-R achieved second place, showing better performance than all the classical and Ensemble ML algorithms. Chimera-R is based on CNNs, which are specialized DL architectures for image processing. Interestingly enough, the RBF SVM achieved third place, showing better performance than all the other classical and Ensemble ML algorithms; however, since the RBF SVM presents O(n^3) time complexity and O(n^2) space complexity, it does not scale well to problems with large feature vectors, consequently requiring a large amount of resources and processing time. In our experiments, fitting the RBF SVM took approximately five hours, while fitting Chimera-R took approximately three minutes and fitting Chimera approximately four minutes. Notice that the K-NN algorithm achieved the best Recall while its Accuracy and Precision present low values. This scenario indicates that the K-NN algorithm became biased towards malware instances, classifying most instances as malware, which increased the Recall and decreased the Precision, Accuracy, and AUC ROC. As presented in Table 3, Chimera-D achieved fifth place, showing better performance than all the classical ML algorithms and some Ensemble ML algorithms. Chimera-D is based on TNs, which are state-of-the-art DL architectures for processing sequential data. The results in Table 3 show that all the algorithms except Chimera achieved less than 72% Accuracy. A possible reason is that the sequences of system calls do not present high discriminative power. Another reason can be related to the size of the sequences used (2.1.3).
The results presented in Table 3 clearly show one advantage of using a multimodal method such as Chimera, which does not depend only on one data modality.

Finally, we also compared Chimera to the state-of-the-art Voting classifier proposed by [19]. First of all, both Chimera and the Voting classifier use the same dataset, Omnidroid, introduced in the same paper [19]. The Voting classifier includes a Random Forest classifier for the static features and a Bagging classifier for the dynamic features. Each classifier contributes to the final classification result according to its weight, chosen using model selection. Using 100 estimators for each classifier, the Voting classifier achieved an Accuracy of 0.897 (±0.008) and a Precision of 0.897 (±0.007).
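The weighted-voting idea behind this baseline can be sketched with scikit-learn's VotingClassifier. This is a simplified illustration, not a reproduction of [19]: there, each base classifier operates on a different feature set (static vs. dynamic), whereas this sketch feeds both the same synthetic matrix, and the 0.6/0.4 weights are invented (in [19] they are selected via model selection).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              VotingClassifier)

# Synthetic stand-in for the Omnidroid feature matrix.
X, y = make_classification(n_samples=400, n_features=30, random_state=1)

# Weighted soft voting between a Random Forest (static features in [19])
# and a Bagging classifier (dynamic features), 100 estimators each.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("bag", BaggingClassifier(n_estimators=100, random_state=1)),
    ],
    voting="soft",
    weights=[0.6, 0.4],  # illustrative; chosen via model selection in [19]
)
voter.fit(X, y)
print(round(voter.score(X, y), 3))
```

With `voting="soft"`, the final prediction averages the two classifiers' class probabilities using the given weights, which is how each classifier "contributes according to its weight."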

Limitations and Future Work
Chimera is, by definition, a binary classifier designed for Android malware detection. Malware analysts might take advantage of knowing which family a malware sample belongs to in order to perform incident response more effectively. To accomplish that, Chimera will be extended to malware multiclass classification in future work.
Chimera was explicitly designed for Android malware detection; however, our experiments indicate that it is possible to extend it to Windows malware detection and classification.
Finally, Chimera, Chimera-S, Chimera-R, and Chimera-D are black-box methods. Malware analysts might take advantage of knowing why an instance was detected as malware and why it belongs to a particular family. Future work will investigate DL interpretability methods and how to apply them to Chimera.

Related Work
The first Android malware detection method based on multimodal deep learning was introduced by [9]. The authors used Static Analysis data extracted from different sources: Android Manifest files, decompiled DEX files, and disassembled shared libraries. A multimodal DL architecture based on several DNNs was proposed for feature extraction, and an intermediate fusion layer was introduced to build shared representations that can be used for malware detection. In addition, the authors also introduced two types of feature vector generation methods: existence-based and similarity-based. [10] introduced a multimodal DL method that implements an intermediate fusion layer composed of features extracted from Static Analysis data: Permissions and Hardware Features. DNNs were used to implement both the feature extractors and the classifier. The authors evaluated the performance of using each modality separately and in the multimodal setting. [11] proposed a multimodal DL method that implements an early fusion layer composed of features extracted from Static Analysis data: Android Manifest files and Java API modules. Moreover, the authors also included SigPid (Significant Permission Identification) as a feature, since SigPid was evaluated to have high discriminative power. Finally, a DNN was introduced for feature extraction and malware detection. [12] introduced a multimodal DL method to extract features from Static Analysis data: Android Manifest files and decompiled DEX files. The multimodal DL architecture is based on CNNs for feature extraction, an intermediate fusion layer to build shared representations of the extracted features, and a DNN for malware detection. Also, the authors designed a backtracking module for an interpretable explanation.
[13] developed a multimodal DL method to extract features from Static and Dynamic Analysis data and proposed a multimodal DL architecture based on several DNNs and an intermediate fusion layer to build the shared representations that can be used for malware detection.