Heterogeneous Graph Convolutional Networks for Android Malware Detection using Callback-Aware Caller-Callee Graphs

The popularity of the Android Operating System in the smartphone market has given rise to a large amount of Android malware. To accurately detect such malware, many existing works use machine learning and deep learning-based methods, in which fixed-size feature vectors are extracted from the files present inside the Android Application Package (APK). Recently, Graph Convolutional Network (GCN) based methods applied on the Function Call Graph (FCG) extracted from the APK are gaining momentum in Android malware detection, as GCNs are effective at learning tasks on variable-sized graphs such as the FCG, and the FCG sufficiently captures the structure and behaviour of an APK. However, the FCG lacks information about callback methods as the Android Application Programming Interface (API) is event-driven. This paper proposes enhancing the FCG to eFCG (enhanced-FCG) using the callback information extracted using Android Framework Space Analysis to overcome this limitation. Further, we add permission-API method relationships to the eFCG. The eFCG is reduced using node contraction based on the classes to get R-eFCG (Reduced eFCG) to improve the generalisation ability of the Android malware detection model. The eFCG and R-eFCG are then given as inputs to Heterogeneous GCN models to determine whether the APK file from which they are extracted is malicious or not. To test the effectiveness of eFCG and R-eFCG, we conducted an ablation study by removing their various components. To determine the optimal neighbourhood size for GCN, we experimented with a varying number of GCN layers and found that the Android malware detection model using R-eFCG with all its components and four convolution layers achieved a maximum accuracy of 96.28%.


I. INTRODUCTION
Android is a popular smartphone Operating System that powers around 70% of the smartphones and tablets worldwide [33]. Its popularity has long attracted a large amount of malware into its ecosystem [25] [31], threatening the privacy and security of its users. Three analysis techniques are prevalent to detect Android malware: static, dynamic and hybrid analysis [29]. In static analysis, features are extracted from the Android Application Package (APK) file without executing it. Dynamic analysis executes the APK inside a sandbox and extracts run-time features. Hybrid analysis is a combination of the two. Although obfuscation techniques can hinder static analysis [38], it is substantially faster than its counterparts.
The APK file provides several features to perform static analysis. The features such as permissions and intents can be extracted from the manifest file, which are the indicators of the behaviour of the Android application (app) [6] [18] [34] [1]. Apart from them, features such as sensitive Application Programming Interface (API) calls [1], API call graph [12] and Function Call Graph (FCG) [23] [39] [16] can be extracted from the Dalvik Executable (dex) code. Out of these features, FCG captures the structure of interactions between the methods of the app. The FCG is a directed graph with methods in the dex code as its nodes; its edges represent Caller-Callee relationships between the methods. If every node of the FCG is assigned features that represent its behaviour, it can capture the behaviour of an app as a whole [35].
The methods contained in the dex code can be internal or external depending on whether their implementation is contained in the dex code or not [35]. In general, the API methods (the Framework Space, F) are external, while user-defined methods (the Application Space, A) are internal. As FCGs are extracted entirely using the information present in the dex code, interactions from the Framework Space to the Application Space cannot be captured [10]. This information is crucial as the Android API is heavily event-driven. In the Android event architecture, event handlers are implemented as Application Space callback handlers, which are the children of Framework Space callback methods. The Framework Space is made aware of callback handlers using registration methods, which are also a part of the Framework Space [10]. The FCG is unable to capture the relationship between registration methods and callback handlers. The Framework Space has to be analysed to include such relationships, and its results have to be used while constructing the FCG [10] [13].
Graph Convolutional Networks (GCNs) [20] have become a natural choice to perform deep learning on graphs because of their flexibility [43]. GCNs process graphs by aggregating neighbourhood information, updating a node's features based on it and fine-tuning its learnable parameters for a particular task. An n-layer GCN aggregates features into a node from its n-hop neighbourhood. A global pooling operation on the graph is used to obtain the feature vector representing the graph. This vector can then be used for downstream tasks such as classification.
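The n-hop claim can be made concrete with a small sketch. The code below is only an illustration in plain NumPy, not the model used in this work: the toy path graph, the feature values and the function name propagate are assumptions, and mean aggregation without learnable weights stands in for a full GCN layer.

```python
import numpy as np

def propagate(adj: np.ndarray, feats: np.ndarray, n_layers: int) -> np.ndarray:
    """Apply n_layers rounds of mean aggregation over each node's neighbours and itself."""
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalise, i.e. take the mean
    h = feats
    for _ in range(n_layers):
        h = a_hat @ h
    return h

# Undirected path graph 0 - 1 - 2 - 3; only node 3 carries a non-zero feature.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[0.0], [0.0], [0.0], [1.0]])

one_layer = propagate(adj, feats, 1)      # node 0 sees only its 1-hop neighbourhood
three_layers = propagate(adj, feats, 3)   # node 0 now receives information from node 3
graph_vector = three_layers.mean(axis=0)  # global mean pooling over all nodes
```

After one layer the feature of node 3 has not reached node 0, which is three hops away; after three layers it has, and the pooled vector summarises the whole graph.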
In this work, we analyse Framework Space code to extract a Registration-Callback map, motivated by the approach of [10]. We also consider the mapping of permissions required by an API method from [7]. This information is utilised while analysing APKs to convert FCGs extracted from them into enhanced-FCGs (eFCGs). The reduced-eFCG (R-eFCG) is then obtained by contracting nodes of the eFCG in an approach similar to MaMaDroid's [26]. Separate heterogeneous GCN models are then trained on eFCG and R-eFCG to evaluate their effectiveness.
We answer the following research questions in this paper: 1. Which components of eFCG and R-eFCGs are essential in Android malware detection using heterogeneous GCNs? 2. Can R-eFCGs achieve better generalisation in terms of Android malware detection rate than eFCGs? 3. What is the optimal neighbourhood size n for GCNs to detect Android malware using eFCG and R-eFCGs? To answer these research questions, we experiment with different components of eFCG and R-eFCGs to determine their contribution to the performance of the Android malware detection model. We also train separate models on eFCGs and R-eFCGs to assess their generalisation ability. To determine the choice of optimal neighbourhood, we conducted a set of experiments by varying the number of GCN layers. As a result of these experiments, we obtained a maximum accuracy of 96.25% with R-eFCGs with all components and four GCN layers.
The key contributions of the present work are as follows: 1. We define eFCG and R-eFCG, containing the callback information and permission mappings along with the Caller-Callee information, and provide algorithms to obtain the same. 2. We conducted an ablation study to find essential components of eFCG and R-eFCG and found that all their components are essential. 3. We monitored the impact of the number of heterogeneous GCN layers on the performance of the Android malware detection model and found that its performance increases with the number of layers. The rest of this paper is organised as follows: Section II demonstrates a simple app and its FCG used throughout this paper. Several relevant related works are discussed in Section III. Section IV provides an overview of mathematical concepts used in this paper. The algorithms to obtain eFCG and R-eFCG, along with the architecture of the Android malware detection approach, are described in Section V. The experimental framework to evaluate the current work and its results are discussed in Section VI. Finally, the paper is concluded in Section VII, along with a discussion of future directions.

II. MOTIVATION
A simple app containing a button (class Button) and a text view (class TextView) is used to demonstrate the FCG and its enhancements throughout this work. When the user clicks on the button, the app starts tracking their location in the background and logs it periodically to the text view. Its source code and FCG are shown in Figure 1, where the registration methods Button.setOnClickListener() (line 45 in Figure 1a) and LocationManager.requestLocationUpdates() (lines 26-31 in Figure 1a) are not connected to their callback handlers onClick() (line 20 in Figure 1a) and onLocationChanged() (line 7 in Figure 1a), respectively, in the FCG.
To include relationships between registration methods and associated callback handlers, the Framework Space has to be analysed to obtain a mapping between all possible registration and callback methods. This mapping has to be used while analysing the APK file to identify the implementations of callback methods as callback handlers and associate them with their registration methods. This association has to be represented with a different edge type in the FCG, as it is different from the regular caller-callee edge type. Moreover, FCGs contain many nodes and edges, depending on the size of the APK [23], which potentially affects the ability of the malware detection model to generalise well, motivating the need to reduce their size [26].
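The effect of the missing edges can be sketched with a reachability check. This is a simplified illustration, not the construction used later in this work: the caller names prefixed with MainActivity are hypothetical (the source of Figure 1a is not reproduced here), and registration methods are linked directly to handlers instead of going through Framework Space callback methods.

```python
from collections import deque

def reachable(edges, start):
    """Return the set of nodes reachable from start by following (src, dst) edges."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Caller-callee edges as a plain FCG would record them (caller names are hypothetical).
calls = [
    ("MainActivity.onCreate", "Button.setOnClickListener"),
    ("MainActivity.onClick", "LocationManager.requestLocationUpdates"),
]
# Registration-to-handler edges that only Framework Space analysis can supply.
callbacks = [
    ("Button.setOnClickListener", "MainActivity.onClick"),
    ("LocationManager.requestLocationUpdates", "MainActivity.onLocationChanged"),
]

plain = reachable(calls, "MainActivity.onCreate")
enhanced = reachable(calls + callbacks, "MainActivity.onCreate")
```

In the plain graph the location handler is unreachable from the entry point, so a model that only sees caller-callee edges cannot relate the button click to location tracking; with the extra edge type the chain is connected.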

III. RELATED WORK
The manifest can be used as a feature source to detect malware. For example, the permissions and intents were used to detect Android malware in [6] [18] [34] [1]. However, permissions extracted from the manifest are not conclusive, since a declared permission need not actually be used by the app, which could lead to false positives during malware detection [11].
The dex code describes app behaviour. [30] represented the raw bytecode in the dex code in the form of a fixed-size image and provided it as the input to a Convolutional Neural Network-based Android malware detector. However, such representations completely ignore the structural information contained in the dex code, along with being prone to resizing losses.
The graphs extracted from dex code retain the structural information contained in the dex code. Out of these graphs, the API Call Graph captures the call order between the API methods. The API Call Graph was used as a feature source in [12], [28]. API Call Graphs are easy to work with as their maximum size is known since the number of API methods is fixed. However, they cannot capture the complete behaviour of the app as they ignore user-defined methods.
FCGs capture Caller-Callee relationships between every method of the dex code and are of huge magnitude [23]. Therefore, they were not used as a whole in many works [23] [39] [16]. Of such works, [39] use centrality measures of API methods, while [16] use graphlet frequency distribution as the feature vectors. Classifiers were trained with these feature vectors to detect Android malware. However, these works are limited because they only consider the structure of the FCG, ignoring features that can be derived from method nodes. [23] was one of the earliest works to associate node features with every node of the FCG. The feature vector was derived from the opcodes of the method, on which 1-hop neighbour XOR aggregation was performed. The aggregated feature vectors were clustered using k-Nearest Neighbours to obtain cluster centres. These centres were used as the graph-level feature vector. This approach is similar to the working of a (1-layer) GCN in terms of neighbourhood aggregation, but it completely ignores the presence of API nodes, which are essential in assessing the behaviour of an app.
GCNs were used to detect Android malware based on FCGs in [9] [40] [35]. Of them, [9] used a word embedding based on the method name as the node feature of the FCG and conducted experiments on an imbalanced dataset. Node features based on the method name are ineffective in the presence of obfuscation, as it mangles the method names. Also, imbalanced datasets are known to induce biases in the performance of GCN-based classifiers [35].
The API node subgraph of the FCG, with centrality measures as node features, was considered in [40]. This graph was the input to the GCN-based classifier. However, similar to [12] and [28], [40] cannot completely characterise the behaviour of the APK.
GCNs considering both API nodes and User nodes were used in [35]. Observing the potential biases that can be induced to GCN models in terms of an imbalanced dataset and imbalanced node size distribution among the classes, [35] proposed a dataset balancing algorithm. [35] consider both API nodes and user nodes as a single type and ignore callback information. The present work builds on the approach of [35] by treating API nodes and user nodes as different types, thus using heterogeneous FCGs. The present work also adds permission nodes to the FCG by considering API-Permission mapping and adds callback information to the FCG. This work adopts the node features of API and user nodes from [35].
EdgeMiner [10] proposed a Framework Space code analysis method to extract Registration-Callback pairs and added them to the FCG to extract potentially harmful paths in it. [13] provided a more granular view of the callback methods, considering the conditions under which they are called back. A similar analysis was performed in [7] to extract the API method-Permission mappings. However, none of these works built an end-to-end malware detection pipeline.

IV. PRELIMINARIES
This work uses several mathematical structures such as sets, functions and graphs. This section provides an overview of them along with discussing the structure of the dex code. Table 1 presents an overview of the notations discussed in this section.

A. MATHEMATICAL COLLECTIONS
A set is an unordered collection of unique elements. If elements are allowed to be present in it multiple times, it is called a multiset.
A (binary) relation R between two sets A and B is a subset of the Cartesian product A × B. A function (or map) f : A → B can be thought of as an association between a value x ∈ A and a single value y ∈ B, whereas a multimap associates x with a set of values from B. The set of values (a single value, in case of maps) associated with x ∈ A by f is represented as f(x). The domain of the function f (represented as dom(f)) is the set of values x for which f(x) is defined (in the above examples, dom(f) = A). Interested readers are referred to [21] for further information about set theory.

Table 1. Overview of the notations used in this work.
Graphs
A_τ : Attribute function for node type τ.
𝒜_τ : Attribute space of the node type τ.
Neighbourhood of node v in the graph G with node type τ.
Set of children of the node.
succ_G(v) : Set of successors of node v in a DAG G.
The dex code
C : Set of classes defined and referenced in the dex code.
M : Set of methods defined and referenced in the dex code.
isF(·) : Whether the flag F is present in the definition of its argument.
methods(c) : Set of methods of the class c.
class_parents(c) : Set of parent classes of the class c.
constructors(c) : Set of constructors of the class c.
argumentTypes(m) : Set of argument types of the method m.
class(m) : The class to which the method m belongs.
sig(m) : Signature of the method m.
Γ : The FCG.
I : The Inheritance Graph.
Superscript (M) : Method-level graphs and edges.
Superscript (C) : Class-level graphs and edges.
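The collections described in this subsection map directly onto standard containers. The following sketch is only an illustration with made-up element names; the helper make_multimap is an assumption introduced here and is not part of the original work.

```python
from collections import Counter, defaultdict

# A multiset keeps track of how many times each element occurs.
multiset = Counter(["a", "b", "a"])

def make_multimap(pairs):
    """Turn a binary relation (a set of pairs) into a multimap: key -> set of values."""
    mm = defaultdict(set)
    for x, y in pairs:
        mm[x].add(y)
    return dict(mm)

relation = {("x1", "y1"), ("x1", "y2"), ("x2", "y1")}
f = make_multimap(relation)
dom_f = set(f)   # dom(f): the keys for which f(x) is defined
```

Here f("x1") is a set of two values, which is what distinguishes a multimap from a map.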

B. GRAPHS
A path p is a sequence of edges e_1 → e_2 → · · · → e_n where every edge e_i = (u_i, v_i) is distinct and u_i = v_{i−1} ∀ i > 1. Two nodes x and y are connected in G if there is a path between them. A graph is acyclic if there are no paths in G such that u_1 = v_n. A Directed Acyclic Graph (DAG) is a graph that is both directed and acyclic. As all these graphs consist of a single type of nodes and edges, they are homogeneous. Interested readers are referred to [21] for further information about graphs.
If a graph contains multiple types of nodes or edges (or both), it becomes heterogeneous. Heterogeneous graphs occur naturally in many fields such as Recommender Systems [36] [41] and Bioinformatics [22]. The concept of heterogeneous graphs is illustrated here in light of the FCG in Figure 1b, which contains nodes of Application Space A and Framework Space F. Formally, a directed heterogeneous graph G is described by the following components:
• 𝒱 is the set of node types (e.g., {A, F}),
• ℰ ⊆ 𝒱² is a multiset of edge types, each associated with a name (e.g., {calls : (A, A), calls : (A, F)}),
• V = ∪_{τ∈𝒱} V_τ is the set of nodes, and
• E = ∪_{t∈ℰ} E_t is the set of edges.
An edge set can be denoted by the name of its type followed by the node types it connects (e.g., E_{calls:A→F}). The names of the node types can be omitted if no other edge type has the same name.
A heterogeneous graph becomes undirected if, for every edge type t ∈ ℰ, the reverse edge type is also present in ℰ. The structure of the heterogeneous graph is represented as a multigraph G_M(𝒱, ℰ), called the metagraph of G. Figure 2 shows the metagraph of the FCG shown in Figure 1b. Interested readers are referred to [42] for further information about heterogeneous graphs.
If every node of the graph is associated with some attributes, the graph is called an attributed graph. The attribute function A_τ : V_τ → 𝒜_τ defines the attributes for each node v ∈ V_τ of type τ in the attribute space 𝒜_τ. For homogeneous graphs, there is only one attribute space.
For a graph G with nodes V and edges E, we use the following notations to denote information about the neighbourhood of v ∈ V with node type τ ∈ 𝒱 in G. The neighbourhood of v is the set of its 1-hop neighbours with type τ. If G is a homogeneous DAG, pred_G(v) = {u | there is a path from u to v in G} is the set of predecessors of v, and succ_G(v) = {w | there is a path from v to w in G} is the set of successors of v. In every notation, the subscript G is omitted when the graph in question can be inferred from the context, and the type subscript τ is omitted if the graph is homogeneous or when we refer to nodes of all types.
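A heterogeneous graph and its metagraph can be sketched with typed edge sets. The representation below is an assumption made for illustration: edge sets are keyed by (source type, name, destination type), the node names are toy values, and the rev_ prefix for reverse edge types is a hypothetical naming choice.

```python
# Edge sets keyed by (source type, edge name, destination type); node names are illustrative.
edges = {
    ("A", "calls", "A"): [("onCreate", "startTracking")],
    ("A", "calls", "F"): [("onCreate", "Button.setOnClickListener")],
}

def node_sets(edges):
    """Collect the nodes of each type from the typed edge sets."""
    nodes = {}
    for (src_t, _, dst_t), pairs in edges.items():
        for u, v in pairs:
            nodes.setdefault(src_t, set()).add(u)
            nodes.setdefault(dst_t, set()).add(v)
    return nodes

def metagraph(edges):
    """Return one (source type, name, destination type) triple per edge type."""
    return sorted(edges.keys())

def add_reverse_types(edges):
    """Add a reverse edge type for every edge type, making the graph undirected."""
    out = dict(edges)
    for (src_t, name, dst_t), pairs in edges.items():
        out[(dst_t, "rev_" + name, src_t)] = [(v, u) for u, v in pairs]
    return out

undirected = add_reverse_types(edges)
```

The metagraph only records which node types are linked by which edge types, so it stays small regardless of how many nodes the graph itself contains.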

C. THE DEX FILE
The classes.dex present inside the APK contains the application logic represented as the dex code, to be executed by Android Runtime [32]. Android API is also bundled in several dex files, residing in /system/framework/framework.jar in the case of Android 11.
By parsing the dex code, one can obtain the sets of classes C and methods M implemented and referenced within its scope. Note that the interfaces and enums are treated as classes, and the constructors are treated as methods in the dex code. The definition of every class c ∈ C and method m ∈ M associates them with several flags. These flags include modifier information (e.g., public, static and abstract) and the declaration type in the code (e.g., interface, enum and constructor). We define a Boolean function isF (·), which returns true whenever the flag F is present in the definition of its argument.
VOLUME 4, 2016
Apart from the flags, the definition of the class c includes a list of its methods (methods(c)), along with a list of its parents in the inheritance hierarchy (class_parents(c)). The constructors of c can be obtained by filtering its methods with the flag constructor, i.e., constructors(c) = {m | m ∈ methods(c) ∧ isConstructor(m)}. Similarly, the definition of method m includes the types of its arguments (argumentTypes(m)) and a reference to the class to which it belongs (class(m)). Multiple methods in a class can have the same name due to method overloading; thus, the method name along with its argument type list (the signature, denoted by sig(m)) is unique for every method.
If a method m is internal, the dex code includes its bytecode in the dex format. The bytecode consists of a sequence of instructions, with each instruction containing an opcode and operand(s). Each opcode is 8 bits long, making 256 opcodes possible, of which only 230 are used [15]. As many of the opcodes do a similar task (e.g., the opcode range 0x90-0xE2 consists of binary operations such as add, sub and mul), they can be grouped based on their functionality. While [23] constructed 15 opcode groups, [35] constructed 21 opcode groups. This work uses the opcode groups of [35]. Interested readers are referred to the Dalvik Specification [32] for more information about the dex code.
Using the relationships among the methods M and classes C contained in the dex code, several graphs can be constructed. Out of them, this work uses the FCG Γ and the Inheritance Graph I. The superscripts (M) and (C) in the name of a graph or an edge set indicate that its edges are among method nodes and class nodes, respectively. If the superscript is not present in the graph name (e.g., Γ and I), the graph is assumed to be method-level.
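The class and method accessors above can be sketched as follows. The Method record and the helper names are assumptions introduced for illustration; they do not correspond to the API of any particular dex parser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Method:
    name: str
    argument_types: tuple          # argumentTypes(m), kept in declaration order
    flags: frozenset = frozenset() # e.g. {"public", "constructor"}

def sig(m: Method) -> str:
    """Signature: method name together with its argument type list."""
    return f"{m.name}({', '.join(m.argument_types)})"

def is_flag(m: Method, flag: str) -> bool:
    """isF(m): true whenever the flag F is present in the definition of m."""
    return flag in m.flags

def constructors(methods) -> set:
    """constructors(c) = {m | m in methods(c) and isConstructor(m)}."""
    return {m for m in methods if is_flag(m, "constructor")}

# methods(c) for a toy class with one constructor and two overloads of log().
methods_c = [
    Method("<init>", (), frozenset({"public", "constructor"})),
    Method("log", ("String",), frozenset({"public"})),
    Method("log", ("String", "int"), frozenset({"public"})),
]
```

The two log overloads share a name but have different signatures, which is why sig(m) rather than the name alone identifies a method.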

V. PROPOSED APPROACH
The proposed Android malware detection approach consists of two analysis stages, Framework Space Analysis and Application Space Analysis, followed by a Heterogeneous GCN-based Android malware detection model. The Framework Space Analysis is done once, and its outputs are re-used in the Application Space Analysis for every app. Separate Heterogeneous GCN models are trained for the eFCG and R-eFCG obtained by the Application Space Analysis. The following sections describe every stage in detail.

A. FRAMEWORK SPACE ANALYSIS
The Framework Space Analysis analyses the Android Framework to extract a mapping between Registration and Callback methods. To do so, the dex file containing Framework Space code has to be parsed to get the set of Framework Classes C_F and Framework Methods M_F. From C_F and M_F, the Framework Space Inheritance Graph I_F and the Framework Space FCG Γ_F are obtained, respectively. The approach of [10] is adopted to extract potential callback methods from the Framework Space code, which are then filtered to obtain the final callback methods along with the corresponding registration methods. The architecture of Framework Space Analysis is shown in Figure 3.

1) Potential Callback Filter
A potential callback method is a Framework Space method which is visible to the Application Space and can be overridden by it. A method m with c = class(m) becomes a potential callback method if all of the criteria adopted from [10] are satisfied. Criteria 1, 2 and 3 ensure that the class c is visible to Application Space classes and can be extended; Criterion 4 ensures that the method m can be overridden in the Application Space. As all interface methods are public by default, Criterion 4 is true for them. P denotes the set of all potential callbacks.

2) Registration-Callback Map Extraction
A method m being a potential callback does not guarantee that its Application Space override m′ can be introduced back to the Framework Space through an Application Space visible registration method r and subsequently called back by the Framework Space. Note that to introduce m′ to r, the method r must take an argument of type c = class(m), thus accepting any instance of a class c′ derived from c that overrides m in its method m′.
To filter out the methods m whose overrides cannot be introduced to the Framework Space, we use the Argument Map. The Argument Map is a multimap α_F : C_F → M_F with α_F(c) = {m ∈ M_F | c ∈ argumentTypes(m)}. In other words, for a Framework Space class c, α_F(c) is the set of methods M ⊂ M_F that take c as an argument. If α_F(c) = ∅ for c = class(m), then the class c cannot be passed back to the Framework Space; therefore, none of its methods are callback methods.
A registration method r taking an argument of type c need not necessarily invoke the method m of c. To check for the invocation of m, a complete reverse data-flow analysis tracking c until the invocation of m is required, as in [10]. However, we empirically observe that, most of the time, the invocation of m happens in a method µ belonging either to u = class(r) or to some nested class u′ of u. Therefore, the following criteria are used to identify callback methods:
1. c is an argument of some Application Space visible method r, i.e., ∃ r ∈ α_F(c) s.t. isPublic(r) ∧ isPublic(class(r)), and
2. some method µ belonging either to u = class(r) or to some nested class u′ of u invokes m in its code.
If a method m satisfies the above criteria, then the method r is the registration method of m, and the pair (r, m) is added to the Registration-Callback map R. The process of extracting the Registration-Callback map is summarised in Algorithm 1.
Algorithm 1
    I_F ← Extract Inheritance Graph using C_F    ▷ See Section IV-C
    Γ_F ← Extract FCG using M_F    ▷ See Section IV-C
    α_F ← Extract Argument Map using M_F and C_F    ▷ See Section V-A2
    P ← Extract Set of Potential Callbacks from M_F    ▷ See Section V-A1
    C_P ← ∅    ▷ Multimap of methods in P keyed by their class
    for m in P do
        if ∃u s.t. (class(m), u) ∈ α_F then    ▷ Check if class(m) is used anywhere
            C_P ← C_P ∪ (class(m), m)
        end if
    end for
    R ← ∅    ▷ Registration-Callback pairs
    for c in dom(C_P) do
        R ← {(class(r), r) | r ∈ α_F(c) ∧ isPublic(r) ∧ isPublic(class(r))}    ▷ Multimap of possible registration methods for class c keyed by their classes
        for p in C_P(c) do    ▷ Loop through potential callback methods p of class c
            U ← {class(u) | u ∈ parents_{Γ_F}(p)}    ▷ Set of classes that have at least one method calling p
            …
Note that whenever r is a registration method, any Framework Space child r′ of r can be a registration method too, assuming that r′ invokes r with its parameters. Therefore, (r, p) ∈ R ⟹ (r′, p) ∈ R.
As adding (r′, p) to the Registration-Callback map R increases the size of R significantly, the Framework Space Inheritance Graph I_F is provided to the Application Space Analysis to infer such relationships instead.
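Building the Argument Map amounts to inverting the argument-type lists of the Framework Space methods. The sketch below is a minimal illustration; the function name build_argument_map is an assumption, and the three method entries use simplified type names and are not output of the actual analysis.

```python
from collections import defaultdict

def build_argument_map(argument_types):
    """Map each class to the set of methods that take it as an argument."""
    alpha = defaultdict(set)
    for method, arg_classes in argument_types.items():
        for cls in arg_classes:
            alpha[cls].add(method)
    return alpha

# argumentTypes(m) for a few Framework Space methods (simplified type names).
argument_types = {
    "Button.setOnClickListener": ["View.OnClickListener"],
    "LocationManager.requestLocationUpdates": ["String", "long", "float", "LocationListener"],
    "TextView.setText": ["CharSequence"],
}
alpha_f = build_argument_map(argument_types)

# A class that no method accepts as an argument has an empty entry,
# so none of its methods can be callback methods.
unused = alpha_f.get("Runnable", set())
```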
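The two filtering criteria can be sketched as a single pass over the potential callbacks. This is a simplified reading of the procedure and not the authors' implementation: the function name, the input layout and all entries in the toy data (including the Helper class) are assumptions made for illustration.

```python
def registration_callback_pairs(potential, alpha, public, callers, outer_class):
    """Return the set of (registration method, callback method) pairs.

    potential:   callback method -> its class c
    alpha:       class c -> set of (class(r), r) for methods r taking c as an argument
    public:      names of public classes and methods
    callers:     callback method -> classes owning a method that invokes it
    outer_class: nested class -> enclosing class
    """
    pairs = set()
    for m, c in potential.items():
        # Fold nested classes into their enclosing class before comparing.
        invoking = {outer_class.get(u, u) for u in callers.get(m, set())}
        for reg_cls, r in alpha.get(c, set()):
            visible = r in public and reg_cls in public   # criterion 1
            invoked = reg_cls in invoking                 # criterion 2
            if visible and invoked:
                pairs.add((r, m))
    return pairs

# Toy inputs; every name below is illustrative.
potential = {"View.OnClickListener.onClick": "View.OnClickListener"}
alpha = {"View.OnClickListener": {("View", "View.setOnClickListener"),
                                  ("Helper", "Helper.store")}}
public = {"View", "View.setOnClickListener", "Helper", "Helper.store"}
callers = {"View.OnClickListener.onClick": {"View$PerformClick"}}
outer_class = {"View$PerformClick": "View"}

pairs = registration_callback_pairs(potential, alpha, public, callers, outer_class)
```

Helper.store accepts the listener but no method of Helper ever invokes onClick, so it is rejected by the second criterion; only the View registration method survives.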

B. APPLICATION SPACE ANALYSIS
Application Space Analysis extracts the dex file from the APK and parses it to get the sets of Application Space classes and methods, C_A and M_A, respectively. Note that C_A (and M_A) includes the classes (and methods) implemented in the Application Space, along with the references to classes (methods) from the Framework Space. The metagraph {Γ_e}_M of the eFCG is shown in Figure 5. The Application Space Analysis further reduces the eFCG into the R-eFCG Γ_e^{(C)} using the eFCG reducer. These stages of the Application Space Analysis and the nodes and edges they add to the FCG are described in detail in the following paragraphs.
The event handlers are implemented in the Application Space as overriding methods of Framework Space callback methods. The FCG cannot capture this information as the event handler does not call its parent callback method most of the time.
To add the relationship between an event handler and its parent callback method to the FCG, the inheritance hierarchy has to be considered. Edges in E_{parentOf:F→A} are added to represent these cases.

2) Callback Edges Adder
The registration methods and the callback methods are not related in the FCG, as their Caller-Callee relationship cannot be inferred without the help of the results of Framework Space Analysis.
The Registration Callback map R can be used to add edges between the registration methods and the corresponding callback methods. As the Framework Space inheritance information is not contained in I A (thus in E

3) Permission Nodes Adder
The manifest file contains a list of permissions that are required by an app to run. As it is possible to request a permission and not use it [11], the permissions required by the Framework Space methods that are actually used can give the list of permissions actually needed. Axtool [7] provides a mapping Ψ : M_F → P between the Framework Space methods M_F and the Permission Space P. For a Framework method m ∈ M_F, Ψ(m) is the set of permissions required by m.
The permission nodes P and the edges E^{(M)}_{requires} to be added to the FCG are calculated using (2) and (3), respectively.
Underlined nodes in Figure 6a represent the permission nodes P and bold solid edges represent the edges in E^{(M)}_{requires}. Note that the edges in E^{(M)}_{requires} are always from the Framework Space nodes F to the Permission nodes P.
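One plausible reading of this step is that a permission node is added for every permission required by a Framework Space method present in the FCG, together with an edge from that method to the permission. The sketch below follows that reading; it is an assumption rather than a transcription of (2) and (3), and the entries of psi are illustrative values, not taken from the actual mapping of [7].

```python
def permission_nodes_and_edges(framework_nodes, psi):
    """Return (permission nodes, requires edges) for the Framework Space nodes of an FCG.

    framework_nodes: Framework Space methods present in the FCG
    psi:             method -> set of permissions it requires
    """
    edges = {(m, p) for m in framework_nodes for p in psi.get(m, set())}
    nodes = {p for _, p in edges}
    return nodes, edges

# Illustrative method-to-permission mapping.
psi = {
    "LocationManager.requestLocationUpdates": {"ACCESS_FINE_LOCATION", "ACCESS_COARSE_LOCATION"},
    "Camera.open": {"CAMERA"},
}
fcg_framework_nodes = {"LocationManager.requestLocationUpdates", "Button.setOnClickListener"}

perm_nodes, requires_edges = permission_nodes_and_edges(fcg_framework_nodes, psi)
```

CAMERA is not added because Camera.open does not occur in this FCG, which mirrors the idea of keeping only the permissions the code can actually exercise.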

4) Node Attributes Assigner
After adding inheritance edges, callback edges and permission nodes and edges, the FCG becomes heterogeneous. Its nodes consist of Application Space methods M_A, Framework Space methods M_F, and Permissions P. For every node, the attributes are assigned using the attribute scheme A as follows:
• For Framework Space nodes m, A_F(m) is a one-hot vector describing the position of the API package to which m belongs in the API packages list obtained from [4].
• For Application Space nodes m, A_A(m) is a 21-bit Boolean vector representing the opcode groups that are used in its body, as in [35].
• For Permission nodes p, A_P(p) is a concatenation of the one-hot vector of the group that it belongs to and a bit indicating whether it is dangerous or not [5].
Note that the package list in [4] is in alphabetical order. This alphabetical order is not mandatory as long as the same package indices are used during training and testing. The attributes are assigned as a vector h_i^{(0)} for every node i.
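The attribute vectors for Framework and Application Space nodes can be sketched as below. The three-entry package list and the opcode-to-group table are hypothetical placeholders: the real package list comes from [4] and the 21 opcode groups from [35], neither of which is reproduced here.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    """Vector of zeros with a single one at the given position."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def framework_attr(package: str, package_list: list) -> np.ndarray:
    """One-hot position of the API package in a fixed package list."""
    return one_hot(package_list.index(package), len(package_list))

def application_attr(opcodes, opcode_group: dict, n_groups: int = 21) -> np.ndarray:
    """Boolean vector marking which opcode groups occur in a method body."""
    v = np.zeros(n_groups, dtype=np.float32)
    for op in opcodes:
        v[opcode_group[op]] = 1.0
    return v

# Hypothetical inputs: a tiny package list and a partial opcode-to-group table.
packages = ["android.app", "android.location", "android.widget"]
opcode_group = {0x90: 0, 0x91: 0, 0x6E: 1, 0x0E: 2}   # add-int, sub-int, invoke-virtual, return-void

f_vec = framework_attr("android.location", packages)
a_vec = application_attr([0x90, 0x91, 0x0E], opcode_group)
```

Because only the presence of a group is recorded, the two binary-operation opcodes set the same bit and the resulting vector has two ones.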

5) eFCG Reduction
As the eFCG Γ_e contains a large number of nodes and edges, the ability of the Android malware detection model to generalise might be limited [26]. To overcome this problem, the method nodes (F and A) are contracted based on their classes, as in MaMaDroid [26], to get the Reduced-eFCG (R-eFCG), denoted Γ_e^{(C)}. The result for the app in Figure 1a is illustrated in Figure 6. Observe that the registration methods and callback handlers are connected through callback methods in the eFCG.
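Node contraction by class can be sketched in a few lines. This is a simplified illustration on a single untyped edge set; keeping self-loops and merging parallel edges are assumptions, and the method names other than those mentioned in Section II are hypothetical.

```python
def class_of(method: str) -> str:
    """Class part of a 'Class.method' name (toy naming convention)."""
    return method.rsplit(".", 1)[0]

def contract_by_class(method_edges):
    """Collapse method nodes into their classes, merging parallel edges."""
    return {(class_of(u), class_of(v)) for u, v in method_edges}

method_edges = {
    ("MainActivity.onCreate", "Button.setOnClickListener"),
    ("MainActivity.onCreate", "Button.setText"),
    ("MainActivity.onCreate", "MainActivity.initViews"),
    ("MainActivity.onClick", "LocationManager.requestLocationUpdates"),
}
class_edges = contract_by_class(method_edges)
```

Four method-level edges collapse into three class-level edges here; on real apps with many methods per class the reduction is far larger.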

C. GCN CLASSIFIER
The GCN classifier consists of several Heterogeneous GCN layers, each containing a GCN module for every edge type. The eFCG and R-eFCGs are converted into undirected graphs before providing them as the inputs to the GCN classifier by adding a reverse edge type for every edge type present in the graph to ensure that data flow happens between every node type.
Every GCN module is implemented using the GraphConv algorithm [20]. At every layer l, the hidden representation h_i^{(l+1)} of node i with type s ∈ 𝒱 is first calculated using the GCN module of each edge type e = (s, τ), e ∈ ℰ, and the results are then aggregated over the edge types. In these operations, σ is an activation function (ReLU in this work), and W^{(l)} and b^{(l)} are the weight matrix and bias at layer l, respectively. The edge type-wise GCN operation is represented in (4), and its outputs are aggregated in (5). The normalisation coefficient between node i and node j is calculated in (6) to limit the magnitude of the aggregated features. After n convolution layers, the node features of node type τ are aggregated using a readout operation as in (7) (mean in this work) to get h_τ. The readout features for all node types τ ∈ 𝒱 are then concatenated (operator ||) in (8) to get a graph-level embedding vector h.
The graph embedding h can be passed to any downstream task. This work uses a 1-layer fully connected neural network followed by the sigmoid activation function as the classifier. Thus, the probability P that a given eFCG (or R-eFCG) comes from a malware APK is given by (9), where W and b are the weight matrix and bias of the classifier, respectively. For classification purposes, if P > 0.5, the sample is regarded as malware, otherwise as benign.
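The pipeline described above (per-edge-type convolution, aggregation over edge types, mean readout, concatenation and a sigmoid classifier) can be sketched in NumPy. This is a sketch under stated assumptions, not the authors' implementation: symmetric degree normalisation, summation over edge types, the hidden size of 4 and all toy tensors are choices made here for illustration.

```python
import numpy as np

def graph_conv(adj, h_src, weight, bias):
    """One edge-type GCN module; adj[i, j] = 1 if source node j neighbours destination node i."""
    deg_dst = np.clip(adj.sum(axis=1, keepdims=True), 1, None)   # shape (n_dst, 1)
    deg_src = np.clip(adj.sum(axis=0, keepdims=True), 1, None)   # shape (1, n_src)
    norm_adj = adj / np.sqrt(deg_dst * deg_src)                  # assumed symmetric normalisation
    return np.maximum(norm_adj @ h_src @ weight + bias, 0.0)     # ReLU

def hetero_layer(adjs, feats, params):
    """Run one module per edge type and sum the outputs per destination node type."""
    out = {}
    for (s, t), adj in adjs.items():
        w, b = params[(s, t)]
        out[t] = out.get(t, 0) + graph_conv(adj, feats[s], w, b)
    return out

def classify(feats, w, b):
    """Mean readout per node type, concatenation, then a sigmoid classifier."""
    h = np.concatenate([feats[t].mean(axis=0) for t in sorted(feats)])
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

rng = np.random.default_rng(0)
feats = {"A": rng.normal(size=(2, 3)), "F": rng.normal(size=(1, 2))}
adjs = {
    ("A", "A"): np.array([[0.0, 1.0], [1.0, 0.0]]),  # edges among Application Space nodes
    ("A", "F"): np.array([[1.0, 1.0]]),              # A -> F, rows index F nodes
    ("F", "A"): np.array([[1.0], [1.0]]),            # reverse edge type F -> A
}
in_dim = {"A": 3, "F": 2}
params = {k: (rng.normal(size=(in_dim[k[0]], 4)), np.zeros(4)) for k in adjs}

hidden = hetero_layer(adjs, feats, params)
prob = classify(hidden, rng.normal(size=8), 0.0)
is_malware = prob > 0.5
```

The reverse edge type is what lets Framework Space features flow back into Application Space nodes; without it, nodes of type A would only ever aggregate from other A nodes.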

VI. EXPERIMENTS, RESULTS AND ANALYSIS
The experiments to answer the research questions posed in Section I are described in this section, along with their configurations.

B. DATASETS USED
Maldroid2020 [24] and AndroZoo [2] datasets were used to build the model. The dataset balancing approach of [35] was applied on Maldroid2020, with additional APKs added from AndroZoo. The final dataset was balanced both in terms of the number of APKs and node count distribution and contained a total of 11760 APKs. The dataset was divided into training, validation, and testing splits with a ratio of 60%, 20% and 20%, respectively, while ensuring that the node count distribution of all splits remained the same.

C. TRAINING CONFIGURATION
Every GCN model was trained using the Binary Cross-Entropy loss function, as the model was learning a probability distribution. The Adam Optimiser [19] was used to optimise the parameters of the model as it performs well even in its default configuration (learning rate = 10^{-3}). The maximum number of epochs was set to 100, and the model at the epoch e having the minimum validation loss was chosen for testing.
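The loss and the model-selection rule can be written down compactly. The sketch below is illustrative only: the function names are assumptions, epochs are indexed from zero, and the validation losses are made-up numbers rather than results from this work.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

def best_epoch(val_losses):
    """Index of the epoch with the minimum validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals ln 2
chosen = best_epoch([0.70, 0.40, 0.50])                 # made-up validation losses
```

Selecting the checkpoint by validation loss rather than by the last epoch guards against testing a model that has started to overfit.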

D. EXPERIMENTS 1) Ablation Study
To determine the essential node types of eFCG and R-eFCG, an ablation study was conducted by restricting the node types 𝒱 to code = {A}, core = {A, F} and all = {A, F, P}, the last being the full set of node types. The GCN models are trained and tested using this reduced set of node types. Note that the Application Space nodes A are present in all sets as they contain crucial logic that can be used as a behaviour indicator of an app. Thus, the ablation study aims to test whether Framework and Permission nodes improve the performance of the model significantly.

2) Neighbourhood Analysis
Every node configuration was trained with a variable number of Heterogeneous GCN layers starting from n = 0 to test whether an increasing number of GCN layers (thus, a larger neighbourhood) improves the Android malware detection model performance. The case of n = 0 represents the set of baseline Android malware detection models in which aggregated and concatenated node attributes are directly passed as the input to the sigmoid classification model.
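The link between the number of layers and the neighbourhood size follows from message passing: after n layers, a node's embedding can depend only on nodes within n hops. The sketch below makes this concrete with a breadth-first search; the adjacency format and the function name are hypothetical, and the adjacency is assumed to list, for each node, the neighbours it receives messages from.

```python
from collections import deque


def n_hop_neighbourhood(adjacency, start, n):
    """Return the set of nodes within n hops of start (including start).

    adjacency: dict mapping node -> iterable of neighbouring nodes.
    With n = 0 only the node itself is returned, which mirrors the
    baseline models that use node attributes without any convolution.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:
            continue  # do not expand beyond n hops
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen
```

On a chain a, b, c, d, the neighbourhood of a grows by one node per layer, so each extra GCN layer adds both context and computation.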

3) Generalisation Analysis
Every configuration in the ablation study and neighbourhood analysis was conducted for eFCG and R-eFCG by training separate GCNs. These experiments were used to determine whether R-eFCG performs better than eFCG, implying its ability to generalise.

E. EXPERIMENTAL RESULTS AND ANALYSIS
A summary of the experimental results is shown in Table 2. Several insights about the research questions can be drawn from it, and these are discussed in the following sections:

1) Effectiveness of Node types
With Application Space nodes A only, the model achieved a mean accuracy of 84.63% with a standard deviation of 7.79%. With the addition of Framework Space nodes F, the mean accuracy increased by 6.86 percentage points, reaching 91.49% with a standard deviation of 4.29%. The addition of permission nodes improved the mean accuracy by a further 1.58 percentage points, giving a mean accuracy of 93.07% with a standard deviation of 4.02%. The trend of increasing accuracy with the addition of node types is shown in Figure 7. These results emphasise that the Framework Space nodes are crucial for detecting Android malware. Similarly, the permission nodes make a useful contribution to the performance of the model, although they are fewer in number.

2) Effect of neighbourhood size n
With n = 0, the baseline models performed better than a random-guess model, obtaining a mean accuracy of 80.35% with a standard deviation of 7.60%, which suggests that the node attributes play an essential role in detecting Android malware. Each subsequent GCN layer improved the mean accuracy by 9.29%, 2.49%, 0% and 1.27%, respectively. No performance improvement was observed when the third GCN layer was added for the "core" and "all" configurations, and the fourth GCN layer did not improve the accuracy by a significant amount. The variation of accuracy with the number of GCN layers, shown in Figure 8, suggests that n = 2 is a sweet spot between accuracy and inference time, as the number of GCN layers directly affects the inference time.
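The per-layer gains can be accumulated to see the mean accuracy at each n. This is a derived illustration, not a figure reported in the paper: it assumes the stated improvements are additive differences in mean accuracy (in percentage points) relative to the previous number of layers, and the function name is hypothetical.

```python
def cumulative_mean_accuracy(baseline, gains):
    """Accumulate per-layer gains on top of the n = 0 baseline.

    Returns a list whose index is the number of GCN layers n.
    Values are rounded to two decimals to match the reported precision.
    """
    accuracies = [baseline]
    for gain in gains:
        accuracies.append(round(accuracies[-1] + gain, 2))
    return accuracies


# Values quoted in the text: baseline at n = 0 and gains for n = 1..4.
mean_accuracy_by_n = cumulative_mean_accuracy(80.35, [9.29, 2.49, 0.0, 1.27])
```

Under that reading, the first two layers account for 11.78 of the 13.05 percentage points gained over the baseline, which is consistent with treating n = 2 as the sweet spot.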

3) Generalisation ability of R-eFCG
R-eFCGs performed better than eFCGs in all node configurations, as is evident from Table 2. A statistical analysis of the accuracies obtained with eFCGs and R-eFCGs suggests that the R-eFCGs improve the mean accuracy by 2.35% with a standard deviation of 1.25%. The smallest improvements, below 1%, were observed for n = 4 with node configurations "code" and "all", and for n = 2 with node configuration "all". These results suggest that R-eFCGs can generalise better than eFCGs in most cases. In the sweet spot n = 2 with node configuration "all", R-eFCGs can be used as a replacement for eFCGs, thus making inference faster, as they have fewer nodes than eFCGs. Note that the R-eFCG Γ_e^(C) has to be calculated after Γ_e (see Section V-B), which adds an additional computational step. However, the procedure of Section V-B can be easily tuned to output R-eFCGs instead of eFCGs by considering classes instead of methods and using their attribute schemes.
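The reduction from method-level to class-level nodes can be sketched as a node contraction. This is only an illustration of the general idea, not the procedure of Section V-B: the input format and function name are hypothetical, parallel edges are assumed to be merged, intra-class edges are assumed to be dropped, and the merging of node attributes is not shown.

```python
def contract_by_class(method_to_class, edges):
    """Contract method nodes that belong to the same class into one node.

    method_to_class: dict mapping method id -> class id.
    edges: iterable of (caller_method, callee_method) pairs.
    Returns (class_nodes, class_edges) where class_edges is a set, so
    parallel edges between the same pair of classes collapse into one.
    """
    class_nodes = set(method_to_class.values())
    class_edges = set()
    for src, dst in edges:
        c_src, c_dst = method_to_class[src], method_to_class[dst]
        if c_src != c_dst:  # assumption: calls within a class become no edge
            class_edges.add((c_src, c_dst))
    return class_nodes, class_edges
```

Since every class has at least one method, the contracted graph never has more nodes than the original, which is the source of the faster inference noted above.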

Comparison with Related Works
The "core" configuration of this work using eFCGs is conceptually similar to the FCGs used in [35]. While [35] reported an accuracy of 92.29% with 3 GCN layers, the "core" configuration using eFCGs with n = 3 achieved a similar accuracy of 92.08%. The proposed method could not be compared with [9] [40], as they did not incorporate any node-count distribution balancing strategy and did not disclose their datasets.

VII. CONCLUSIONS AND FUTURE WORK
This paper proposed an Android malware detection approach based on heterogeneous Caller-Callee graphs extracted from APK files. First, the heterogeneous graphs eFCG and R-eFCG were defined, and the algorithms to obtain them were discussed. These graphs incorporate the information about callbacks and permissions obtained by the Framework Space Analysis. Then, separate heterogeneous graph models were trained on them to evaluate their performance. Finally, experiments were conducted to determine the optimal neighbourhood size and the essential components of the heterogeneous graphs. As a result of these experiments, a maximum accuracy of 96.28% was obtained.
There is further scope to improve this work in multiple directions. During Framework Space Analysis, the algorithm to find the Registration-Callback map can be made more exact, and its results can be compared and contrasted with those of our approximate method. In Application Space Analysis, the nodes can be assigned more informative features, such as package name-based embeddings for Framework Space nodes and opcode sequence embeddings for Application Space nodes. Finally, explainability methods can be integrated with the GCN models to identify and understand critical nodes that contain malicious code.