Annotative Software Product Line Analysis Using Variability-Aware Datalog

Applying program analyses to Software Product Lines (SPLs) has been a fundamental research problem at the intersection of Product Line Engineering and software analysis. Different attempts have been made to “lift” particular product-level analyses to run on the entire product line. In this paper, we tackle the class of Datalog-based analyses (e.g., pointer and taint analyses), study the theoretical aspects of lifting Datalog inference, and implement a lifted inference algorithm inside the Soufflé Datalog engine. We evaluate our implementation on a set of Java and C-language benchmark annotative software product lines. We show significant savings in processing time and fact database size (billions of times faster on one of the benchmarks) compared to brute-force analysis of each product individually.


INTRODUCTION
Software Product Lines (SPLs) are families of related products, usually developed together from a common set of artifacts. Each product configuration is a combination of features. In particular, annotative SPLs use artifact annotation techniques (e.g., the C Pre-Processor (CPP)) to define which set of product configurations each artifact belongs to. The number of potential products is typically combinatorial in the number of features because each feature can be present or absent in each product configuration. This high level of configurability is usually desired. For example, the open-source Linux kernel is an annotative product line of more than 10,000 features [1]. Different variants of the kernel can be generated at build time by providing different feature combinations (subject to feature dependencies). However, analysis tools (syntax analyzers, type checkers, model checkers, static analysis tools, etc.) typically work on a single product, not the whole annotative SPL. Applying an analysis to each product separately is usually infeasible for non-trivial SPLs because of the exponential number of products [2].
Analysis tools are valuable during the initial implementation of a product line, and they are even more valuable for detecting modifications that break existing functionality during product line maintenance. Since all products of an annotative SPL share a common set of artifacts, analyzing each product individually (usually referred to as brute-force analysis) involves a lot of redundancy. Leveraging this commonality and analyzing the whole product line at once in order to bring down the total analysis time is a fundamental research problem at the intersection of Product Line Engineering and software analysis. Different attempts have been made to lift individual analyses to run on product lines [3], [4], [5], [6], [7], [8], [9]. Those attempts show significant time savings when the annotative SPL is analyzed as a whole compared to the brute-force analysis. The downside of lifting individual analyses, though, is the amount of effort required to correctly lift each of those analyses.
In this paper, we tackle the class of Datalog-based program analyses. Datalog is a declarative query language that adds logical inference to relational queries. Some program analyses (in particular, pointer and taint analyses) can be fully specified as sets of Datalog inference rules. Those rules are applied by an inference engine to facts extracted from a software product. Results are more facts, inferred by the engine based on the rules. The advantage of Datalog-based analyses is that they are declarative, concise and can be efficiently executed by highly optimized Datalog engines [10], [11].
Instead of lifting individual Datalog-based analyses, we lift a Datalog engine. This way any analysis running on the lifted engine is lifted for free. Our approach is not specific to a particular engine though, and can be implemented in others.
Contributions. In this paper, we make the following contributions: (1) We present ^infer, a Datalog inference algorithm lifted to facts extracted from annotative Software Product Lines (throughout, the caret marks lifted artifacts, e.g., ^EDB for the lifted fact database). (2) We state the correctness criteria of lifted Datalog inference and show that ^infer is correct. (3) We implement our lifted algorithm as a part of a Datalog engine. We also extend the Doop pointer analysis framework [12] to extract facts from annotative SPLs. (4) We evaluate our implementation on a sample of pointer and taint analyses applied to a suite of Java benchmarks, and a set of hand-crafted analyses of C programs applied to the BusyBox product line. We show significant savings in processing time and fact database sizes compared to brute-force analysis of one product at a time. For one of the benchmarks, our lifted implementation is billions of times faster than brute-force analysis (with savings in database size of the same order of magnitude).
In this paper, we extend and revise our ESEC/FSE conference paper [13]. In particular: (1) We evaluate our approach on a wider range of product line benchmarks written in Java and the C language, using different feature annotation mechanisms; (2) We discuss and illustrate the distinction between annotation mechanisms (C Pre-Processor (CPP) and CIDE), with emphasis on its impact on variability-aware fact extraction; (3) We discuss variability-aware fact extraction in more detail; (4) We add more elaborate examples and illustrations throughout the paper.
The rest of the paper is organized as follows: We start with background on SPLs and Datalog (Section 2). We provide a theoretical treatment of Datalog inference and how the inference algorithm is lifted, together with correctness criteria and a correctness proof, in Section 3. In Section 4, we describe the implementation of our algorithm in the Soufflé engine. The evaluation process and results are discussed in Section 5. We compare our approach to related work in Section 6 and conclude in Section 7.

BACKGROUND
In this section, we summarize the basic concepts of Software Product Lines, Horn Clauses, Datalog and Datalog-based analyses.

Software Product Lines
A Software Product Line (SPL) is a family of related software products developed together. Different variants of an SPL have different features, i.e., externally visible attributes such as a piece of functionality, support for a particular peripheral device, or a performance optimization.
Definition 1 (SPL). An SPL L is a tuple (F, Φ, D, φ) where: (1) F is the set of features, s.t. an individual product can be derived from L via a feature configuration ρ ⊆ F. A feature configuration can also be represented as a propositional formula over F, conjoining the features included in the configuration with the negations of the features excluded. (2) Φ ∈ Prop(F) is a propositional formula over F defining the valid set of feature configurations. Φ is called a Feature Model (FM). The set of valid configurations defined by Φ is called Conf(L). (3) D is a set of program elements, called the domain model. The whole set of program elements is sometimes referred to as the 150% representation [14]. (4) φ : D → Prop(F) is a total function mapping each program element to a proposition (feature expression) defined over the set of features F. φ(e) is called the Presence Condition (PC) of element e, i.e., it defines the set of product configurations in which e is present.
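Definition 1 can be made concrete with a small executable sketch. The following Python model is illustrative only (the element and feature names are invented, not from the paper): feature configurations are total assignments, the feature model Φ and presence conditions φ are predicates over them, and deriving a product projects the 150% representation onto one configuration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, FrozenSet

Config = Dict[str, bool]   # a total assignment of features to True/False

@dataclass
class SPL:
    features: FrozenSet[str]                   # F
    feature_model: Callable[[Config], bool]    # Phi, as a predicate over configs
    domain: FrozenSet[str]                     # D, the 150% set of elements
    pc: Dict[str, Callable[[Config], bool]]    # phi: element -> presence condition

    def configs(self):
        """Enumerate Conf(L): all feature configurations satisfying Phi."""
        names = sorted(self.features)
        for values in product([False, True], repeat=len(names)):
            rho = dict(zip(names, values))
            if self.feature_model(rho):
                yield rho

    def derive(self, rho: Config):
        """Project the 150% representation onto the product defined by rho."""
        return frozenset(e for e in self.domain if self.pc[e](rho))

# Toy two-feature product line (names are illustrative, not from the paper):
spl = SPL(
    features=frozenset({"FA", "FB"}),
    feature_model=lambda rho: rho["FA"] or rho["FB"],   # Phi: FA or FB
    domain=frozenset({"stmt_a", "stmt_b"}),
    pc={"stmt_a": lambda rho: True, "stmt_b": lambda rho: rho["FB"]},
)
```

With Φ = FA ∨ FB, three of the four configurations are valid, and each valid configuration yields a possibly different subset of the domain model.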
There are different techniques for implementing feature variability in product lines. In annotative SPLs, the code of a single feature can be scattered across the product line code base, and code snippets belonging to that feature are explicitly annotated by some feature identifier. Compositional SPLs on the other hand modularize each feature as a separate unit, and then features are composed together based on the desired product configuration. Examples of compositional techniques are Aspect-Oriented Programming [15] and Feature-Oriented Programming [16]. Delta-oriented SPLs [17] are a third category of product lines, where a base implementation common across product variants is provided, and then feature-specific deltas with respect to that base implementation are provided. In this paper, we only focus on annotative SPL mechanisms.
We consider two variability-annotation mechanisms in this paper: C Pre-Processor (CPP) macro annotations, and CIDE [18] color-based annotations of source code.
CPP Example. The annotative Java product line in Listing 1 has the feature set F = {FA, FB}. Features are annotated using C Pre-Processor (CPP) conditional compilation directives. By defining or not defining macros corresponding to features, different products can be generated from this product line. One example is the product in Listing 2, with FA not defined and FB defined.
CIDE Example. The code snippet in Fig. 1 comes from the Graph Product Line (GPL) benchmark [20], with feature-specific code highlighted in different colors. In this example, the light-blue color denotes the EdgeObjects feature, while the violet color denotes the feature combination EdgeObjects ∧ Weighted.
CIDE only allows well-behaved annotations, i.e., annotations enforcing the syntactic correctness and type-checking of each variant. For example, because the declaration of the weight field is highlighted in violet, all code snippets referring to this field are highlighted in the same color.
The Feature Model (FM) of a product line can be represented graphically in a Feature Diagram (FD). Fig. 2 is the FD of the GPL product line. Features (abstract and concrete) are recursively decomposed in tree-form, where all leaves are concrete features. Some features are marked as mandatory (they must exist in each product instance) and others as optional. Mutually-exclusive features are marked using the alternative notation. Inter-tree dependencies between features can be added on top of the tree structure of the FD using explicit propositional invariants (the propositional equalities below the diagram).

Horn Clauses
A Horn Clause (HC) is a disjunction of distinct propositional literals with at most one positive literal. For example, (¬a ∨ ¬b ∨ c ∨ ¬d) is an HC, which can also be written as a reverse implication (c ← (a ∧ b ∧ d)), where c is called the head and (a ∧ b ∧ d) is called the body of the clause. The language of HCs is a fragment of Propositional Logic that can be checked for satisfiability in time linear in the size of the clause set, as opposed to general propositional satisfiability, which is NP-complete [21].
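The decision procedure behind this bound is unit propagation: repeatedly derive the head of any clause whose body is already known true, and fail only if a clause with no positive literal fires. A short Python sketch, assuming clauses are given as (head, body) pairs with head = None for all-negative clauses (the linear bound additionally needs watched-literal bookkeeping; this version rescans, which is quadratic but easier to read):

```python
def horn_sat(clauses):
    """Satisfiability of a conjunction of Horn clauses by unit propagation.
    Each clause is (head, body): head is a proposition, or None for a clause
    with no positive literal; body is a set of propositions."""
    true = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if body <= true:
                if head is None:
                    return False     # an all-negative clause is violated
                if head not in true:
                    true.add(head)
                    changed = True
    return True

facts = [("a", set()), ("b", set()), ("c", {"a", "b"})]
```

Here the clause set {a, b, c ← a ∧ b} is satisfiable, while adding the goal clause (¬c), i.e., (None, {"c"}), makes it unsatisfiable.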

Datalog
Datalog is a declarative database query language that extends relational algebra with logical inference [22]. Datalog inference rules are HCs in First Order Logic, where atoms are predicate expressions, not just propositional literals. A fact is a ground rule with only a head and no body. Syntactically, the ':-' symbol is usually used instead of backward implication, and atoms in the body are separated by commas instead of the conjunction symbol.
For example, given a binary predicate Parent, a binary predicate Ancestor can be defined using the following two rules:

Ancestor(x,y) :- Parent(x,y).
Ancestor(x,z) :- Parent(x,y), Ancestor(y,z).
That is, if x is a parent of y, then x is an ancestor of y. In addition, if x is a parent of y, and y is an ancestor of z, then x is an ancestor of z. Fig. 3a defines the grammar of Datalog clauses as follows: (1) building blocks are finite sets of constants, variables and predicate symbols; (2) a term is a constant or a variable symbol; (3) a predicate expression is an n-ary predicate applied to arguments; (4) a fact is a ground predicate expression, i.e.,  all of its arguments are constants; (5) a rule is a Horn Clause of predicate expressions, where each variable appearing in the head also appears in the body of the clause; and (6) a Datalog clause is either a fact or a rule.
A Datalog program is a finite set of rules, usually referred to as the Intensional Database (IDB), which operates on a finite set of facts called the Extensional Database (EDB). The inference algorithm (explained next) repeatedly applies the rules to the facts, inferring new facts and adding them to the EDB, until a fixed point is reached (i.e., no more new facts can be inferred).

Inference Algorithm (Algorithm 1)
For each rule R, the algorithm checks whether the EDB has facts fulfilling the premises of R, with a consistent assignment of variables to constants (Fig. 3b). If it does, the head of that rule is inferred as a new fact F. If F doesn't already exist in the EDB, it is added to it. Newly inferred facts may trigger some of the rules again; this process continues until a fixed point is reached, i.e., no new facts are inferred. This algorithm (called forward chaining [23]) is guaranteed to terminate because it does not create any new constants, and runs in polynomial time w.r.t. the number of input clauses [23]. Different Datalog engines (e.g., Soufflé [10], LogicBlox [24]) implement different kinds of data-structure and algorithmic optimizations to lower the computational cost of each iteration in the inference algorithm.
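A minimal Python sketch of this forward-chaining loop, run on the Ancestor example above, may help. It is naive by design: real engines such as Soufflé use semi-naive evaluation and indexed relations instead of rescanning the whole EDB. The sketch assumes rule arguments are all variables and facts are ground tuples of the form (predicate, (constants...)).

```python
def infer(rules, facts):
    """Naive forward chaining: apply every rule under every consistent
    variable assignment until a fixed point is reached."""
    edb = set(facts)
    while True:
        new = set()
        for head, body in rules:
            for theta in matches(body, edb, {}):
                fact = substitute(head, theta)
                if fact not in edb:
                    new.add(fact)
        if not new:
            return edb
        edb |= new

def matches(body, edb, theta):
    """Yield every assignment unifying the remaining body atoms with EDB facts."""
    if not body:
        yield theta
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in edb:
        if fpred != pred or len(fargs) != len(args):
            continue
        t = dict(theta)
        # bind each variable, or check consistency if it is already bound
        if all(t.setdefault(a, c) == c for a, c in zip(args, fargs)):
            yield from matches(rest, edb, t)

def substitute(atom, theta):
    pred, args = atom
    return (pred, tuple(theta[v] for v in args))

rules = [
    (("Ancestor", ("x", "y")), [("Parent", ("x", "y"))]),
    (("Ancestor", ("x", "z")), [("Parent", ("x", "y")), ("Ancestor", ("y", "z"))]),
]
facts = {("Parent", ("alice", "bob")), ("Parent", ("bob", "carol"))}
result = infer(rules, facts)
```

The fixed point contains the two direct Ancestor facts plus the transitive Ancestor(alice, carol).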

Example of a Datalog Analysis
Some program analyses [12], [25], [26], [27] can be written in Datalog as sets of clauses. Facts relevant to the analysis are extracted from the program to be analyzed, and then fed into a Datalog engine together with the analysis clauses. Fact extraction is usually analysis-specific because different analyses work on different aspects of the program. One example of Datalog-based analyses is pointer analysis. Pointer analysis [28] determines which objects might be pointed to by a particular program expression. This whole-program analysis is over-approximating in the sense that it returns a set of objects that might be pointed to by each pointer, possibly with false positives. Fig. 4 shows a set of Datalog rules for a simple context-insensitive pointer analysis [19]. Each predicate defines a relation between different artifacts. VarPointsTo(v, h) states that pointer v might point to a heap object h. New(v, h) means an allocation of a heap object h is assigned to a pointer v. Assign(u, v) captures an assignment from a pointer v to a pointer u. For object fields, HeapPointsTo(o, f, h) means that a field f of an object o might point to a heap object h. Load(u, v, f) represents a load of a field f of an object pointed to by v into a pointer u. Similarly, Store(u, f, v) represents a store of the object pointed to by v to a field f of the object pointed to by u.
The first three rules specify the conditions for this predicate to hold: either a new object is allocated and a pointer is initialized; a pointer that already points to an object is assigned to another pointer; or an object field points to a heap object, and that field is assigned to another pointer. The fourth rule states that assigning a value to an object field results in that field pointing to the same object as the right-hand side of the assignment. Fig. 4b shows the facts corresponding to the program in Listing 2. The first two are object allocation facts; the third is an assignment fact, and the fourth and the fifth are store and load facts, respectively. Fig. 4c shows the results of running the Datalog inference algorithm on those rules and facts. The example in Fig. 4a is called a context-insensitive pointer analysis because it does not distinguish between different objects, call sites and types in a class hierarchy. More precise context-sensitive pointer analyses take different kinds of context into consideration [28]. For example, a 1-call-site-sensitive analysis considers method call sites. A 1-object-sensitive analysis (similarly, 1-type-sensitive) includes object allocation sites (types of objects allocated) as part of the context.
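The four rules can be evaluated directly as a fixed-point computation. The sketch below does exactly that in Python; the input facts are hypothetical, loosely modeled on the running example (p = new A(); q = new B(); s = p; p.f = q; r = p.f) rather than a verbatim transcription of Fig. 4b.

```python
def points_to(new, assign, load, store):
    """Least fixed point of the four context-insensitive rules.
    new: {(v, h)}, assign: {(u, v)}, load: {(u, v, f)}, store: {(u, f, v)}."""
    vpt = set(new)            # rule 1: VarPointsTo(v,h) :- New(v,h)
    hpt = set()               # HeapPointsTo(o, f, h)
    while True:
        n_vpt, n_hpt = set(vpt), set(hpt)
        for (u, v) in assign:             # rule 2: assignment copies points-to
            n_vpt |= {(u, h) for (w, h) in vpt if w == v}
        for (u, v, f) in load:            # rule 3: field load
            for (w, o) in vpt:
                if w == v:
                    n_vpt |= {(u, h) for (p, g, h) in hpt if (p, g) == (o, f)}
        for (u, f, v) in store:           # rule 4: field store
            for (w, o) in vpt:
                if w == u:
                    n_hpt |= {(o, f, h) for (x, h) in vpt if x == v}
        if (n_vpt, n_hpt) == (vpt, hpt):
            return vpt, hpt
        vpt, hpt = n_vpt, n_hpt

vpt, hpt = points_to(
    new={("p", "A"), ("q", "B")},
    assign={("s", "p")},
    load={("r", "p", "f")},
    store={("p", "f", "q")},
)
```

The fixed point derives HeapPointsTo(A, f, B) from the store, and from that VarPointsTo(r, B) through the load, mirroring the chain of rules described above.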

LIFTING DATALOG
In this section, we present our approach to lifting Datalog abstract syntax and the Datalog inference algorithm. We also formally state the correctness criteria for lifted Datalog inference, and present a correctness proof of our lifted algorithm.

Annotated Datalog Clauses
When analyzing a single software product, an initial set of facts is extracted from product artifacts, and analysis rules are applied to those facts, eventually adding newly inferred facts to the initial set. In the case of SPLs, a fact might be valid only in a subset of products, and not necessarily the entire product space. We have to associate a representation of that subset with each of the extracted facts. Similar to SPL annotation techniques, a Presence Condition (PC) is a succinct representation that can be used to annotate facts. Fig. 3c presents the grammar of Datalog clauses, annotated with presence conditions over a set of features.
Facts annotated with PCs are called lifted facts, and are stored in a lifted Extensional Database, ^EDB. Given a feature configuration ρ, we define ^EDB|ρ (restriction) to be the set of facts from ^EDB which exist in the product configuration defined by ρ:

^EDB|ρ = {f | (f, pc) ∈ ^EDB ∧ sat(pc ∧ ρ)}.

When the Datalog inference algorithm is applied to annotated facts, we have to take the PCs attached to facts into account. Whenever the inference algorithm generates a new fact, we need to associate a PC with it. The core of the lifted Datalog inference (Algorithm 2) is the lifted Modus Ponens inference rule (^MP) in Fig. 5. If C is generated from premises F1, F2, ..., Fn with PCs pc1, ..., pcn, then the PC pc_c attached to C should be the conjunction of the input PCs, i.e., pc1 ∧ ... ∧ pcn. Intuitively, pc_c represents the set of products in which C exists, which is the intersection of the sets of products in which the premises exist.
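The restriction operator and the PC conjunction of lifted Modus Ponens can be sketched in a few lines of Python, modeling PCs as predicates over total feature assignments (fact strings and feature names below are illustrative, not from the paper):

```python
from itertools import product

FEATURES = ("FA", "FB")

def sat(pc):
    """A PC is a predicate over total feature assignments; it is satisfiable
    iff it holds for some assignment (exhaustive check, fine for few features)."""
    return any(pc(dict(zip(FEATURES, bits)))
               for bits in product([False, True], repeat=len(FEATURES)))

def conj(*pcs):
    """Lifted Modus Ponens: the conclusion's PC conjoins the premises' PCs."""
    return lambda rho: all(pc(rho) for pc in pcs)

def restrict(lifted_edb, config):
    """EDB-hat restricted to a configuration:
    {f | (f, pc) in EDB-hat, sat(pc /\\ config)}."""
    return {f for (f, pc) in lifted_edb if sat(conj(pc, config))}

fa = lambda rho: rho["FA"]
true = lambda rho: True
lifted_edb = {("New(p,A)", fa), ("Assign(q,p)", true)}
rho = conj(lambda r: r["FA"], lambda r: not r["FB"])   # configuration FA, not FB
```

Restricting to a configuration with FA keeps both facts; restricting to one without FA drops the fact guarded by FA, exactly as the set-theoretic definition prescribes.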
To avoid having too many generated facts that are practically vacuous, we check pc_c for satisfiability. If it isn't satisfiable, then its corresponding fact exists in the empty set of products, i.e., it is non-existent. Those facts can be safely removed from ^EDB, optimizing the number of facts stored in the database, and potentially improving the performance of inference. There is a trade-off, though, between the computational complexity of satisfiability checking (an NP-complete problem) and optimizing the database size. We address this trade-off in more detail in Section 5.1.3.

Lifted Inference Algorithm
The structure of this algorithm is similar to that of Algorithm 1, with the exception of conjoining the presence conditions of the facts used in inference according to ^MP (Fig. 5), and assigning the conjunction as the presence condition of the result. There are four cases for (c, pc_c) to consider. (1) If sat(pc_c) is False (pc_c is not satisfiable), then this result is ignored because it doesn't exist in any valid product. (2) If (c, pc_c) ∈ ^EDB, then this result is also ignored because it already exists for the same set of products. (3) If (c, pc_d) ∈ ^EDB for some pc_d ≠ pc_c, then (c, pc_d) is replaced by (c, pc_c ∨ pc_d) in ^EDB. This means we are expanding the already existing set of products in which c exists to also include the set denoted by pc_c. This also accounts for the case where the products denoted by pc_c are a subset of the products denoted by pc_d: in such cases, the disjunction of pc_c and pc_d is equal to pc_d, and the existing set of products is not expanded. Avoiding storing the same fact multiple times in the database, each time with a different presence condition, optimizes the database size, at the expense of the cost of disjoining those presence conditions. It also makes the addition of lifted aggregation functions (e.g., sum, count) more straightforward in the future. (4) If c doesn't exist at all in ^EDB, we add (c, pc_c) to it. For example, when the lifted inference algorithm is applied to the rules in Fig. 4a and the annotated facts in Fig. 6, the last two facts of the result show that the pointer r might point to the object A when the feature FA is present, while it might point to the object B if FA is absent.
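The four cases reduce to a compact insertion routine once PCs are modeled extensionally as frozensets of configurations (so conjunction is intersection, disjunction is union, and satisfiability is non-emptiness). The following sketch is illustrative of the case analysis, not the paper's implementation:

```python
def add_lifted(edb, c, pc_c):
    """Insert a newly inferred lifted fact (c, pc_c) into edb (a dict mapping
    each fact to its PC). Returns True iff edb changed."""
    if not pc_c:                      # case 1: unsatisfiable PC, vacuous fact
        return False
    old = edb.get(c)
    if old is None:                   # case 4: fact not present at all
        edb[c] = pc_c
        return True
    if pc_c <= old:                   # case 2 (and the subset part of case 3)
        return False
    edb[c] = old | pc_c               # case 3: widen the set of products
    return True

# PCs over two features, as sets of (named) configurations:
PC_FA = frozenset({"FA.FB", "FA.!FB"})         # configs where FA holds
PC_NOT_FA = frozenset({"!FA.FB", "!FA.!FB"})   # configs where FA is absent
edb = {}
```

Inserting a fact twice with the same PC is a no-op, an unsatisfiable PC is dropped, and re-inserting the fact under a disjoint PC widens its product set, as case (3) describes.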

Correctness Criteria
When applying the lifted inference algorithm ^infer to a set of rules IDB and a set of annotated facts ^EDB, we expect the result to be exactly the union of the results of applying infer to the facts of each product individually. Moreover, each clause in the result of ^infer has to be properly annotated (i.e., its presence condition has to represent exactly the set of products having this clause in their un-lifted analysis results).

Theorem 1. Given an SPL L = (F, Φ, D, φ), a set of rules IDB, and a set of lifted facts ^EDB annotated with feature expressions over F, for each valid feature configuration ρ ∈ Conf(L):

^infer(^EDB)|ρ = infer(^EDB|ρ).

Proof. The proof breaks the equality into two implications of set membership (forward and backward). Both subproofs rely on the fact that ^infer is monotonic, in the sense that new facts are added to ^EDB without removing any of the existing facts. The logic behind inferring a new fact is summarized by the ^MP inference rule (Fig. 5). As stated in Algorithm 2 (lines 11-12), when an element is removed from ^EDB, it is replaced by the same fact with an extended set of products. The algorithm is also guaranteed to terminate because in each iteration either a new fact is added to the fact base, or the set of products of an existing fact is expanded. Because new facts do not introduce new symbols, this procedure always reaches a fixed point.
(⊆) We show that C ∈ ^infer(^EDB)|ρ implies C ∈ infer(^EDB|ρ), by structural induction over the derivation tree of C. Base Case: The base case is a clause C that already exists in the input of the lifted inference algorithm, i.e., for some presence condition pc, (C, pc) ∈ ^EDB. Since C ∈ ^infer(^EDB)|ρ, sat(pc ∧ ρ) holds and, subsequently, C ∈ ^EDB|ρ (by definition of the restriction operator). Since inputs are already included in the output of infer, C ∈ infer(^EDB|ρ). Induction Hypothesis: For each rule R = C :- L1, ..., Ln in the IDB and a variable assignment γ, if [γ]Li ∈ ^infer(^EDB)|ρ then [γ]Li ∈ infer(^EDB|ρ). Induction Step: Since C ∈ ^infer(^EDB)|ρ, there is a presence condition pc where (C, pc) ∈ ^infer(^EDB) and sat(pc ∧ ρ) (definition of the restriction operator). C is derived using an n-ary rule R ∈ IDB, where the unification of each of the premises Li of R using variable assignment γ exists in ^infer(^EDB) with a presence condition pc_i, and pc = pc_1 ∧ ... ∧ pc_n. Since pc ∧ ρ is satisfiable, sat(pc_i ∧ ρ) holds for each i (each of the conjuncts has to be satisfiable). Then each [γ]Li ∈ ^infer(^EDB)|ρ, so by the induction hypothesis [γ]Li ∈ infer(^EDB|ρ), and thus C ∈ infer(^EDB|ρ). (⊇) We show that C ∈ infer(^EDB|ρ) implies C ∈ ^infer(^EDB)|ρ, again by structural induction over the derivation tree of C. Base Case: The base case is a clause C that already exists in the input, i.e., for some presence condition pc, (C, pc) ∈ ^EDB. Since C ∈ infer(^EDB|ρ) and C is an input, pc ∧ ρ is satisfiable (definition of the restriction operator). Then C ∈ ^infer(^EDB)|ρ. Induction Hypothesis: For each rule R = C :- L1, ..., Ln in the IDB and a variable assignment γ, if [γ]Li ∈ infer(^EDB|ρ) then [γ]Li ∈ ^infer(^EDB)|ρ.

Induction Step: Since C ∈ infer(^EDB|ρ), C is derived using an n-ary rule R ∈ IDB, where the unification of each of the premises Li of R using variable assignment γ exists in infer(^EDB|ρ). Then the unification of each of those premises using γ also exists in ^infer(^EDB)|ρ (induction hypothesis). For each premise ([γ]Li, pc_i), pc_i ∧ ρ is satisfiable (definition of the restriction operator). Since ρ defines a single product instance, pc_i ∧ ρ = ρ for each of the premises, so the conjunction of the presence conditions of the premises together with ρ also equals ρ. Then (C, pc_1 ∧ ... ∧ pc_n) ∈ ^infer(^EDB) with sat((pc_1 ∧ ... ∧ pc_n) ∧ ρ), i.e., C ∈ ^infer(^EDB)|ρ. Note that the way we expand the set of products associated with a fact (line 10, Algorithm 2) preserves the correctness guarantees. The rationale is that if a fact holds for a set of products s1 represented by pc_1, and another set s2 represented by pc_2, then intuitively it holds for the union of both sets (s1 ∪ s2), represented by pc_1 ∨ pc_2.
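The correctness criterion can also be checked mechanically on a toy instance. The sketch below (illustrative only) runs the two Ancestor rules both ways, with PCs modeled extensionally as frozensets of configuration indices, and confirms that restricting the lifted result to any configuration yields exactly the brute-force result for that configuration:

```python
# Configurations over two features; a PC is the frozenset of config indices
# it covers (TRUE covers all of them).
CONFIGS = [frozenset(c) for c in (set(), {"FA"}, {"FB"}, {"FA", "FB"})]
TRUE = frozenset(range(len(CONFIGS)))

def pc_of(feature):
    return frozenset(i for i, c in enumerate(CONFIGS) if feature in c)

def infer(parents):
    """Plain (single-product) inference for the two Ancestor rules."""
    anc = set(parents)
    while True:
        new = {(x, z) for (x, y) in parents for (y2, z) in anc if y == y2} - anc
        if not new:
            return anc
        anc |= new

def lifted_infer(parents):
    """Lifted inference: rule 1 (Ancestor from Parent) is the initialization;
    rule 2 intersects premise PCs (lifted Modus Ponens) and widens an
    existing fact's PC by union."""
    anc = dict(parents)
    changed = True
    while changed:
        changed = False
        for (x, y), p1 in list(parents.items()):
            for (y2, z), p2 in list(anc.items()):
                if y != y2:
                    continue
                pc = p1 & p2                      # lifted Modus Ponens
                old = anc.get((x, z), frozenset())
                if pc and not pc <= old:
                    anc[(x, z)] = old | pc        # widen the product set
                    changed = True
    return anc

parents = {("a", "b"): TRUE, ("b", "c"): pc_of("FA")}
lifted = lifted_infer(parents)
```

The derived fact Ancestor(a, c) ends up annotated with exactly the PC of its FA-guarded premise, and per-configuration restriction agrees with brute-force inference in all four configurations.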
The lifted Datalog inference algorithm, together with input facts annotated with presence conditions, forms the foundation for variability-aware Datalog analyses.

IMPLEMENTATION
In this section, we describe the design and implementation modifications we made to the Soufflé Datalog engine [29] to make it variability-aware (Section 4.1). Several other Datalog engines are used as backends for program analysis frameworks (e.g., LogicBlox [24]). We chose to lift Soufflé because it implements several optimizations that make it scalable for analyzing relatively big systems, it is available in open source, and it serves as the backend for the Doop pointer analysis framework [12] that we use in our evaluation. Doop implements several pointer and taint analyses of Java programs, where each analysis is represented as a set of Datalog rules. Doop extracts syntactic facts from Java bytecode and passes those facts and the Datalog rules of the particular analysis chosen by the user to a backend Datalog engine (the default backend is Soufflé). The results of the analysis are then generated by the Datalog engine. We outline the modifications we made to the Doop fact extractor (extracting facts from Java bytecode) to make it variability-aware (Section 4.2.1). Finally, we outline the design of a C-language fact extractor, and a set of Datalog analyses for C programs (Section 4.2.2).
The overall architecture of the two analysis pipelines is outlined in Fig. 7. The Doop pipeline takes "Java Bytecode" files (150% representations of product lines annotated with CIDE) as input, and the "Java Fact Extractor" (a part of the Doop framework) extracts Datalog facts from them. Those facts, together with the "Java Rules", are sent to the "V-Soufflé" variability-aware Datalog engine, which generates the results.
The C-language pipeline takes "C Program" source files (150% representations of product lines annotated with the C Pre-Processor (CPP)) as input. Because variability in those files is annotated using the CPP, we use "TypeChef" to parse them. TypeChef generates "V-AST" (Variability-aware Abstract Syntax Tree) and "V-CFG" (Variability-aware Control Flow Graph) files, from which the "C Fact Extractor" extracts Datalog facts. The code analyses are implemented as a database of Datalog "C Rules" which are passed to V-Soufflé together with the extracted facts. V-Soufflé generates the analysis results.

Lifting Soufflé
Soufflé [10] is a highly optimized open-source Datalog engine. Our modified variability-aware version of Soufflé (V-Soufflé), available online at https://github.com/ramyshahin/souffle, implements our lifted Datalog inference algorithm ^infer (Section 3). As seen in Fig. 8, a Soufflé program (a Datalog IDB) is passed to a "Datalog to RAM Compiler", which parses the program and translates it into a "Relational Algebra Machine (RAM) Program". RAM is a Relational Algebra language [30] extended with a fixed-point looping operator. Depending on a command-line argument, Soufflé then either interprets the RAM program on the fly using the "RAM Interpreter", or synthesizes a "C++ Program" that is semantically equivalent to the RAM program using the "C++ Synthesizer". The C++ program can then be compiled using an off-the-shelf "C++ Compiler" into a native "Executable". Native executables are usually at least an order of magnitude faster than interpreted RAM programs [10]. In this paper, we only lift the Soufflé interpreter, which involved modifying the Datalog to RAM Compiler and the RAM Interpreter components.

(Fig. 7 caption: End-to-end architecture for both the Doop-based Java analysis pipeline and the C-language analysis pipeline. Data files and databases are inputs/outputs to logic components (programs and Datalog rules). Components created by us (C Fact Extractor and C Rules) have a red border; components modified by us (Java Fact Extractor and V-Soufflé) have a light-blue border.)
Datalog to RAM Compiler. At the syntax level, we extended the Soufflé language with fact annotations: propositional formulas prefixed with '@'. The Soufflé parser is extended with a syntactic category for propositional formulas. AST nodes for facts are extended with a PC field, with a default value of True. Propositional variables are added to a symbol table separate from the one holding Soufflé identifiers.
As a part of compiling Soufflé programs into RAM, we turn syntactic presence conditions into Binary Decision Diagrams (BDDs). We use the CUDD library [31] as a BDD engine, and on top of it maintain a map from textual presence conditions to their corresponding canonical BDDs. As defined in ^infer, when facts are resolved with a rule, the conjunction of their PCs becomes the conclusion's PC.
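The point of the map onto BDDs is canonicalization: syntactically different but logically equivalent PCs collapse to one shared representation. A minimal Python stand-in (truth tables instead of CUDD BDDs, with a memo dict playing the role of the textual-PC map; the expression syntax is a Python boolean expression over feature names, an assumption of this sketch rather than the Soufflé '@' syntax):

```python
from itertools import product

FEATURES = ("FA", "FB")
_cache = {}

def canon(pc_text):
    """Canonicalize a textual PC into its truth table over all feature
    assignments (a poor man's BDD), memoized so equivalent PCs share one
    representation."""
    if pc_text not in _cache:
        table = []
        for bits in product([False, True], repeat=len(FEATURES)):
            env = dict(zip(FEATURES, bits))
            table.append(bool(eval(pc_text, {"__builtins__": {}}, env)))
        _cache[pc_text] = tuple(table)
    return _cache[pc_text]
```

With this, "FA and FB" and "FB and FA" map to the same canonical object, just as they would share one BDD node in the real implementation; a production engine uses reduced ordered BDDs precisely because truth tables grow exponentially in the number of features.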
RAM Interpreter. The RAM Interpreter takes a RAM program as input, and executes it on the fly. This involves both running the Datalog forward-chaining inference algorithm (infer) to infer new facts, and applying relational algebra operators such as selection, projection and joins. Soufflé implements several indexing and query optimization techniques to improve inference time. To keep our modifications independent of those optimizations, we add the presence condition as a field opaque to the query engine. We only manipulate this field as a PC when performing clause resolution, which takes place at a higher level than the details of indexing and query processing. This way we avoid touching relatively complex optimization code, while preserving the semantics of our lifted inference algorithm.
Some relational features of Soufflé were not lifted. For example, aggregation functions (sum, average, max, min, etc.) still return singleton values. None of those functions is used by Doop or our C-language analyses on lifted facts, so this does not affect the correctness of our results. We plan to address this general limitation in the future to be able to support a wider range of analyses, particularly those that aggregate results over multiple facts.

Fact Extraction
Datalog analyses are not applied directly to programs. Instead, Datalog facts are extracted from the source code, and those facts are the input upon which a Datalog analysis operates. Extracted facts are syntactic in the sense that they correspond to syntactic structures. Variability-aware facts have presence conditions associated with them, denoting the set of products in which a given fact exists. In particular, since several syntactic tokens can contribute to a fact, intuitively, the fact exists only if all the tokens contributing to it exist. In other words, the presence condition of a fact is the conjunction of the presence conditions of all of its tokens.
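This conjunction rule is simple enough to sketch directly. The helper below is illustrative: PCs are plain strings, '&&' is an assumed textual conjunction syntax (not necessarily the extractor's actual format), and tokens annotated True contribute nothing.

```python
def fact_pc(token_pcs):
    """PC of an extracted fact: the conjunction of the PCs of all tokens
    contributing to it; trivially-true token PCs are dropped."""
    parts = [pc for pc in token_pcs if pc != "True"]
    return " && ".join(f"({p})" for p in parts) if parts else "True"
```

A fact built from tokens guarded by FA and FB gets the PC (FA) && (FB); a fact whose tokens are all unconditional keeps the implicit PC True.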
In this subsection, we outline our approaches to variability-aware fact extraction from Java bytecode and from C-language source code.

Lifting Doop
As a part of our evaluation of the Datalog lifting approach outlined in Section 3, we modified the Doop [12] Datalog-based pointer analysis framework (available online at https://bitbucket.org/rshahin/doop), which uses Soufflé as its underlying Datalog engine. Fig. 7 shows the Doop architecture together with the rest of the analysis pipeline. Doop is an extensible family of pointer and taint analyses implemented as Datalog rules. In addition, it includes a Java bytecode fact extractor. Doop users select a particular analysis among the available analyses through a command-line argument. The rules corresponding to the chosen analysis (the IDB), together with the extracted facts (the EDB), are then passed to Soufflé.
Since Doop extracts syntactic facts, we need to identify the PCs of each of the syntactic tokens contributing to a fact, and associate the conjunction of those PCs as the fact PC. We had to do this for each type of fact extracted by Doop. The fact PC is just added to a fact as a trailing PC field, prefixed with '@'. Facts with no PC field are assumed to belong to all products (an implicit PC of True).
Our Doop modifications were limited to the fact extractor. None of the Doop Datalog rules were changed. Our fact extraction modifications were scattered because Doop implements extractors for different kinds of facts separately. However, all of our changes were systematic and non-invasive.
All the Java benchmarks we used to evaluate lifted Doop were annotated using CIDE (Section 2.1). The presence condition of a fact is the conjunction of the feature expressions of each of the tokens contributing to it. We use line numbers to find the feature expression of each token. Java bytecode tokens already have their respective source code line numbers stored as metadata. We had to extract the coloring information for each line of code from the CIDE system, and map each line to the feature expression corresponding to its highlighting color. A line's feature expression is then assigned to the tokens belonging to that line.
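The line-number lookup described above amounts to two table lookups. The sketch below is hypothetical (the color names, the '&&' syntax, and the map format are invented for illustration, not CIDE's actual data model):

```python
# Hypothetical CIDE color map: highlighting color -> feature expression.
COLOR_TO_FEXPR = {"lightblue": "EdgeObjects", "violet": "EdgeObjects && Weighted"}

def token_fexpr(line_colors, line_no):
    """Feature expression of a bytecode token, looked up through the source
    line number recorded in its metadata; uncolored lines map to True."""
    color = line_colors.get(line_no)
    return COLOR_TO_FEXPR[color] if color else "True"
```

A token on a violet-highlighted line inherits EdgeObjects && Weighted; tokens on unannotated lines keep the implicit PC True.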

C-Language Fact Extraction
In this section, we discuss our C-language fact extraction framework and its implementation; see Fig. 7 for the high-level architecture. Unlike our Java fact extractor, we designed our C fact extractor from scratch. Existing frameworks often used for C program analysis, such as CIL (https://github.com/cil-project/cil) and LLVM (https://llvm.org/), only work with pre-processed code, with no support for variability. Performing analysis directly on C instead of LLVM IR or CIL is significantly more difficult due to the more complex syntax and semantics of C. We describe our framework below.
Variability Annotations in C. Variability can be added to C programs using the #ifdef directive. Fig. 9 shows a simple program with two features, ASSIGN and CREATE. Variability can be nested; in this example, both features need to be set to True in order to compile code under the ASSIGN section.
V-AST and V-CFG Generation. We use TypeChef [6] to generate a Variability-Aware Abstract Syntax Tree (V-AST) and a Variability-Aware Control Flow Graph (V-CFG). A V-AST is a standard AST representation of the program, with additional nodes representing presence conditions. A V-CFG is a standard representation of the control flow of a program, with a feature presence condition annotation added to each node. Fig. 11 shows the V-CFG generated by TypeChef for the program in Fig. 9. In the generated file, each node record starts with the character "N", followed by the identifier of the node, the line number (or -1 if not relevant), the function in which the node is located, and the variability condition. Each edge record starts with the character "E", followed by the IDs of the predecessor and successor nodes, and the variability condition of the edge.
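A parser for this node/edge format might look as follows (a sketch based only on the description above; the field delimiter is an assumption and may differ from TypeChef's actual output):

```python
def parse_vcfg_line(line):
    """Parse one record of the V-CFG dump described above.
    Assumes ';'-separated fields; TypeChef's actual delimiter may differ."""
    parts = line.strip().split(";")
    if parts[0] == "N":  # node: id, line number (-1 if not relevant), function, PC
        return ("node", parts[1], int(parts[2]), parts[3], parts[4])
    if parts[0] == "E":  # edge: predecessor id, successor id, PC
        return ("edge", parts[1], parts[2], parts[3])
    raise ValueError("unknown V-CFG record: " + line)
```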
V-AST Processing. Algorithm 3 shows the procedure for generating variability-aware facts from a V-AST. It uses FactDef, a pair of functions fits() and gets(), to define a requirement for a fact to extract, such as a variable assignment, and to generate that fact from an AST node if it fits the requirement. Fig. 10 shows an example V-AST generated by TypeChef for the program in Fig. 9. Each PresenceConditionNode in the V-AST corresponds to an ifdef directive in the program. Fig. 12 shows the facts generated from the V-AST in Fig. 10, where the FactDef functions identify and extract facts for pointer analysis, as described in Section 5.2.
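The core of Algorithm 3 can be sketched as follows (with our own minimal node representation, not TypeChef's actual V-AST classes):

```python
class Node:
    """Minimal stand-in for a V-AST node (not TypeChef's real classes)."""
    def __init__(self, kind, children=(), condition=None, payload=None):
        self.kind, self.children = kind, list(children)
        self.condition, self.payload = condition, payload

def extract_facts(node, fact_def, curr_cond=()):
    """fact_def is a (fits, gets) pair: fits() tests whether a node yields a
    fact of interest; gets() builds the fact from the node. curr_cond
    accumulates the conjunction of enclosing presence condition nodes."""
    fits, gets = fact_def
    if node.kind == "PresenceConditionNode":
        curr_cond = curr_cond + (node.condition,)
    facts = []
    if fits(node):
        facts.append((gets(node), " & ".join(curr_cond) or "True"))
    for child in node.children:
        facts.extend(extract_facts(child, fact_def, curr_cond))
    return facts
```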
We assume that the V-AST generated by TypeChef is correct and that FactDef correctly identifies nodes of interest. Our correctness criterion is that the presence condition of the extracted fact is True if and only if the presence conditions of all tokens that contributed to that fact are True. In this case, since the fact is generated from a single AST subtree using FactDef on line 8, we can guarantee that the presence condition of that fact is True iff the presence condition of the subtree is True; this is because currCond on line 3 is set to the conjunction of all enclosing presence condition nodes. Finally, the algorithm assumes that there is no variability within a statement. To handle statements that do contain variability, we first generate a V-CFG with TypeChef (this process automatically splits such statements) and then parse the ASTs of the resulting statements.

V-CFG Processing. Algorithm 4 shows the procedure we use to extract facts from a V-CFG. This algorithm relies on an external function cfgAnalysis(). For a given CFG, this function returns a list of desired relationships, and a list of nodes related to each relationship. We used OCamlGraph (http://ocamlgraph.lri.fr/) to assist with control-flow analysis, specifically with identifying node domination relationships. The cfgAnalysis procedure can be a wrapper around OCamlGraph or a different directed-graph analysis utility. Fig. 13 shows simple dominator relationship facts extracted from the program in Fig. 9 using Algorithm 4. The first and second parameters represent the dominating and the dominated nodes, respectively, and cfgAnalysis() is a function that finds all dominator pairs in a directed graph. We assume that the V-CFG generated by TypeChef is correct and that cfgAnalysis() works as described.
Our correctness criterion is that the presence condition of the extracted fact is True if and only if the presence conditions of all tokens that contributed to that fact are True. The presence conditions of all CFG nodes are collected on line 5, and then conjoined to generate a fact on line 7. Therefore, the presence condition of the generated fact evaluates to True only if the presence conditions of all contributing nodes are True.
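The conjoining step of Algorithm 4 can be sketched as follows (simplified; PCs are plain strings here rather than BDDs):

```python
def extract_cfg_facts(node_pc, relationships):
    """node_pc: node id -> presence condition (a string here, a BDD in our
    implementation). relationships: cfgAnalysis()-style output, i.e., pairs
    of (fact, ids of the CFG nodes the fact involves)."""
    facts = []
    for fact, node_ids in relationships:
        # Collect the PCs of all nodes contributing to the fact ...
        conditions = [node_pc[n] for n in node_ids]
        # ... and conjoin them into the fact's presence condition.
        facts.append((fact, " & ".join(conditions) or "True"))
    return facts
```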

EVALUATION
We evaluate the performance of our lifted version of Soufflé using two sets of benchmarks: three Doop analyses applied to five Java benchmark product lines (previously used in the evaluation of other lifted analyses [3], [32]), and the three hand-crafted C-language analyses outlined in Section 5.2 applied to the BusyBox product line.
The primary goal of our experiments is to compare the performance of lifted analyses applied to the SPLs to that of running the corresponding product-level analyses on each of the valid configurations individually. Since the number of valid product configurations for some benchmarks is relatively large, it is neither practical nor particularly useful to enumerate and analyze all of the valid products. Instead, for each SPL, we run the product-level analysis on two code-base subsets: the base code common across all variants, and the 150% representation (the whole SPL codebase, implementing all feature behaviors). Although these two extremes are not necessarily valid products, they are the lower and upper bounds in terms of code size, and averaging over them approximates an "average" valid product. The expected brute-force performance is the average valid product performance (P-Avg) multiplied by the number of valid configurations.
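This brute-force estimate is simple arithmetic; a minimal sketch (the function and parameter names are ours):

```python
def brute_force_estimate(base_metric, full_metric, num_valid_configs):
    """Estimate the cost of analyzing every valid product individually:
    average the base-code and 150%-representation measurements (P-Avg),
    then scale by the number of valid configurations."""
    p_avg = (base_metric + full_metric) / 2
    return p_avg * num_valid_configs
```

For instance, with a 1,000 ms base run, a 2,000 ms 150% run, and 32 valid configurations, the estimate is 1,500 ms x 32 = 48,000 ms.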
We split our evaluation into two parts, fact extraction and inference, and evaluate performance in terms of both processing time and space (size of the fact database, in kilobytes (KB)). Our primary research questions are:
RQ1: How do the fact extraction time and the size of the extracted fact database of lifted analyses compare to brute-force fact extraction?
RQ2: How do the Soufflé inference time and the size of the inferred database of lifted analyses compare to brute-force analysis?
RQ3: How consistent are the savings in inference time and size of the fact database across different languages, and across different annotation mechanisms?

Java Benchmarks
In this subsection, we present a set of evaluation experiments on three analyses from the Doop Java pointer analysis framework. The lifted analyses are applied to five different Java benchmark product lines.

Models and Methods
For each of the Java benchmarks, Table 1 lists its size (in thousands of lines of code), number of features, and number of valid configurations according to its feature model.

Fig. 11. V-CFG generated by TypeChef for the program in Fig. 9. Lines that start with "N" and "E" identify a node and an edge, respectively. The numbers represent identifiers for nodes, and identifiers of both nodes for edges.

We use three analyses from the Doop framework in our evaluation of the Java benchmarks:
Context-Insensitive Pointer Analysis (insens). This is the simplest (and least precise) pointer analysis in the Doop framework. It does not take any context (e.g., object context, call context) into consideration when trying to resolve which heap objects each pointer might be pointing at.
One-Type Heap-Sensitive Pointer Analysis (1Type+Heap). This is a pointer analysis that takes the type (Java class) in which a heap allocation takes place into consideration. Allocations taking place in the same Java class are merged together.
One-Call-Site Heap-Sensitive Taint Analysis (Taint-1Call+Heap). This is a taint analysis that takes the context of the direct caller of a function (but not callers of callers) into consideration [28]. We use the default sources, sinks, transform, and sanitization functions curated in Doop for the JDK and Android [27].
All experiments were performed on a Quad-core Intel Core i7-6700 processor running at 3.4 GHZ, with 16 GB RAM and hyper-threading enabled, running 64-bit Ubuntu Linux (kernel version 4.15).
Pointer and taint analyses work on the whole program, including library dependencies. Since general-purpose libraries usually do not have any variability, the comparison between lifted and single-product analyses is independent of them. Moreover, time spent in analyzing library code, and space taken by their facts, might skew the overall results. We restrict our experiments to application code and direct dependencies only using the Doop command-line argument "-Xfacts-subset APP_N_DEPS".
Doop extracts its facts from Java byte-code. However, SPL annotation techniques work at the source-code level. Feature selection usually takes place at compile-time, which means an SPL codebase is compiled into a single product. To get around this limitation, we had to choose benchmarks that only have disciplined annotations [18], in the sense that adding or removing an annotation preserves the syntactic correctness of the 150% representation. This is not a limitation of our lifted inference algorithm though.
The benchmarks we chose are annotated using CIDE [18], which uses different highlighting colors as presence conditions. We had to extract this color information from CIDE, together with the mapping from colors to locations of tokens (line and column number) in source files. Our fact extractor uses byte-code symbol information to locate tokens, and assigns their presence conditions based on CIDE colors.

Table 2 summarizes the "average" performance of product-level fact extraction (P-Avg) and that of lifted fact extraction for the entire product line (SPL). For each of the three analyses, we compare fact extraction time (in milliseconds) and the size of the extracted database (in KB). For example, for context-insensitive analysis, the average fact extraction time for a single Prevayler product is 1,416 ms, and the average size of the extracted fact database is 3,230 KB. On the other hand, extracting facts from the whole Prevayler SPL at once takes 1,554 ms, and the extracted fact database is 4,407 KB. The difference between P-Avg Time and SPL Time is very small for all three analyses and five benchmarks, which is expected since extraction is syntactic and thus its time is proportional to code-base size, not the number of features. The size of the extracted database is noticeably bigger for lifted extraction (DB SPL columns) because lifted facts are augmented with presence conditions.

Fact Extraction
To evaluate the savings attributed to lifted fact extraction compared to brute-force extraction in terms of time and space, we compute the speedup and space-saving factors (P-Avg x |Conf(L)| / SPL). Fig. 14 shows a log-scale bar graph of lifted fact extraction speedup and space savings for context-insensitive analysis. The other two analyses exhibit a similar trend and are omitted here. The figure shows that the time and space savings are proportional to the number of valid configurations of the product line. For example, Lampiro has 2048 valid configurations, and its lifted fact extraction is 2020 times faster than brute force, with a database 2045 times smaller than the total space of the brute-force databases. On the other hand, Prevayler has only 32 valid configurations, with an insens lifting speedup factor of 29 and a space-saving factor of 23. Since different analyses typically require different facts, the size of the fact database also varies from one analysis to another. Experimental results do not show a direct correlation between an analysis and the size of its fact database. For example, in Lampiro, the Taint-1Call+Heap databases are significantly bigger than those of 1Type+Heap. BerkeleyDB, on the other hand, exhibits the opposite trend.

Fig. 13. Extracted variable node dominator relationship facts from Fig. 11. The first parameter is the identifier of the dominating node and the second is the dominated node. The presence conditions were simplified where possible for readability; the simplification is not done as part of the algorithm.

Inference

Table 3 summarizes the performance of lifted analyses on the entire product line (SPL) and that of product-level analyses on an average product (P-Avg). For example, when running 1Type+Heap on an average MM08 product, inference is estimated to take 5,106 ms, resulting in a 19,058 KB database. Running the same analysis on the entire MM08 product line takes 5,142 ms, resulting in a 29,453 KB database. Fig. 15 is a log-scale bar graph of the speedup factor and the DB space-saving factor for insens. Speedup and space-saving trends are again proportional to the number of valid configurations. For example, for BerkeleyDB, lifted insens is about 7.4 billion times faster than brute force, with a DB 5.6 billion times smaller. All three analyses show similar speedup and disk space saving trends.
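Both the speedup and space-saving factors are computed the same way; a small sketch using the Prevayler insens fact-extraction numbers quoted earlier (the function name is ours):

```python
def saving_factor(p_avg, num_valid_configs, spl):
    """Speedup (or space-saving) factor of a lifted analysis relative to
    brute force: P-Avg * |Conf(L)| / SPL."""
    return p_avg * num_valid_configs / spl

# Prevayler, insens fact extraction: P-Avg = 1,416 ms, 32 valid
# configurations, SPL = 1,554 ms; this yields the reported factor of ~29.
speedup = saving_factor(1416, 32, 1554)
```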
Recall that the theoretical bottleneck of the lifted inference algorithm (Algorithm 2) is the satisfiability checks performed when conjoining two PCs (Section 3.1). Since propositional satisfiability is NP-complete, we wanted to evaluate whether it is a bottleneck in practice. While SAT checks are not required to maintain correctness of the lifted inference algorithm, we perform them in order to avoid generating spurious facts that do not exist in any product. An UNSAT presence condition denotes an empty set of products, but what about PCs denoting sets of invalid product configurations? The Feature Model (FM) of a product line specifies which product configurations are valid and which are not. If a fact belongs only to a set of configurations excluded by the FM, then this fact can be removed. Removing spurious facts saves DB space, and, more importantly, it keeps the set of facts searched by the inference algorithm as small as possible, improving the overall performance. We study the impact of SAT checking and using the FM below.
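The role of the SAT check can be illustrated with a deliberately naive PC representation: a PC as the set of configurations it denotes (our implementation uses BDDs; this sketch only models the same semantics, and all names and features here are ours):

```python
from itertools import product

FEATURES = ["ASSIGN", "CREATE"]   # hypothetical feature set

def configs():
    """All feature-selection combinations over FEATURES."""
    return [dict(zip(FEATURES, bits))
            for bits in product([False, True], repeat=len(FEATURES))]

def denote(pred):
    """The set of configurations (by index) satisfying a predicate."""
    return frozenset(i for i, c in enumerate(configs()) if pred(c))

def conjoin(pc1, pc2):
    return pc1 & pc2     # conjunction of PCs = intersection of product sets

def is_sat(pc):
    return bool(pc)      # an UNSAT PC denotes the empty set of products
```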
RQ2.1: How much does SAT checking contribute to the processing time of the lifted Datalog engine? Table 4 summarizes the performance of our lifted analyses and the same analyses with SAT checking disabled (noSAT). Figs. 16 and 17 show the noSAT-associated speedup and database size savings, respectively. Recall that we represent PCs using BDDs. SAT checking over BDDs is a constant-time operation [21]. Since conjoining and disjoining BDDs can take exponential time, we disable all BDD operations, keeping only the textual representation of PCs. A speedup factor below 1.0 means that disabling SAT checks slows down inference. This is what we observed for most of the benchmarks. We believe that the slowdown is due to the use of a textual representation of PCs, which results in a much bigger PC table with slower lookup times. We also do not see any DB savings because non-canonically represented PCs tend to be longer than BDD-based ones, resulting, on average, in more characters (and bytes) per PC. We note that the number of features is relatively low in all of our benchmarks. BDD-based SAT solving is known to perform well when the number of propositional variables is this small. With product lines of hundreds or thousands of features, it is possible that noSAT might result in performance improvements.
RQ2.2: What is the effect of taking the feature model (FM) of an SPL into consideration when running Datalog variability-aware analyses, in terms of inference time and DB size? Table 5 compares the performance of our lifted analyses against the same analyses using the feature model (SAT+FM). SAT+FM entails conjoining the feature model with each PC before performing the satisfiability check. If the PC encodes a set of products excluded by the FM, the conjunction is unsatisfiable. Figs. 18 and 19 show the SAT+FM-associated speedup and space savings, respectively. For most of the experiments, using the FM results in slowdowns and larger DBs. FM usage reduces the number of inferred facts, as observed in Table 6, but the reduction is relatively small. On the other hand, PCs conjoined with the FM are more complex, taking longer to construct (hence the performance penalty) and more bytes to store (hence the bigger DBs).
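FM-based pruning can be sketched in the same set-based model of PCs used above (an illustration only; the FM and the facts here are hypothetical, and our implementation uses BDDs):

```python
def keep_fact(fact_pc, fm):
    """Keep a fact only if its PC is satisfiable under the feature model.
    PCs and the FM are modeled as sets of configuration indices; an empty
    intersection means the fact belongs only to configurations the FM
    excludes, so it can be removed."""
    return bool(fact_pc & fm)

fm = {0, 1, 3}                                    # FM-valid configurations
facts = {("p", "h1"): {1, 2}, ("q", "h2"): {2}}   # fact -> PC
pruned = {f: pc for f, pc in facts.items() if keep_fact(pc, fm)}
```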

C-Language Benchmarks
In this subsection, we present the Datalog rules we used as our benchmarks. We then outline the evaluation process and present the evaluation results.

Models and Methods
We evaluated our C-language analyses on C files from the Busybox product line of Unix command-line tools (version 1.18.5). Out of 522 Busybox files, 33 were excluded due to errors when parsing C files with TypeChef, 12 due to errors parsing TypeChef's V-AST with our fact extractor, and over 260 due to lack of variability in the V-CFG. We also excluded the CONFIG_NOMMU feature because we found that it appears in every node of the V-CFG generated by TypeChef. The resulting 229 files were used in our evaluation.
We implemented three Datalog analyses, which we describe below.
Context-Insensitive Pointer Analysis. Similar to the context-insensitive pointer analysis in Doop, the goal is to statically identify the set of heap objects to which each pointer might be pointing. Extracted facts include variable assignments and functions returning pointers. Since this is a context-insensitive analysis, all variables that are assigned a return value from the same function are assumed to point to the same location in memory, except for memory allocation functions such as malloc. Fig. 12 shows an example of input facts used by this analysis, with assign facts representing assignment statements and allocHeap facts representing malloc statements. The analysis generates pointsTo facts indicating a mapping between the variables and the allocated heap space.
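A minimal fixpoint over such lifted facts might look as follows (an illustrative sketch of the rule semantics, not our actual Datalog rules; PCs are strings here rather than BDDs):

```python
def points_to(assign_facts, alloc_facts):
    """Minimal context-insensitive points-to sketch over lifted facts.
    assign_facts: (dst, src, pc) variable copies; alloc_facts: (var, heap, pc)
    malloc-style allocations. PCs of combined facts are conjoined (shown
    here as string conjunction for readability)."""
    pts = {(v, h, pc) for v, h, pc in alloc_facts}
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for dst, src, pc1 in assign_facts:
            for v, h, pc2 in list(pts):
                if v == src:
                    fact = (dst, h, f"({pc1}) & ({pc2})")
                    if fact not in pts:
                        pts.add(fact)
                        changed = True
    return pts
```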
Control Flow Graph Reducibility. A Control Flow Graph (CFG) is said to be reducible if the graph has no cycles after all back-edges are removed. An edge from node a to node b is a back-edge if b dominates a [33]. This analysis identifies all possible cycles in the graph using Johnson's algorithm [34]. We used Pietro Abate's functional OCaml implementation [35] for fact extraction. This analysis produces a large number of facts, as they include all dominance relationships between all nodes in the CFG. For example, in a purely linear program, each additional statement is dominated by every previous statement, so the number of facts grows quadratically with the number of statements.
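The back-edge and reducibility definitions can be sketched as follows (a simplified illustration, assuming dominator sets are already computed, as OCamlGraph does in our pipeline):

```python
def back_edges(edges, dom):
    """Edges a -> b where b dominates a (dom: node -> set of dominators)."""
    return {(a, b) for (a, b) in edges if b in dom.get(a, set())}

def is_reducible(edges, dom):
    """Reducible iff the graph is acyclic once all back-edges are removed."""
    back = back_edges(edges, dom)
    succ = {}
    for a, b in edges:
        if (a, b) not in back:
            succ.setdefault(a, []).append(b)
    nodes = {n for e in edges for n in e}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def cyclic(u):                     # DFS-based cycle detection
        color[u] = GRAY
        for v in succ.get(u, []):
            if color[v] == GRAY or (color[v] == WHITE and cyclic(v)):
                return True
        color[u] = BLACK
        return False
    return not any(color[n] == WHITE and cyclic(n) for n in nodes)
```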
Definition-Usage Chains. Definition-usage chains analysis (def-use) finds pairs of statements in the CFG where a variable is assigned (or defined) and then referenced [36]. We used a Datalog implementation of this analysis from the Soufflé tutorials. Extracted facts for this analysis include all edges in the CFG, variable loads (references), and variable stores (assignments). Each load and store fact contains both the identifier of the relevant CFG node and the variable being stored or loaded. The edge facts correspond to the entire V-CFG generated by TypeChef for that program. The inference algorithm computes and outputs def-use CFG node pairs using the rules of this analysis.
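The def-use rule can be sketched operationally as follows (a simplified, variability-free illustration of the analysis, not the Soufflé tutorial code itself):

```python
def def_use(edges, stores, loads):
    """def-use pairs: a store of variable v at node d reaches a load of v
    at node u along CFG edges with no intervening store of v."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    store_nodes = {}
    for n, v in stores:
        store_nodes.setdefault(v, set()).add(n)
    pairs = set()
    for d, v in stores:
        # Search forward from the definition, stopping at redefinitions.
        frontier, seen = list(succ.get(d, [])), set()
        while frontier:
            n = frontier.pop()
            if n in seen:
                continue
            seen.add(n)
            if (n, v) in loads:
                pairs.add((d, n, v))
            if n not in store_nodes.get(v, set()):   # stop at a redefinition
                frontier.extend(succ.get(n, []))
    return pairs
```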
We compared our lifted approach against the estimate of brute-force analysis (see Section 5.1). Fig. 22 shows the distribution of features per file for the files we analyzed. The majority of the files had between zero and two features, and thus we expected the performance of our analysis to be similar to our brute-force estimate. 90 files had three or more features, and thus we expected our approach to yield a more substantial improvement in these cases. We performed the experiments on a server with a 3 GHz, 32-core, 64-bit processor and 128 GB of RAM, running Ubuntu Linux 18.04.

Fig. 21b shows pointer analysis performance, in ms, of our approach compared to the brute-force estimate. Our approach demonstrated clear, near-constant growth in fact extraction time, while brute force demonstrated exponential growth. The difference between our approach and the brute-force estimate is similar in the def-use and reducibility analyses. However, these analyses result in larger differences between files with the same number of features. This is because they rely on the CFG structure and the relationships between CFG nodes; these can vary vastly depending on the structure and the size of the file, resulting in much larger running times for some files, independent of the number of features. We can observe the same pattern for the size of the extracted fact database across all analyses, as shown in Fig. 20. For both storage usage and performance, our approach showed the most significant gains over brute force in files with four or more features. Since fact extraction is a relatively cheap process, the gains in files with fewer than four features are often marginal. The improvement is comparable to that of fact extraction discussed in Section 5.2.2. The benefit in terms of the size of the final inference output database is similar to that of the input fact database.
We can observe a significant improvement in both performance and storage usage using our approach in files with as few as two to three features, depending on the analysis. Inference is generally significantly more expensive than fact extraction. Pointer analysis is an exception, as the number of generated pointsTo facts usually grows linearly with the number of assign and other extracted facts.

Answers to RQ1, RQ2, and RQ3
Evaluation results of the C-language benchmarks reinforce those of the Java benchmarks. For both sets of benchmarks, savings in fact extraction time and extracted fact database size grow exponentially with the number of configurations (RQ1). Similarly, savings in inference time and inferred database size grow exponentially with the number of configurations (RQ2). In addition, the consistency of evaluation results across the two benchmark languages (Java and C) and the two variability annotation mechanisms (CPP and CIDE) allows us to answer RQ3: neither the language of the benchmarks nor the annotation mechanism seems to affect performance gains for fact extraction or inference.

Threats to Validity
For internal threats, we note that all of our Java benchmarks are CIDE product lines. While our lifting approach and implementation are not specific to CIDE (as manifested by our CPP-based evaluation experiments), CIDE limitations make the benchmarks biased towards specific annotation patterns. For example, only well-behaved annotations are allowed. Furthermore, since feature expressions do not support feature negation, all input PCs, as well as conjunctions over those PCs, are satisfiable. We experimented with disabling satisfiability checks (RQ2.1) to see how much they affect performance (although they always return True for this set of benchmarks). As noted previously, the overhead of those checks is marginal.

Another internal threat is that we approximate average product performance using only two samples (the maximum and the minimum). These averages are not expected to be completely accurate, but are used to give a brute-force estimate. Our experiments show performance improvements of several orders of magnitude, so we believe that our approximation (compared to more elaborate configuration sampling techniques) can be tolerated.

We did not measure the amount of effort we put into making fact extractors variability-aware. We assume that at least a product-level fact extractor (like the Doop fact extractor) already exists; otherwise, product-level analyses would not be usable. Variability-aware modifications to the product-level fact extractor depend on the programming language, the variability annotation mechanism, and the kind of facts to be extracted.
Finally, our attempt to answer RQ3 took only one step in the direction of validating the generality of results. All of the Java analyses we used come from the Doop framework, and all of our C-language analyses were handcrafted, and applied to only one product line (BusyBox). The inference engine is independent of the fact extraction mechanism used, so facts can be extracted in a similar way from annotative product lines implemented in any programming language.
Mitigating the above threats to validity would involve targeting product lines implemented in more languages, and a wider range of analyses. Practically, the limiting factor here is the need to implement variability-aware fact extractors for each of those languages, which is not trivial. We plan to expand our evaluation to include additional Java and C-language analyses in the future.

RELATED WORK
Different kinds of software analyses have been re-implemented to support product lines [37]. For example, the TypeChef project [6], [7] implements variability-aware parsers [6] and type checkers [7] for Java and C. The SuperC project [5] is another C-language variability-aware parser. The Henshin [38] graph transformation engine was lifted to support product lines of graphs [9]. Those lifted analyses were written from scratch, without reusing any components from their respective product-level analyses. Our approach, on the other hand, lifts an entire class of product-level analyses written as Datalog rules, by lifting their inference engine (and extracting presence conditions together with facts). Our language-based lifting approach has been extended in [39] to a functional programming language. SPL Lift [3] extends IFDS [40] data flow analyses to product lines. Model checkers based on Featured Transition Systems [41] check temporal properties of transition systems where transitions can be labeled by presence conditions. Both of these SPL analyses use almost the same single-product analyses on a lifted data representation. At a high level, our approach is similar in the sense that the logic of the original analysis is preserved, and only data is augmented with presence conditions. Still, our approach is unique because we do not touch any of the Datalog rules comprising the analysis logic itself.
Syntactic transformation techniques have been suggested for lifting abstract interpretation analyses to SPLs [8]. This line of work outlines a systematic approach to lifting abstract interpretation analyses, together with correctness proofs. Yet this approach is not automated which means lifted analyses still need to be written from scratch, albeit while being guided by some systematic guidelines.
Datalog engines have been used as backends by several program analysis frameworks. In addition to Doop, examples of analysis frameworks based on logic programming include XSB [26], bddbddb [42] and Paddle [11]. DIMPLE [25] is another declarative pointer analysis framework where rules are written in Prolog. To the best of our knowledge, all those program analysis frameworks have been targeting single products. Our primary contribution is lifting this class of analyses to SPLs in a generic way, without making any analysis-specific assumptions. In addition, our approach can be systematically implemented in any Datalog engine used by any of those frameworks.

CONCLUSION
In this paper, we presented an algorithm for lifting Datalog-based software analyses to annotative SPLs. We implemented this algorithm in the Soufflé Datalog engine and evaluated its performance on three program analyses from the Doop framework and three hand-crafted C-language analyses, over a suite of Java and C-language SPL benchmarks. Comparing our lifted implementation to brute-force analysis of each product individually, we show significant savings in terms of processing time and database size.
Our Soufflé implementation only lifts the interpreter, not the code generator (compiler). Aggregation functions (e.g., sum, count) are not currently lifted either. We plan to address these implementation-level limitations in future work. We also plan to evaluate lifted Soufflé on analysis frameworks other than Doop. Another track for future work is lifting Datalog rules, not just facts. This would allow us to apply a product line of analyses to an SPL all at once. For example, the rules of the different flavors of pointer and taint analysis in the Doop framework can be annotated with presence conditions, and the entire set of Doop analyses (or a user-specified subset) can be applied to a single product or to a product line at once. This would add another dimension of performance savings because of the overlapping rules between the different analyses.
Our work can also be extended to lift analysis and verification tools based on Constrained Horn Clauses (CHC) [43] to support SPLs.