Looking For Novelty in Search-based Software Product Line Testing

Testing software product lines (SPLs) is difficult due to the huge number of possible products to be tested. Recently, there has been a growing interest in similarity-based testing of SPLs, where similarity is used as a surrogate metric for t-wise coverage. In this context, one of the primary goals is to sample, using search-based algorithms, a subset of test cases (i.e., products) that is as dissimilar as possible, thus potentially covering more t-wise combinations. Prior works have shown, by means of empirical studies, the great potential of current similarity-based testing approaches. However, the rationale of this testing technique deserves a more rigorous exploration. To this end, we perform a correlation analysis to investigate the internal relationship between similarity metrics and t-wise coverage. We find that similarity metrics generally have a significantly positive correlation with t-wise coverage. This explains why similarity-based testing works: improvements in similarity metrics will potentially increase t-wise coverage. Moreover, we explore, for the first time, the use of the novelty search (NS) algorithm for similarity-based SPL testing. The algorithm rewards "novel" individuals, i.e., those different from previously discovered individuals, which perfectly matches the goal of similarity-based SPL testing. We demonstrate that the novelty score in NS has a (much) stronger positive correlation with t-wise coverage than the widely used fitness function employed in the genetic algorithm (GA). Empirical results on 31 software product lines, either real or artificial, validate the superiority of NS over GA, as well as over other state-of-the-art approaches. Finally, we investigate how the performance of NS is affected by the way new products are generated and by its key parameters. In summary, looking for novelty provides an alternative way of generating test cases for SPL testing.


INTRODUCTION
A software product line (SPL) is a family of related products that are built from a set of features, with a feature being some aspect of the system functionality [1], and a product being a set of selected/deselected features. These products, which share some common features (i.e., commonality), are distinguished by the specific features they provide (i.e., variability). The commonality and variability of an SPL are often organized by a feature model (FM) [2], [3], which defines all the possible software products by expressing relationships and constraints among features [4]. The adoption of SPLs can benefit the industry in different respects, such as decreasing implementation costs, reducing time to market, and improving product quality [5]. There is evidence of numerous companies such as Bosch, Philips, Siemens, Boeing and Toshiba applying SPLs to develop software [6], [7], [8].
Despite the benefits that SPLs bring, they raise new challenges with respect to ensuring the reliability of an SPL. In this regard, testing software product lines becomes crucial to avoid fault propagation to the derived products [9]. Indeed, a defect in a single feature may exist in thousands or even millions of products [7]. Nevertheless, SPL testing is an inherently difficult task because the number of possible products induced by a given FM potentially grows exponentially with the number of features. Ideally, one would like to test all of the valid products, but this is rarely possible in practice due to the huge number of possible products to be tested. Moreover, even if one could test all valid products (for small-scale FMs), doing so is likely to be inefficient because software testers often have a limited amount of test resources [8], such as time and budget. Therefore, software testers are seeking sampling approaches to reduce the number of products under test so as to meet release deadlines or resource constraints [10]. Combinatorial interaction testing (CIT) [11] is, among others, a prominent approach that has been introduced to reduce the size of test suites while achieving a certain coverage of feature combinations [12], [13]. This approach systematically samples from a large domain of test data. It is based on the observation that most faults are caused by interactions among a small number of features [14]. For example, Kuhn et al. [14] have shown, using empirical data, that the interactions between two features are capable of disclosing 93% of the bugs in a large distributed system developed at NASA Goddard Space Flight Center. In particular, t-wise CIT requires finding a minimal subset of products that covers each of the possible interactions between t features by at least one product under test [15].
Since this is an NP-hard problem, several greedy or metaheuristic approaches have been proposed to conduct t-wise sampling, such as MoSo-PoLiTe [16], CASA [17], Chvatal [12], [18], ICPL [19], ACTS [20], IncLing [21], and those proposed in [22], [23], [24]. A review of product sampling tools for SPLs can be found in a recent survey [10].
However, existing t-wise sampling techniques face a scalability issue [25]. For large real-world SPLs, they often run out of memory, do not terminate, or take too much running time [4], [25], [26], [27], [28]. Although t-wise testing can drastically reduce the number of products to consider (compared to testing all possible products exhaustively), this number can still be too large to fit the allocated test budget. This is particularly true if the number of features and/or the strength t is large. For example, according to [19], 480 products are required to achieve full 2-wise coverage for the Linux kernel FM (2.6.28.6-icse11) with 6,888 features [29]. Similarly, Song et al. [30] mentioned that one needs to test too many products to obtain full 7-wise coverage of a realistic system. In fact, the maximum number of all the t-wise combinations of features is C(n, t) · 2^t [14], where n is the number of features. Clearly, a huge number of products is required to cover all the combinations when n and t are large. In practice, most real-world SPLs are variability-intensive, with hundreds or even thousands of features [29], such as the Linux kernel [29], Automotive02 [31] and Eclipse [32]. Moreover, there is a practical need to deal with high interaction strengths (t > 2) [33], [34], [35]. Petke et al. [34] showed that higher-strength test suites are able to find more faults, and Kuhn et al. [14] indicated that almost all defects can be found at 6-wise coverage.
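As a quick illustration of this growth, the bound C(n, t) · 2^t can be computed directly (a Python sketch; the function name is ours):

```python
from math import comb

def max_twise_combinations(n: int, t: int) -> int:
    """Upper bound on the number of t-wise feature combinations:
    choose t of the n features, and each chosen feature can be
    either selected or deselected (2**t assignments)."""
    return comb(n, t) * 2 ** t

# Even the pairwise case of a 6,888-feature model (e.g., the Linux
# kernel FM) yields nearly 95 million combinations:
print(max_twise_combinations(6888, 2))  # 94875312
```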
As an avenue to circumvent the scalability issue faced by t-wise CIT, similarity testing [36], [37], [38], [39], a technique that selects a subset of test cases by increasing the diversity among them so as to achieve a given goal, such as maximizing the fault detection rate or structural coverage criteria, was introduced to SPL testing. Henard et al. [4] applied similarity testing to SPLs in order to sample and prioritize products to test. The key idea is to mimic t-wise testing by maximizing diversity among products. In their work, a genetic algorithm (GA) is designed to evolve a population of products by optimizing a fitness function defined based on a similarity metric. Their results showed that the approach is promising for large SPLs and high interaction strengths, forming a scalable and flexible alternative to t-wise testing. Later, Fischer et al. [40] empirically assessed this approach with respect to fault detection capability in the Drupal [41] case study. They reported that Henard's similarity approach is useful for detecting interaction faults in Drupal, while also yielding a t-wise coverage comparable to that of CASA [17]. These findings indicate that similarity is a suitable surrogate metric for t-wise coverage. Using a different similarity metric, Al-Hajjaji et al. [9] also adopted similarity testing for SPLs. In particular, they focused on similarity-based prioritization, which orders a set of products before they are tested. In a subsequent study [7], this work was substantially extended by the same authors with experimental evaluations using three SPLs with real faults in their source code. There have been a number of other works exploiting similarity in the context of SPL testing, e.g., [21], [42], [43], [44], [45], [46], [47], [48]. This is also the topic we focus on in this article.
The suitability of similarity as a surrogate metric for t-wise coverage was demonstrated in the aforementioned prior works mainly by observing and analyzing empirical results. However, there has been no in-depth explanation of this phenomenon, particularly from the statistical point of view. In this paper, we make a first step along this line. In particular, a correlation analysis is performed to explore the internal relationship between similarity metrics and t-wise coverage in the context of SPL testing. Our primary finding is that similarity metrics (e.g., the fitness used by Henard et al. [4]) have a significantly positive correlation with t-wise coverage for most of the FMs considered in this study. This clearly suggests that an improvement in similarity metrics will potentially foster an increase in t-wise coverage, and it thus makes sense to apply a search-based method to optimize similarity metrics so as to indirectly increase, as much as possible, the t-wise coverage. Interpreting the rationale of similarity-based testing using correlation analysis forms the first contribution of this paper.
Furthermore, we apply, for the first time, the novelty search (NS) algorithm [49], [50], [51] to the similarity-based testing of SPLs. Unlike classical objective-oriented evolutionary algorithms, the NS algorithm abandons objectives and promotes evolution through the search for "novelty" in the behaviour (or decision) space, where "novelty" means being different from previously discovered individuals. More precisely, instead of rewarding performance based on an objective, NS rewards performance based on a novelty metric, which measures the uniqueness of an individual with respect to the rest of the population and an archive of past individuals [50]. This way, it creates a constant pressure to do something novel, and is able to generate a set of diverse individuals. NS has been successfully applied to a number of domains, e.g., maze navigation and biped locomotion [50] and the evolution of plastic neural networks [52]. It has recently been pointed out by Romero et al. [53] that NS seems to fit well the precepts of search-based software engineering (SBSE). However, there has been only a very limited number of studies on the application of this algorithm to the SBSE field. Boussaa et al. [54] applied the NS algorithm to the test data generation problem based on statement coverage criteria. As stated by Boussaa et al. [54], that work is the first in the literature to adopt NS for generating test data. López-López et al. [55] used, for the first time, NS for software improvement by combining it with a genetic improvement system. Recently, Romero et al. [53] have explored the use of NS for the next release problem [56]. In this paper, we apply NS to the similarity-based testing of SPLs because we notice that NS perfectly matches the goal of this testing technique, i.e., searching for a set of diverse products.
In particular, the suitability of the proposed NS is evaluated on 31 FMs, either real or artificial, with respect to t-wise coverage. To investigate the scalability of the proposed method, we consider small-, moderate- and large-scale models and five values of t (i.e., t = 2, . . . , 6), following the common practice in [4]. Moreover, the NS algorithm is comprehensively compared with three highly related approaches [4], [7], all of which scale well to large FMs and high interaction strengths. Finally, we investigate how NS is affected by the way new configurations are generated and by its key parameters.
In summary, this paper provides the following contributions:
1) We perform a correlation analysis to explore the internal relationship between similarity metrics and t-wise coverage, observing that similarity metrics have a significantly positive correlation with t-wise coverage. This finding, which has not been revealed before, is important as it forms the theoretical foundation of similarity-based testing (as a surrogate for t-wise testing).
2) We demonstrate that NS is inherently an ideal tool for the similarity-based testing of SPLs. We show that, compared with the fitness function used in GA [4], the novelty score adopted by NS has a (much) stronger positive correlation with t-wise coverage in most cases. This potentially implies the superiority of NS over GA in achieving higher t-wise coverage. According to our experiments, NS indeed performs better than GA, as well as other related algorithms.
3) In search-based approaches, the unpredictable way [4], implemented by the randomized SAT4J solver, forms a state-of-the-art strategy for generating new products in the SPL field [8], [57], [58], [59], [60], [61]. In this paper, we empirically demonstrate that combining two types of satisfiability solvers (i.e., the randomized SAT4J [62] + probSAT [63]) in a similar way as in [59], [60] can be more effective than the unpredictable way when generating new products on some large-scale real-world FMs, such as the Linux kernel model (2.6.28.6-icse11) [29]. We consider this as one of the contributions because it provides a powerful alternative for generating new products in the domain of SPL testing.
The remainder of this paper is organized as follows. Section 2 presents concepts and notations used in this work. Section 3 details the proposed method, and Section 4 reports on the empirical study. In Section 5, we discuss threats to validity before reviewing related work in Section 6. Finally, Section 7 concludes the paper and outlines possible research lines for future studies.

CONCEPTS AND NOTATIONS
In this section we present background concepts and notations used in this paper.

Feature model configurations as test cases
This paper focuses on model-based testing of SPLs in which an FM [2] is used as the variability model. An FM is a tree-like structure encompassing a set of features and constraints among them. Feature modeling is a popular way of documenting the commonality and variability of an SPL in both academic and industrial communities [4], [10], [64]. Each constraint of an FM corresponds to a clause if the feature model is transformed into a Boolean formula in conjunctive normal form (CNF) [65].
In this context, we define a configuration of an FM as a set of selected/deselected features. Formally, we have the following definition. In the program, a configuration is encoded as a binary vector, each element of which takes either 1 (selected) or 0 (deselected). Notice that a configuration can also be called a product in SPL terminology, or an individual (or a solution/point) in evolutionary computation (EC) terminology. In essence, these terms refer to the same thing, and thus they are used interchangeably in this paper.
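For instance, the binary-vector encoding and its set view can be sketched as follows (the feature names are hypothetical):

```python
# Hypothetical feature list; a configuration is a binary vector over it.
features = ["base", "bluetooth", "gps", "camera"]
config = [1, 0, 1, 1]  # 1 = selected, 0 = deselected

# Equivalent set view of the same configuration (used later by
# set-based distance metrics such as the Jaccard distance):
selected = {f for f, bit in zip(features, config) if bit == 1}
print(sorted(selected))  # ['base', 'camera', 'gps']
```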

Definition 3. A test suite of an SPL is a list of valid configurations TS = (CF 1 , CF 2 , . . . , CF m ).
In this context, the valid configuration CF i is called a test case.

T-wise testing and coverage
The t-wise testing for SPLs focuses on the combinations of t ≥ 2 features of an SPL [19], [66]. It considers all the possible interactions between selected and deselected features. We list the following definitions related to t-wise testing.

Definition 5. A t-set is called a valid t-set if it satisfies the constraint C of the FM. A t-set that does not satisfy C is said to be invalid.
A valid t-set, ts, is covered by a configuration CF if ts ⊆ CF . Note that invalid t-sets need not be covered [7].
The t-wise coverage of a test suite TS is defined as Coverage(TS) = |VT CF1 ∪ · · · ∪ VT CFm | / |VT f m |, where VT f m is the set of all the valid t-sets of the given FM, VT CFi denotes the set of t-sets covered by the configuration CF i , and | · | returns the cardinality of a set. For simplicity, we also refer to t-wise coverage as coverage.
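The coverage computation can be sketched as follows (a simplified Python illustration; it enumerates the t-sets of each configuration explicitly, which is only feasible for small examples, and all identifiers are ours):

```python
from itertools import combinations

def tsets_of(config, t):
    """All t-sets covered by a configuration, where the configuration
    is given as a dict mapping feature -> True (selected) / False."""
    return {frozenset(c) for c in combinations(config.items(), t)}

def twise_coverage(test_suite, valid_tsets, t):
    """Ratio |union of valid t-sets covered by the suite| / |VT_fm|."""
    covered = set()
    for cf in test_suite:
        covered |= tsets_of(cf, t) & valid_tsets
    return len(covered) / len(valid_tsets)
```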

THE NOVELTY SEARCH FOR SPL TESTING
The NS algorithm for test case generation in SPL testing, as outlined in Algorithm 1, aims at looking for a specified number of diverse configurations within a specified amount of time. The key idea behind this algorithm is to use "novelty" for the selection of promising configurations. The following control parameters need to be specified before the algorithm begins. The first two parameters are common in search-based algorithms, while the remaining two are unique to the NS algorithm.
• The archive size (N ) specifies the number of configurations to be returned. The setting of this parameter depends largely on the demands of software testers.
• The execution time allowed (max_t) serves as the termination condition of the algorithm. Similar to N , this parameter is preset by software testers.
• The repair probability (P r ) is introduced when generating new configurations. A random configuration is repaired, with probability P r , using the probSAT solver [63]. Proper values of this parameter will be experimentally investigated later in Section 4.4.1.
• The neighbor size (N b ) specifies the number of neighbors to be considered when evaluating the "novelty" of a configuration. Empirical studies will be conducted in Section 4.4.2 to tune this parameter.
Notice that the NS algorithm allows software testers to flexibly specify the number of configurations as well as the execution time. Such flexibility is desirable for a sampling algorithm to meet a given testing budget [4]. In the following subsections, we give details on the main algorithmic components.
Algorithm 1 NS algorithm for SPL testing
Input: archive size (N ), execution time allowed (max_t), repair probability (P r ) and neighbor size (N b )
Output: archive A
1: Initialize the archive A by generating N configurations in an unpredictable way as in [4]
2: while the elapsed time is less than max_t do
3:   Generate a random configuration c
4:   if c is not valid then
5:     if rand(0, 1) < P r then
6:       Repair c using the probSAT solver [63]
7:     else
8:       Replace c by an unpredictable configuration got from the randomized SAT4J solver
9:     end if
10:  end if
11:  Calculate the novelty score ρ(x), where x ∈ A ∪ {c}
12:  c worst ← the worst member in A concerning the novelty score // c worst has the minimum novelty score
13:  if ρ(c) > ρ(c worst ) then
14:    c worst ← c // Replace the worst archived member by c
15:  end if
16: end while
17: return A

Initialization of the archive
According to Line 1 of Algorithm 1, A is initialized with N valid configurations produced in an unpredictable way [4]. Since an FM can be converted into a Boolean formula [65], an off-the-shelf satisfiability (SAT) solver, such as SAT4J [62] (https://www.sat4j.org/), can be naturally used to generate valid products. However, the internal order used by the solver to parse the logical clauses and literals typically produces the same solution in a deterministic way, resulting in a loss of diversity among configurations. To overcome this issue, Henard et al. [4] proposed to randomize the order in which the logical clauses and literals are parsed by the solver. In the implementation, they used the randomized SAT4J solver [62]. This way, it is impossible to predict the next product to be returned. In other words, products are generated in an unpredictable way. Experimental results have shown that this unpredictable way can significantly improve the diversity of the configurations [4], [58].
Being simple and effective, the above unpredictable way forms a state-of-the-art strategy to sample configurations from the space of all the valid products [57]. In fact, this strategy has been widely used to generate configurations in the context of both optimal software product selection [58], [59], [60] and software product line testing [8], [61].
To initialize the archive A, we also adopt this unpredictable way to generate configurations. Notice that, to further improve diversity, the archive only allows the entry of configurations that are different from the already archived ones.
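The duplicate-free initialization can be sketched as follows, abstracting the unpredictable generation (i.e., the randomized SAT4J call) as an arbitrary callable; all names are ours:

```python
import random

def init_archive(sample_config, n):
    """Fill the archive with n pairwise-distinct configurations;
    sample_config stands in for a randomized SAT-solver call."""
    archive, seen = [], set()
    while len(archive) < n:
        c = tuple(sample_config())
        if c not in seen:  # only distinct configurations may enter
            seen.add(c)
            archive.append(list(c))
    return archive

# Demo with purely random bit vectors in place of a SAT solver:
random.seed(1)
archive = init_archive(lambda: [random.randint(0, 1) for _ in range(4)], 5)
```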

Generation of new configurations
In the iteration process (Lines 3-15 of Algorithm 1), the procedure tries to update A by generating a new valid configuration each time. To this end, we first produce a random configuration c, with each feature being selected or deselected, each with probability 0.5. Due to constraints among features, the randomly generated configuration is very likely to be invalid. In this case, it is either repaired by the probSAT solver [63] with probability P r (see Line 6), or replaced by an unpredictable configuration obtained from the randomized SAT4J solver [62] (see Line 8).
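Lines 3-10 of Algorithm 1 can be sketched as follows, with the validity check, the probSAT repair and the randomized-SAT4J call abstracted as callables (all names are ours, not from the paper's implementation):

```python
import random

def new_configuration(n_features, is_valid, repair, unpredictable, p_r):
    """Generate a random configuration; if invalid, repair it with
    probability p_r, otherwise replace it by an unpredictable one."""
    c = [1 if random.random() < 0.5 else 0 for _ in range(n_features)]
    if not is_valid(c):
        if random.random() < p_r:
            c = repair(c)        # probSAT-based repair
        else:
            c = unpredictable()  # randomized SAT4J replacement
    return c
```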

Repair configurations using probSAT
The procedure of probSAT [63], an SAT solver based on stochastic local search (SLS) [68], is given in Algorithm 2. Essentially, it iteratively flips a selected variable (by changing its value from 0 to 1, or vice versa) until the number of flips reaches the maximum value, max_flips. An invalid configuration c violates at least one clause. As shown in Line 3 of Algorithm 2, we randomly pick one of these falsified clauses, denoted by cls. Next, the procedure works out the break value of each variable v in cls (Line 5 in Algorithm 2). The break value of a variable is defined as the number of satisfied clauses that would become falsified after flipping this variable [69]. It is clear from this definition that it would generally be better to flip the variable with the smaller break value. To efficiently compute this value, we adopt the fast procedure proposed in [69]. For details, please refer to the original study.
Subsequently, as shown in Line 6, break (v) is transformed by the following polynomial function:

f (v) = (ϵ + break (v))^(−c_b)    (1)

where ϵ = 1 and c b > 1.0. Clearly, f (v) monotonically decreases with respect to break (v). This means that we tend to choose the variables with large f (v).

4. In addition to SLS-style solvers, another mainstream of high-performance algorithms for satisfiability solving is conflict-driven clause learning (CDCL) solvers [67], such as SAT4J. Different from CDCL-style solvers, which need to methodically traverse the whole search space, SLS-style solvers are typically greedy algorithms aiming at quickly satisfying as many clauses as possible. Therefore, SLS-style solvers are in general computationally efficient. However, they have no guarantee of finding a satisfying solution for each solver call.
5. According to [63], break (v) can be transformed by other types of functions. The polynomial function is chosen in this study primarily because of its simplicity and effectiveness [63].

Algorithm 2 Procedure of the probSAT solver
Input: an invalid configuration (c), maximum number of flips (max_flips)
Output: the repaired c
1: n ← 0 // the number of flips
2: while n ≤ max_flips do
3:   Randomly pick a falsified clause cls with respect to the configuration c
4:   for each variable v in cls do
5:     Compute the break value of v, i.e., break (v)
6:     Compute f (v) according to (1)
7:   end for
8:   var ← a variable in cls chosen with probability f (var) / Σ_{v∈cls} f (v)
9:   Flip the variable var in c
10:  n++
11: end while
12: return c

Finally, according to Line 8 of Algorithm 2, the variable var to be flipped is chosen with probability f (var) / Σ_{v∈cls} f (v). This calculation of probability ensures that variables with larger f values are given more chances to be selected. The chosen variable is then flipped, and the counter n is increased by 1 (Lines 9 and 10).
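This probabilistic variable selection can be sketched as roulette-wheel selection over the f values (a Python sketch; identifiers are ours):

```python
import random

def probsat_pick(break_values, cb=2.165, eps=1.0):
    """Return the index of the variable to flip, chosen with
    probability f(v) / sum(f), where f(v) = (eps + break(v))**(-cb)."""
    f = [(eps + b) ** (-cb) for b in break_values]
    r = random.random() * sum(f)
    acc = 0.0
    for i, fi in enumerate(f):
        acc += fi
        if r <= acc:
            return i
    return len(f) - 1  # guard against floating-point round-off
```

Variables with small break values dominate the selection, which is exactly the greedy bias intended by probSAT.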
It should be mentioned that repairing SPL configurations using probSAT has already been explored in our previous work [60] in the context of multi-objective software product selection from SPLs. It has been shown that probSAT is more effective than WalkSAT [70], another popular SLS-style solver, in searching for dissimilar products. Since this fits well with our goal in this study (generating SPL configurations as dissimilar as possible), we choose probSAT instead of WalkSAT. Moreover, the possibility of using SLS-style solvers to repair SPL configurations has not yet been explored in the context of SPL testing. We make the first step in this regard. Most importantly, as will be demonstrated later, the introduction of the probSAT-based repair operator can significantly improve the coverage on some FMs, such as the Linux kernel model (2.6.28.6-icse11) [29], which is the largest FM widely studied before.
In the detailed implementation, max_flips is set to 4,000, following the common practice in [59], [60]. The parameter c b in (1) is set to 2.165, a value suggested by the developers of probSAT. Since we use standard values from the literature for the two parameters, a tuning phase is not required in this case.

Replacement based on randomized SAT4J solver
In case the random configuration c is invalid, it will be replaced, with probability 1 − P r , by an unpredictable configuration obtained from the randomized SAT4J solver [4]. As discussed previously in Section 3.1, the randomized solver enables an exploration of the valid search space in an unpredictable way, and is thus capable of improving the diversity of configurations. To implement the randomized solver, following the common practice in [4], we primarily randomize the order in which the assignments (true or false) to literals are instantiated. More specifically, literals are assigned either true (i.e., 1) or false (i.e., 0), each with probability 0.5. This prevents the generation of configurations biased towards either the true or the false assignment.

Evaluation of novelty
According to Line 11 in Algorithm 1, the generated configuration c (after repair or replacement) and all the archived members are jointly evaluated for novelty. For any x ∈ A ∪ {c}, the novelty score of x is given by

ρ(x) = (1 / N b ) Σ_{i=1}^{N b} d(x, µ i )    (2)

where N b is the neighbor size, and µ i is the i-th nearest neighbor of x with respect to the following Jaccard distance [71]:

d(x, µ i ) = 1 − |x ∩ µ i | / |x ∪ µ i |    (3)

where | · | denotes the cardinality of a set. Since both x and µ i are sets according to Definition 2, d(x, µ i ) is a set-based distance. Clearly, the distance varies between 0 and 1. In particular, a distance of 0 indicates two identical configurations, while a distance equal to 1 suggests that the two considered configurations are totally different. In fact, this distance metric was adopted in [4] to measure the degree of similarity between two configurations. Given its good performance in the prior work [4], we choose it in our study as a measure of behavioral difference between two configurations in the search space. According to (2), the novelty score is defined as the average distance to the N b nearest neighbors of a configuration, where N b is a fixed parameter. In effect, the novelty score estimates the sparseness of a point in the decision space: if the score of a given point (or configuration) is large, then it lies in a sparse area; in contrast, a small novelty score indicates a dense area.
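Treating configurations simply as sets (a simplification of Definition 2, which includes selected and deselected features), the Jaccard distance and the novelty score can be sketched as:

```python
def jaccard_distance(x, y):
    """d(x, y) = 1 - |x intersect y| / |x union y| on configurations
    viewed as sets."""
    return 1.0 - len(x & y) / len(x | y)

def novelty_score(x, pool, nb):
    """Average Jaccard distance from x to its nb nearest neighbors
    in pool (x itself is excluded)."""
    dists = sorted(jaccard_distance(x, o) for o in pool if o is not x)
    return sum(dists[:nb]) / nb
```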
It is important to note that the novelty score measures how unique an individual's behaviour is with respect to the behaviours of both the archived individuals and the current one, which represents the most recently visited point [50]. This is the reason why we consider the set A ∪ {c}, instead of A, when evaluating the novelty of an individual.

Reward novel individuals
As shown in Line 12 of Algorithm 1, the worst member in A, c worst , is identified by comparing novelty scores. After that, c worst is replaced by the newly generated c if ρ(c) > ρ(c worst ). This operation simply aims at rewarding "novel" individuals, creating a constant pressure to do something new. Indeed, a new individual that is far away from its predecessors takes the place of the archived one with the least novelty. That way, the search is driven toward unexplored regions, enabling a diverse exploration of the search space.
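The archive update of Lines 12-15 can be sketched as follows (a Python sketch with our own names; the novelty function is passed in, so any score over A ∪ {c} can be used):

```python
def try_update_archive(archive, c, novelty):
    """Replace the least novel archived member with c if c is more
    novel; novelty(x, pool) scores x against the joint pool A + [c]."""
    pool = archive + [c]
    worst = min(range(len(archive)), key=lambda i: novelty(archive[i], pool))
    if novelty(c, pool) > novelty(archive[worst], pool):
        archive[worst] = c
    return archive
```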
It should be noted that no explicit objective is used in the NS algorithm. Instead, by attempting to maximize the novelty metric defined in the decision space, the algorithm tends to generate dissimilar configurations. This is precisely what we pursue in SPL testing.

Why Novelty Search?
As discussed in [53], NS is well suited to SBSE problems. Going one step further, we demonstrate in this paper that NS is an ideal tool for the similarity-based testing of SPLs for the following reasons.

• In SPL testing, a set of test cases is required to cover as many valid t-sets as possible. The NS algorithm maintains an archive which stores the N most novel individuals found during the whole search process. These individuals can directly serve as test cases for the SPLs.
• The results presented in [4] suggest that two dissimilar configurations are more likely to cover a greater number of valid t-sets than two similar ones. Therefore, the goal of similarity-based SPL testing is essentially to find a set of configurations as diverse (or dissimilar) as possible. In fact, different from fitness-oriented evolutionary algorithms, which drive individuals towards peaks of fitness, NS is designed precisely for this goal, since it constantly searches for "novel" (a synonym for "diverse") individuals. This way, the diversity of the population can be naturally improved.
• Previous works [4], [9] on SPL testing implicitly adopt the idea of NS, i.e., searching for diverse configurations using heuristics like GA to optimize a similarity-based fitness function. In this paper, we explicitly use NS because of its good theoretical properties, e.g., behaving like a uniform random search process in the behavior space [51] and creating a pressure for high evolvability even in bounded behavior spaces [72]. These properties are very useful in guiding the search towards diverse configurations in the testing of SPLs.
Experiments presented in the next section will show that NS can indeed obtain promising results in testing SPLs with respect to the t-wise coverage.

EMPIRICAL STUDY
The empirical study conducted in this section aims at answering the following four research questions. The first research question is a foundational question in similarity-based testing of SPLs. Indeed, if similarity metrics had no relationship with t-wise coverage, then it would be meaningless to optimize similarity metrics in order to achieve a decent t-wise coverage. To address RQ1, a correlation analysis is performed to investigate the internal relationship between two similarity metrics (i.e., the similarity-based fitness function [4] and the novelty score) and t-wise coverage. The second research question amounts to evaluating the NS algorithm in comparison with some state-of-the-art approaches. We expect NS to provide a t-wise coverage better than or close to the state of the art. The third research question aims at comparing different ways of generating new configurations in similarity-based SPL testing. Currently, the unpredictable way proposed by Henard et al. [4] forms a state-of-the-art strategy for generating new individuals. However, it remains unknown whether this strategy can be further improved. Finally, the fourth research question seeks to provide useful guidelines for tuning parameters of the NS algorithm. Since this is the first work that adopts NS for the testing of SPLs, it is of practical importance to give some suggestions on the setting of its key parameters.
The experiments are performed on a Quad Core @ 2.20 GHz machine with 8 GB of RAM. As shown in Table 1, this study employs 31 FMs, which are divided into three categories. The first category is composed of 12 FMs with fewer than 200 features; they are referred to as small-scale FMs. The second category consists of 13 models (with more than 200 but at most 1,000 features), referred to as moderate-scale FMs. The third category contains 6 models of large size (beyond 1,000 features); they are referred to as large-scale FMs. For each FM, Table 1 presents its name, the number of features, the number of CNF constraints, the number of free features and the number of valid 2-sets.
A free feature is a feature whose assignment is undetermined, while a fixed feature is either mandatory or dead. A mandatory feature must be present in every valid configuration, whereas a dead feature cannot be included in any. It is the number of free features that determines the size of the search space. According to Table 1, the number of free features ranges from 18 to 16,664.
For small- and moderate-scale FMs and for t = 2, we enumerate all the possible t-sets and ask a SAT solver to determine whether each of them is valid. For the large FMs, it is very time-consuming to obtain all the valid t-sets even for t = 2 [4]. To compute the t-wise coverage in this case, we instead randomly generate a set of 10,000 valid t-sets, which can be seen as a sample of the whole space of valid t-sets. Since the number of valid t-sets explodes as t increases, we likewise sample 10,000 valid t-sets for all FMs when t ≥ 3. The number of valid 2-sets is listed in the last column of Table 1.
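To make the enumeration step concrete, the sketch below walks over all candidate 2-sets and keeps those a SAT query deems valid. Here `is_satisfiable` is a hypothetical callable standing in for the solver query; the actual tooling in this work is a SAT solver stack, not shown here.

```python
from itertools import combinations, product

def enumerate_2_sets(features, is_satisfiable):
    """Enumerate all candidate 2-sets (pairs of feature decisions) and
    keep those that a SAT query deems valid.  `is_satisfiable` is a
    hypothetical stand-in for the solver call."""
    valid = []
    for f, g in combinations(features, 2):
        # each pair yields four decisions: (sel, sel), (sel, unsel), ...
        for sf, sg in product((True, False), repeat=2):
            if is_satisfiable({f: sf, g: sg}):
                valid.append(((f, sf), (g, sg)))
    return valid
```

For large FMs this loop becomes prohibitive, which is precisely why a random sample of 10,000 valid t-sets is used instead.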
Finally, notice that most of the FMs were chosen by Henard et al. [4] in their empirical study, and they are taken from either the SPLOT repository [73] (http://www.splot-research.org/) or the LVAT repository [29] (http://code.google.com/p/linux-variability-analysis-tools). In this work, we add three moderate FMs, i.e., E-shop from SPLOT, and toybox and axTLS from LVAT, as well as two large FMs, i.e., Automotive01 and Automotive02 [25], [31]. The two large FMs are closed-source product lines from the automotive industry, and are well suited as subjects for examining the scalability of t-wise testing algorithms [25]. Among the 31 FMs, 20 are real, while 11 are artificially generated by the SPLOT FM generator [73], with the prefix "SPLOT-Generated-FM" in their names.

Correlation analysis between similarity metrics and t-wise coverage (RQ1)
In this section, we perform a correlation analysis to investigate the relationship between two similarity metrics and the t-wise coverage. We first briefly introduce the two similarity metrics, and then present results of the correlation analysis. Finally, answers to RQ1 are given.

Similarity metrics under study
Two similarity metrics are considered. Given a test suite with N configurations, i.e., TS = {CF_1, . . . , CF_N}, both similarity metrics map a test suite to a positive real value. Formally, the first one, given in Eq. (4), is the similarity-based fitness function [4].
The second one generalizes the calculation of the novelty score from a single configuration to a test suite TS: it aggregates ρ(CF_i), the novelty score of a single configuration (see Eq. (2)), over all configurations in TS. Intuitively, according to [4], the higher the value of either similarity metric for a given TS, the larger the distances among its configurations, leading to a potentially larger t-wise coverage. However, this has never been rigorously verified. Since it may be very difficult to prove the above argument directly, we instead demonstrate it statistically by performing a correlation analysis. To this end, we generate 100 samples for each FM, with a sample being a test suite of 100 configurations. Note that these configurations are generated using the unpredictable strategy suggested by Henard et al. [4]. For each sample, we calculate its t-wise coverage, as well as the values of the two similarity metrics. This way, we obtain 100 pairs of data for each FM and each similarity metric, enabling us to perform a correlation analysis to observe how, and to what extent, the t-wise coverage is correlated with the similarity metrics.

Results of the correlation analysis
The correlation analysis returns the correlation coefficient r and the p-value for testing the null hypothesis that there is no relationship between the two random variables. The coefficient r measures the linear dependence between two variables and ranges from -1 to 1, with -1 representing a perfect negative correlation, 0 no correlation, and 1 a perfect positive correlation. The p-values range from 0 to 1, where values smaller than the significance level α (0.05 by default) indicate that the corresponding correlation is considered significant. Table 2 summarizes the results of the correlation analysis performed on small-scale FMs. As can be seen, in almost all the cases both similarity metrics show a positive correlation with the t-wise coverage. Moreover, most of the correlations are statistically significant since p < 0.05. In particular, the values of p are equal or very close to 0 in most cases. Going one step further, we work out the percentage of cases in which significant correlations are observed: 83.3% for the similarity-based fitness, and 81.7% for the novelty score. From this perspective, the two metrics show no obvious difference. However, the novelty score shows better (larger) r values than the similarity-based fitness in 63.3 percent of all the cases, and worse in only 36.7 percent. This implies that the novelty score is more suitable than the similarity-based fitness as a surrogate for the t-wise coverage.
In a similar way, we perform a correlation analysis on all the moderate-scale FMs, and tabulate the results in Table 3. Regarding the novelty score, all p-values are smaller than 0.05. Regarding the similarity-based fitness, the corresponding correlations are significant on all the moderate-scale FMs except toybox. Moreover, the r values show that the novelty score has a (much) stronger positive correlation with the t-wise coverage than its counterpart on all the feature models, regardless of the value of t. Notice that the r values for the novelty score are larger than 0.9 in a number of cases, which suggests that the novelty score and the t-wise coverage are almost linearly dependent.
The correlation analysis results on large-scale FMs are presented in Table 4, where we find that both similarity metrics, just as in the moderate-scale case, have significantly positive correlations with the t-wise coverage on all large-scale FMs for all values of t, except Automotive02 for t = 4. In that case, the p-value for the novelty score is slightly larger than 0.05. Moreover, as clearly reflected by the r values, the novelty score has a (much) stronger correlation with the t-wise coverage than the similarity-based fitness.

RQ1 summary
The correlation analysis performed in this section leads to the following conclusions. First, both the similarity-based fitness and the novelty score have, in general, a significantly positive correlation with the t-wise coverage. Moreover, this relationship is affected neither by the size of the feature models nor by the t-wise strength. These findings explain why similarity can be used as a surrogate metric for the t-wise coverage and, more importantly, place similarity-based SPL testing on a solid theoretical footing. Second, the novelty score has, in general, a (much) stronger positive correlation with the t-wise coverage than the similarity-based fitness suggested by Henard et al. [4]. This suggests that the novelty score should be more effective than the similarity-based fitness in guiding the search towards a set of configurations with a decent t-wise coverage. In the forthcoming subsection, we empirically verify this by comparing NS with three closely related approaches, one of which is Henard's GA algorithm built on the similarity-based fitness function.

Comparison with state-of-the-art algorithms (RQ2)
This section demonstrates the effectiveness of NS through a comparative study. We first briefly introduce the algorithms under examination, then specify the experimental setup, and finally present the experimental results and summarize the conclusions.

Algorithms under comparison
We compare NS with three state-of-the-art algorithms, i.e., Unpredictable [4], GA [4] and SamplingDown [7]. We choose these algorithms because: 1) they are all dedicated to similarity-based testing of SPLs, and 2) they are all able to scale to large FMs and high t-wise strengths.

• The Unpredictable approach, suggested by Henard et al. [4], uses the randomized SAT4J solver (see Section 3.2.2) to generate N configurations in an unpredictable way. This approach was chosen in [4] as a comparison baseline because of its good scalability and high efficiency.

• The GA algorithm, also proposed by Henard et al. [4], may be the first to use a similarity-based heuristic to generate and prioritize product configurations, especially for large SPLs. The algorithm starts with an initial population of N configurations generated in the unpredictable way. They are then prioritized using the global maximum distance prioritization method, and evaluated using the similarity-based fitness function defined by Eq. (4). The next step attempts to replace the worst configuration with an unpredictable one obtained from the randomized SAT solver. The replacement is accepted if and only if the similarity-based fitness increases. This step is repeated until the termination condition is satisfied. According to Henard et al. [4], the technique can be seen as a (1+1) genetic algorithm without crossover.

• SamplingDown is adapted from the similarity-based prioritization proposed by Al-Hajjaji et al. [7]. The original approach, applicable to any product sample, focuses solely on ordering products so as to increase the probability of detecting faults faster. In this paper, we extend it into an approach that samples down from a large set of unpredictable configurations. More specifically, the algorithm begins with a large initial set of configurations, from which candidates are selected one by one based on similarity. First, the product with the maximum number of features (the all-yes-config [75]) is selected. Then, we repeatedly choose the next product as the one with the minimum similarity to the already selected products. The process continues until N products are selected (N is often specified by software testers according to their test budgets).
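The selection loop can be sketched as below. Reading "minimum similarity with the already selected products" as a maximum-of-minimum-distances rule is our assumption for illustration; `dist` is any caller-supplied distance between configurations.

```python
def sampling_down(pool, n, dist):
    """Greedy down-sampling sketch: start from the configuration with
    the most features (the all-yes-config), then repeatedly add the
    pool member that is farthest from the selected set."""
    pool = list(pool)
    selected = [max(pool, key=len)]  # all-yes-config first
    pool.remove(selected[0])
    while pool and len(selected) < n:
        # pick the candidate whose closest selected product is farthest
        nxt = max(pool, key=lambda c: min(dist(c, s) for s in selected))
        selected.append(nxt)
        pool.remove(nxt)
    return selected
```

With a pool of 1,000 unpredictable configurations (as in our setup) and n = 100, this loop returns the 100 mutually most dissimilar products.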

Experiment setup
The experimental setup used in this section is summarized as follows.

• The number of returned configurations is 100 for all the algorithms, following the practice in [4].
• Each algorithm is independently run 30 times, and we present and analyze the experiment results regarding mean values and standard deviations of the obtained t-wise coverage.
• For both NS and GA, the termination of the algorithms can be flexibly controlled by specifying the maximum running time allowed (i.e., max t). Both algorithms are allowed to run 6 seconds for small FMs, 30 seconds for moderate FMs, and 600 seconds for large FMs. These settings follow the practice in [59]. For Unpredictable and SamplingDown, the termination cannot be manually controlled; instead, they terminate automatically once 100 configurations have been generated/selected.

• In the NS algorithm, the control parameters are set as follows: P_r = 0.1 and N_b = 20. The tuning of these two parameters is discussed later in Section 4.4. For the SamplingDown algorithm, the number of initial configurations is set to 1,000. The remaining two algorithms involve no control parameters.

Experiment results

Table 5 presents mean values and standard deviations of the t-wise coverage obtained on the small-scale FMs. To determine whether the difference between NS and each of the peer algorithms is significant, we perform a Mann-Whitney U test at the 0.05 significance level, following the guidelines suggested by Arcuri and Briand [76]. In the tables, symbols indicate whether NS performs better than, equivalently to, or worse than the corresponding algorithm. Furthermore, the best mean result for each FM and each value of t is indicated by a dark-gray background, and the second-best result by a light-gray background.
According to the summary at the bottom of Table 5, NS is the best-performing algorithm on small FMs, obtaining the best/second-best results in about 59/60 ≈ 98% of the cases. It is followed by SamplingDown, which performs either best or second best in 35/60 ≈ 58% of the cases. Competitive with SamplingDown, GA obtains promising results in 31/60 ≈ 52% of the cases. Finally, Unpredictable, the least effective algorithm, performs best in only two cases, and second best in only four. For moderate FMs, as shown in Table 6, NS is still the best algorithm, dramatically outperforming the other three. In particular, NS obtains the best average t-wise coverage in all 65 cases except E-shop for t = 2 and t = 3, and SPLOT-Generated-FM-1000-3 for t = 2. GA ranks second in overall performance on moderate-scale FMs. As for the large-scale FMs, Table 7 shows that NS and GA are highly competitive with each other, obtaining the best/second-best results in 29/30 ≈ 97% and 28/30 ≈ 93% of the cases, respectively. Clearly, both algorithms are overwhelmingly better than SamplingDown and Unpredictable, which obtain the second-best results in only one and two cases, respectively.
Going one step further, we compute the percentage of cases in which NS performs better than, equivalently to, and worse than each peer algorithm over all the FMs and all the t values considered (155 cases in total). As shown in Table 8, NS shows a significant improvement over GA, SamplingDown and Unpredictable in about 77%, 89% and 94% of the cases, respectively. Conversely, NS performs significantly worse than these three algorithms in about 5%, 1% and 0% of the cases, respectively. These results imply that the algorithms rank as follows: NS > GA > SamplingDown > Unpredictable.

TABLE 5: Mean values and standard deviations (in brackets) of the t-wise coverage on small-scale FMs for t = 2, . . . , 6. The best and the second-best results are indicated by a dark-gray and a light-gray background, respectively.
Since both NS and GA are search-based approaches, one may wonder how the performance (regarding the t-wise coverage) changes during the search process. To this end, we record and plot curves of the t-wise coverage with respect to the running time on some representative small, moderate and large FMs. As shown in Fig. 1, the t-wise coverage for NS is better than (or at least comparable to) that for GA throughout the search process on all the selected FMs, regardless of the value of t. In particular, NS shows an obviously larger t-wise coverage than GA in most cases, and a t-wise coverage close to GA's on several large FMs, e.g., Automotive01, SPLOT-Generated-FM-5000 and Automotive02. As observed, the t-wise coverage, in the majority of the cases, tends to increase as the search proceeds. However, the curve of the t-wise coverage for GA clearly declines on CounterStrikeSimpleFeatureModel for t = 5 and t = 6, and on ecos-icse11 for t = 6. This implies that the search via GA may hamper the t-wise coverage in some cases. One of the most likely explanations of this phenomenon is that the fitness function adopted by GA gives an improper measure of similarity [7], [9]. Finally, notice that the time budget for testing is usually limited in practice [4], [9]. Therefore, products should be prioritized such that the t-wise coverage provided by these configurations is achieved faster, potentially increasing the probability of improving the fault detection ratio [7].

TABLE 6: Mean values and standard deviations (in brackets) of the t-wise coverage on moderate-scale FMs for t = 2, . . . , 6. The best and the second-best results are indicated by a dark-gray and a light-gray background, respectively.

TABLE 7: Mean values and standard deviations (in brackets) of the t-wise coverage on large-scale FMs for t = 2, . . . , 6. The best and the second-best results are indicated by a dark-gray and a light-gray background, respectively.
How do the four algorithms perform in terms of prioritizing products? To answer this question, we plot in Fig. 2 the average t-wise (t = 6) coverage with respect to the number of used configurations on nine representative FMs. For instance, on SmartHomev2.2, NS covers around 70 percent of the 6-sets on average (over 30 runs) with 50 percent of the configurations. According to [4], the area under the curve reflects the effectiveness of the prioritization: the larger this area, the better the prioritization. As observed, NS performs either (significantly) better than or similarly to its competitors on these FMs. It must be noted that both GA and SamplingDown use a similarity-based prioritization approach to order products. In contrast, our NS algorithm does not explicitly consider prioritization, because we mainly focus on product sampling in this paper (certainly, any prioritization approach can be applied to the resulting samples). Despite this, NS is still effective with respect to prioritization. A likely explanation is that the configurations generated by NS are already of high quality: even in their default order, they achieve a large t-wise coverage as early as possible.
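The area-under-the-curve criterion from [4] can be sketched as follows; `sets_of` is a hypothetical callable returning the valid t-sets covered by a single configuration, and the discrete average of the cumulative-coverage curve serves as a simple proxy for the area.

```python
def prioritization_auc(ordered_suite, sets_of, all_valid):
    """Average cumulative t-wise coverage as configurations are consumed
    in the given order; larger values indicate better prioritization."""
    covered, curve = set(), []
    for cfg in ordered_suite:
        covered |= sets_of(cfg)                 # t-sets gained by this config
        curve.append(len(covered) / len(all_valid))
    return sum(curve) / len(curve)
```

An ordering that front-loads dissimilar configurations pushes the curve up early and therefore yields a larger value than the same suite in an unfavorable order.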

RQ2 summary
The following are some conclusions drawn from the above numerical results.
• Generally, the two search-based approaches, i.e., NS and GA, are more effective than the two sample-based approaches, i.e., SamplingDown and Unpredictable. This observation can be explained by the results of the correlation analysis performed in Section 4.1, which show that the similarity metrics used in NS and GA have a significantly positive correlation with the t-wise coverage in most cases. It is therefore not surprising that the two search-based approaches beat the sample-based ones, because improving the similarity metrics will potentially increase the t-wise coverage. In contrast, the two sample-based algorithms generate a certain number of configurations in an unpredictable way, without exploring sufficiently many configurations. As a result, there is not enough selection pressure towards configurations that contribute to the t-wise coverage.
• Among the two search-based approaches, NS is superior to GA. This observation is in line with the finding of the correlation analysis: as observed in Section 4.1, the novelty score used in NS has a (much) stronger positive correlation with the t-wise coverage than the similarity-based fitness used in GA. This is likely the primary reason for the superiority of NS over GA.
• Concerning the prioritization capacity, NS is promising even without employing any prioritization strategy, being even more effective than the algorithms in which prioritization is explicitly considered. This indicates the potential of NS for improving the fault detection ratio.

Comparison of different ways of generating new configurations (RQ3)
This part investigates how the performance of NS is affected by the ways (or strategies) in which new configurations are generated. The way of generating new individuals is an important aspect of any search-based approach. Currently, the unpredictable way, realized by the randomized SAT4J solver, forms a state-of-the-art approach. It remains open whether this approach can be further improved by using two types of solvers as in [59], [60], and whether it is necessary to introduce traditional genetic operators. The comparative study in this section answers these questions.

Experiment setup and results
We empirically evaluate the following three strategies for generating new configurations. The first one, described in Section 3.2 and currently in use, generates a random configuration each time, which is then handled by two different types of SAT solvers in case of infeasibility: probSAT [63] for repair and randomized SAT4J for replacement. In the remainder of this paper, we refer to this strategy as Random+TwoSAT. The second strategy follows the traditional way of evolutionary algorithms: genetic operators (uniform crossover and single-point mutation) are adopted to generate a new configuration which, as in the first strategy, is then handled by the two SAT solvers. We refer to this strategy as Genetic+TwoSAT hereafter. Comparison with this strategy allows us to explore how genetic operators perform relative to simple randomization. Finally, the third strategy relies solely on the randomized SAT4J to generate unpredictable configurations [4]. We refer to this strategy as Randomized SAT4J. Notice that querying a randomized SAT4J solver for unpredictable configurations is a state-of-the-art approach in SPL engineering [8], [57], [58], [59], [60], [61]. Choosing this strategy as a baseline enables us to investigate whether it is necessary to employ two different SAT solvers. Considering Table 9, the differences among the three strategies are not statistically significant on 4 out of the 6 large-scale FMs, regardless of the value of t. However, both Random+TwoSAT and Genetic+TwoSAT significantly outperform Randomized SAT4J on freebsd-icse11 and 2.6.28.6-icse11 for all values of t. Fig. 3 illustrates this behavior in the form of histograms: the t-wise coverage obtained by the first two strategies is clearly higher than that obtained by Randomized SAT4J, especially on freebsd-icse11.
Comparing Random+TwoSAT and Genetic+TwoSAT in Table 9, one finds that the genetic operators perform on par with simple randomization in generating new configurations.
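A minimal sketch of the Random+TwoSAT strategy is given below. Here `is_valid`, `repair` (standing in for the probSAT call) and `unpredictable` (standing in for the randomized SAT4J call) are hypothetical callables supplied by the caller; the real strategy sits on top of an actual SAT stack, not shown.

```python
import random

def random_plus_two_sat(num_features, is_valid, repair, unpredictable, pr=0.1):
    """Generate a candidate configuration at random; if it violates the
    FM constraints, repair it with probability pr (the P_r parameter),
    otherwise replace it with an unpredictable configuration."""
    cfg = [random.random() < 0.5 for _ in range(num_features)]
    if is_valid(cfg):
        return cfg
    if random.random() < pr:   # hand the broken candidate to the repair solver
        return repair(cfg)
    return unpredictable()     # otherwise ask for a fresh unpredictable config
```

Randomized SAT4J corresponds to calling `unpredictable()` directly, while Genetic+TwoSAT would replace the random candidate on the first line with the offspring of crossover and mutation.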

RQ3 summary
Experiments performed in this section show that different ways of generating new configurations perform similarly in most cases, but can indeed make a difference in some. In particular, the use of two SAT solvers should be encouraged, as it achieves (much) better or at least comparable performance relative to the state-of-the-art strategy that employs only one SAT solver. This conclusion is in line with the one drawn in [59], which states that the simultaneous use of two types of SAT solvers is more effective than the use of only one in the context of configuring SPLs. In this paper, we show, for the first time, that this is also the case for similarity-based testing of SPLs. Moreover, when aided by two SAT solvers, the ways of generating new configurations make little difference: the simple randomization method can be as effective as the genetic operators. The former, however, is much easier to use, as it requires no effort to tune control parameters such as the crossover and mutation probabilities involved in genetic operators. Therefore, its use is advisable.

Parameter study (RQ4)
From a practical point of view, it is important to provide some suggestions on the setting of the key parameters involved in the NS algorithm. As shown in Algorithm 1, the repair probability (P_r) and the neighbor size (N_b) are two key parameters that need to be carefully tuned.

Sensitivity analysis on P r
The parameter P_r ∈ [0, 1] determines the probability of using the probSAT solver; its impact on the t-wise coverage is investigated by examining some typical values. To this end, we vary P_r from 0.0 to 1.0 with a step size of 0.1, and present the t-wise coverage over 30 runs in the form of boxplots for each value of P_r. This enables us to observe the trend of the average t-wise coverage as P_r increases. As observed in Fig. 4, the t-wise (t = 2, . . . , 6) coverage fluctuates slightly as P_r increases from 0.0 to 1.0 on the two representative small-scale FMs (i.e., WebPortal and CocheEcologico), and tends to decline on the two moderate FMs (i.e., E-shop and axTLS), especially when P_r ≥ 0.5. This observation suggests that the value of P_r should not be too large. On the two large FMs (i.e., freebsd-icse11 and 2.6.28.6-icse11), the t-wise coverage improves markedly when P_r is changed from 0.0 to 0.1, and remains relatively stable for P_r ≥ 0.1. This implies that the value of P_r should not be too small (< 0.1), either. Considering all the above scenarios, it is reasonable to restrict P_r to the region [0.1, 0.5] for common usage.

Sensitivity analysis on N b
In a similar way, we conduct sensitivity analysis experiments on the parameter N_b. We vary N_b from 10 to 100 with a step size of 10. In addition, N_b = 2 is also examined; in this case, the novelty score of a solution is determined by its two closest neighbors (including itself). For each FM and each value of N_b, the t-wise coverage over 30 runs is exhibited in the form of boxplots. Moreover, we use a green line to connect adjacent average values, which makes it easy to observe how the average t-wise coverage changes.
We can identify the following three types of curves from Fig. 5.
• The Type-I curve, observed on the FMs from CounterStrikeSimpleFM to WebPortal (note that the order is from left to right, and from top to bottom), rises as N_b increases from 2, quickly reaching a peak around N_b = 10 or 20. It then declines until N_b reaches 70 or 80. Finally, the curve generally rises again up to N_b = 100. Note that the average t-wise coverage obtained on axTLS and WebPortal follows this curve roughly, rather than exactly. According to Table 1, the median number of free features for the FMs in this group is 37.5.
• The Type-II curve, observed on the FMs from Drupal to ecos-icse11, is unimodal. The only peak is obtained when N_b falls in the region [30, 50]. Note that the median number of free features for these FMs is 76.
• The Type-III curve, observed on the FMs from E-shop to 2.6.28.6-icse11, tends to increase with the value of N_b until around 80, and then drops slightly until the end (2.6.28.6-icse11 is an exception, for which the curve always increases). The median number of free features in this group is 2,218.
According to the above observations, the optimal value of N_b appears to be related to the model size, or more precisely, to the number of free features. In general, a higher value of N_b is required to obtain decent performance on larger feature models. This phenomenon can be explained as follows. As the number of (free) features increases, the size of the search space grows exponentially. It naturally follows that more points are needed to provide a more precise estimate of the density around a point. Therefore, the number of neighbors should be increased in this case.

RQ4 summary
The parameter study brings out the following outcomes. First, the NS algorithm prefers a relatively small P_r. In other words, fewer calls should be made to probSAT than to the randomized SAT4J. Generally, the algorithm is not sensitive to the value of P_r within [0.1, 0.5]. Second, the (nearly) optimal value of the parameter N_b varies with the size of the FMs: a higher N_b is required for larger models. This provides practical guidance for using NS in the context of SPL testing.

THREATS TO VALIDITY
In this section we briefly discuss threats to validity and how they could be mitigated. We consider threats to internal validity, external validity and construct validity.
Internal validity. This is concerned with any aspect that may lead to bias. Potential errors in the implementation of our algorithm and in the tools used for comparison could affect the presented results and lead to this type of threat. To diminish these threats, we carefully tested our implementation of NS by performing unit testing, and by analyzing the outcomes step by step on small FMs. The comparison with existing tools gave us confidence in our implementation. The algorithms used for comparison are implemented with the code provided by their authors. In addition, we make our implementation and experimental data publicly available 10 to enable reproducibility and to further reduce these threats. Due to the stochastic nature of the techniques under study, the outcome of one run can differ from the next. To reduce risks caused by random effects, we independently run each algorithm 30 times, and compare the performance based on means and standard deviations. In addition, to make comparisons reliable, non-parametric statistical tests are applied.
Internal validity threats may also be introduced in the computation of the t-wise coverage. Indeed, as discussed previously, it is difficult or even practically impossible to compute all the valid t-sets for large FMs and high t values. Therefore, the valid t-sets used to compute the coverage are sampled. This means that the presented t-wise coverage is not exact, except on small and moderate FMs for t = 2 (where the exact valid t-sets are used). To mitigate these threats, we use the same set of valid t-sets (either sampled or exactly generated) to compute the t-wise coverage of the configurations returned by each algorithm, which enables a relative performance comparison among the algorithms.
External validity. This threat commonly exists in software engineering research, and it concerns the degree to which we can generalize from the experiments. Indeed, we cannot guarantee that the proposed algorithms will provide similar results on different sets of FMs. This threat stems from the size of the feature models, as well as the number of constraints. To mitigate it, we use a relatively large set of 31 FMs, with sizes ranging from 24 to 18,434 features. Also, the number of constraints varies from 35 to 347,557. In addition, both realistic and randomly generated feature models are carefully selected to provide a variety of situations.
Construct validity. We choose t = 2, . . . , 6 in the empirical study. This is in line with the previous studies. The pairwise coverage (t = 2) is a widely used test criterion in SPL testing. According to [33], [34], [35], there is also a practical need to deal with high interaction strengths (t > 2). Therefore, choosing t = 2, . . . , 6 is of interest in practice.

RELATED WORK
In recent years, there has been a growing interest in SPL testing, as reflected by several surveys and systematic mapping studies on this topic [13], [77], [78], [79], [80], [81], [82], [83]. Some of these cover various factors and aspects of SPL testing, e.g., [77], [78], [79], [82], while others focus on a particular aspect, e.g., CIT-based approaches [13], [81], EC-based approaches [80] and test coverage criteria [82]. In this work, we focus on product sampling [10] for SPL testing. In Section 6.1, we briefly review some prominent t-wise sampling techniques. Since these techniques can hardly scale to large product lines and/or high interaction strengths, similarity is used as a surrogate metric for the t-wise coverage, making similarity-based SPL testing popular in this domain. Related similarity-based testing approaches are summarized in Section 6.2.

T-wise product sampling
Chvatal's algorithm [18] is a greedy algorithm originally proposed for the minimum set-covering problem, an NP-complete problem. Johansen et al. [12] adapted and improved this algorithm to approximate optimal solutions for the minimal covering array. A t-wise covering array is a subset of products that covers all the valid t-sets [15]; it is typically represented by a table with each row representing a feature and each column representing a product. With Chvatal, configurations are added incrementally until all the valid t-wise feature combinations are covered at least once. More specifically, the algorithm begins by generating all the t-wise feature combinations, denoted by U. Then, an empty configuration is created, and the feature combinations are added to this configuration one by one. Each time a new combination is added, the resulting configuration is checked by a SAT solver. If the configuration is valid and contains at least one uncovered combination, it is added to the sample set, and the newly added combination is removed from U. Otherwise, the combination is removed from the configuration. The creation of configurations continues until all the valid combinations are covered. Built upon the Chvatal algorithm [12] with several enhancements, such as identifying invalid feature combinations early and parallelizing the sampling process, the ICPL [19] algorithm is able to generate covering arrays for large-scale feature models, but it is limited to t = 3. For the largest model, 2.6.28.6-icse11, ICPL is only capable of generating 1-wise and 2-wise covering arrays.
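The greedy set-cover principle underlying this family of algorithms can be sketched as follows. Note this condenses the incremental, SAT-checked construction into a plain set-cover loop over pre-built candidate configurations; it is not Johansen et al.'s exact procedure. `covers` is a hypothetical callable returning the valid t-sets covered by a configuration.

```python
def greedy_cover(all_t_sets, candidates, covers):
    """Repeatedly pick the candidate configuration covering the most
    still-uncovered t-sets, until everything is covered or no candidate
    helps further."""
    uncovered = set(all_t_sets)
    sample = []
    while uncovered:
        best = max(candidates, key=lambda c: len(covers(c) & uncovered))
        gained = covers(best) & uncovered
        if not gained:          # no candidate covers anything new
            break
        sample.append(best)
        uncovered -= gained
    return sample
```

The classic set-cover analysis guarantees such greedy selection stays within a logarithmic factor of the optimal sample size, which is what makes Chvatal-style sampling attractive despite the NP-completeness of the underlying problem.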
Oster et al. [16] proposed the MoSo-PoLiTe (Model-based Software Product Line Testing) framework to generate a minimal set of products covering all pairs of feature combinations. This is achieved by combining binary constraint solving and forward checking [84]. The algorithm starts with the first pair of features and iteratively adds the remaining ones, applying forward checking to determine whether the selected pair can be combined with the remaining pairs to create a valid product. Furthermore, MoSo-PoLiTe allows including pre-defined products as part of the covering arrays. Experiments were conducted on nine realistic FMs, with the largest one containing 287 features. Similarly, Hervieu et al. [85] adopted constraint programming to generate minimal 2-wise covering arrays for SPLs.
The Incremental sampLing (IncLing) algorithm, proposed by Al-Hajjaji et al. [21], follows the general concept of ICPL [19], but incorporates several crucial modifications to improve performance, such as detecting invalid combinations at the beginning, using a feature-ranking heuristic, and detecting conditionally dead or core features. IncLing takes into account already generated and tested products when selecting the next product to be added to the sample. The next product is selected in a greedy manner to cover as many uncovered feature combinations as possible. Moreover, to increase the diversity among products, the algorithm chooses dissimilar pairwise feature combinations each time a new product is generated, thereby potentially increasing the possibility of fault detection. Experimental results on a corpus of real-world and artificial feature models demonstrated that IncLing is more efficient than existing sampling algorithms, e.g., Chvatal [12] and ICPL [19]. Although large models (including 2.6.28.6-icse11) were used in their empirical evaluations, the strength is limited to t = 2.
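The incremental greedy step of this kind of sampling can be illustrated with a small sketch. This is not the IncLing implementation itself; it assumes candidate products are already valid and simply shows how the next product is chosen to maximize newly covered pairwise combinations.

```python
from itertools import combinations

def pairwise_combos(cfg):
    """All pairwise (feature, value) combinations a configuration covers."""
    return set(combinations(sorted(cfg.items()), 2))

def incremental_sample(candidates, n):
    """IncLing-style greedy selection (sketch): among candidate valid
    products, repeatedly pick the one covering the most still-uncovered
    pairwise combinations, accounting for products already selected."""
    covered, sample, pool = set(), [], list(candidates)
    for _ in range(min(n, len(pool))):
        best = max(pool, key=lambda cfg: len(pairwise_combos(cfg) - covered))
        pool.remove(best)
        sample.append(best)
        covered |= pairwise_combos(best)
    return sample
```

Note that after the first pick, a candidate sharing many feature decisions with it contributes fewer new combinations, so a dissimilar candidate is preferred, which is exactly the intuition behind favouring diversity.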
Perrouin et al. [66] proposed a tool for generating t-wise samples from a feature diagram. Using a divide-and-conquer technique, the problem of t-wise sampling is divided into several sub-problems that are solved automatically. To do so, a given feature diagram and its interactions are first transformed into a set of constraints in Alloy, a formal modeling language. Then, a set of products providing t-wise coverage is generated from the resulting model. The evaluation of this tool was performed on a real-world feature model called AspectOPTIMA.
Garvin et al. [17] used simulated annealing to construct CIT samples of SPLs. The algorithm, named CASA, consists of two parts: an outer search and an inner search. The outer search, implemented by a binary search procedure, is concerned with the minimization of the sample size, while the inner search, implemented by simulated annealing, aims at finding a covering array. To this end, one value of the covering array is changed at a time, and the change is guided by the number of t-sets that are not covered. A SAT solver is used to determine whether the change should be accepted or rejected. Moreover, several strategies were incorporated to improve the performance, and experimental results confirmed the usefulness of these strategies. According to [17], CASA can handle relatively high strengths, with up to t = 4 for FMs with no more than 52 features. For models beyond this size, the algorithm timed out after 27 days when t = 4.
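The inner annealing search can be sketched as follows. This is a minimal illustration under simplifying assumptions, not CASA itself: products are dictionaries mapping features to Booleans, and the hypothetical `valid` predicate stands in for the SAT-solver check.

```python
import math
import random

def uncovered(sample, universe):
    """Number of t-sets in `universe` covered by no product in `sample`."""
    return sum(1 for combo in universe
               if not any(all(cfg.get(f) == v for f, v in combo)
                          for cfg in sample))

def anneal_covering_array(sample, universe, valid,
                          steps=2000, temp=1.0, cool=0.995, seed=0):
    """CASA-style inner search (sketch): flip one feature value at a time,
    reject changes the validity oracle refuses, and accept the rest with
    the usual annealing criterion, guided by the number of uncovered
    t-sets (lower is better)."""
    rng = random.Random(seed)
    sample = [dict(cfg) for cfg in sample]
    cost = uncovered(sample, universe)
    for _ in range(steps):
        if cost == 0:            # a covering array has been found
            break
        i = rng.randrange(len(sample))
        f = rng.choice(sorted(sample[i]))
        trial = dict(sample[i])
        trial[f] = not trial[f]  # change one value of the covering array
        if not valid(trial):     # stands in for the SAT-solver check
            continue
        new_sample = sample[:i] + [trial] + sample[i + 1:]
        new_cost = uncovered(new_sample, universe)
        # Accept improvements always, and worsenings with decaying probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            sample, cost = new_sample, new_cost
        temp *= cool
    return sample, cost
```

The outer binary search over the sample size, which repeatedly invokes this inner search, is omitted for brevity.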
Ensan et al. [22] devised, on the basis of genetic algorithms, an approach to generate test suites for SPLs. This algorithm starts with a set of randomly generated configurations, and gradually improves the quality of the test suite by iteratively applying genetic operators, i.e., crossover, mutation and natural selection. As opposed to the aforementioned approaches, Ensan's approach does not guarantee full t-wise coverage (only t = 1 is considered in [22]) because the fitness function used only indirectly measures coverage. It should be mentioned that new individuals produced by crossover and mutation are not necessarily valid. In this case, invalid ones are discarded and not included in the new generation. This may lead to scalability issues (according to [4], Ensan's approach does not scale beyond 300 features). Notice that, in addition to the t-wise coverage, code coverage [86] and mutation scores [87] have also been considered as criteria when applying genetic algorithms to the product sampling process.
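The discard-invalid-offspring scheme can be sketched as a generic GA loop. This is an illustrative simplification, not Ensan et al.'s implementation: individuals are Boolean tuples, `valid` is a hypothetical feature-model check, and the `crossover`/`mutate` operators are supplied by the caller.

```python
import random

def evolve(population, fitness, valid, crossover, mutate,
           generations=30, seed=0):
    """GA sketch in the style of [22]: offspring produced by crossover and
    mutation that violate the feature model (per `valid`) are discarded,
    and elitist selection keeps the fittest individuals."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        offspring = []
        while len(offspring) < size:
            p1, p2 = rng.sample(population, 2)
            child = mutate(crossover(p1, p2, rng), rng)
            if valid(child):  # invalid individuals are not kept
                offspring.append(child)
        # Natural selection: keep the `size` fittest of parents + offspring.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:size]
    return population
```

The loop retrying until enough valid offspring are produced also hints at the scalability issue noted above: when most offspring violate the constraints, generation becomes expensive.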
Marijan et al. [23] proposed a pairwise testing framework for SPLs. It integrates feature modelling, combinatorial interaction testing, and constraint programming techniques. In particular, the framework first extracts variability in an SPL as a feature model, and then employs an optimization algorithm to generate, within a given time interval, a minimal set of valid products covering all pairwise feature combinations. The optimization algorithm is implemented by constraint programming techniques, including a dedicated branch-and-bound algorithm. This proposal was evaluated on an industrial SPL with 78 features. The experiments revealed that this framework can reduce the sample size while guaranteeing full pairwise coverage.
Cmyrev and Reissing [24] suggested a new test approach to generate a test suite that is as small as possible while yielding full requirements coverage and full feature coverage (t = 1). Since finding the optimal set of test cases is very hard, a greedy algorithm and a simulated annealing approach were explored to find a nearly optimal set quickly. Experiments on two case studies (with at most 37 features) in the automotive industry showed that the greedy algorithm is superior to simulated annealing in terms of both effectiveness and efficiency.
According to [28], most of the aforementioned t-wise sampling techniques do not scale to large product lines and/or high interaction strengths. Indeed, small-sized FMs are used in most of the above prior works, often considering t = 1 or t = 2. As discussed in Section 1, there is a practical need to deal with larger FMs and higher strengths. To alleviate the limitation of existing t-wise sampling techniques, several approaches have been proposed to reduce the configuration space [88], [89]. These approaches, however, often require more domain knowledge to guide the sampling process. As an alternative to t-wise sampling, similarity-based sampling, which aims at generating a set of configurations achieving a degree of t-wise coverage by maximizing diversity among test cases, can bypass the scalability issues faced by t-wise sampling techniques. Moreover, it does not require more domain knowledge than the feature model, i.e., knowledge of valid combinations of features [7]. In the following, we give an overview of several prominent similarity-based SPL testing techniques.

Similarity-based SPL testing
In the context of software product lines, similarity has been exploited to partially but efficiently mimic the t-wise coverage for the created product samples. Henard et al. [4] considered the similarity between configurations as the fitness function, and used a (1+1) evolutionary algorithm to optimize it. The use of similarity as a surrogate metric for the t-wise coverage enables the algorithm to generate and prioritize test configurations for large SPLs and high strengths. Results in [4] suggest that two dissimilar configurations are more likely to cover a greater number of valid t-sets than two similar ones. This approach is supported by a test generation tool, called PLEDGE [61]. As an application of similarity testing, Henard et al. [90] evaluated the capability of test suites to kill mutants of feature models, i.e., erroneous feature models derived from an original one. Experiments demonstrated that dissimilar test suites kill (or detect) more mutants than similar ones, thus validating the effectiveness of similarity-based SPL testing. Further, Fischer et al. [40] showed that Henard's similarity approach is useful to detect interaction faults in the Drupal case study, in which real fault data are used.
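A similarity-driven fitness of this kind can be sketched as follows. This is a minimal illustration under the assumption that each configuration is represented by its set of selected features and that dissimilarity is measured by a Jaccard-style distance; it is not a reproduction of the exact formula in [4].

```python
def jaccard_distance(c1, c2):
    """Dissimilarity between two configurations, each given as the set of
    its selected features (a sketch; [4] uses a similar set-based distance)."""
    union = c1 | c2
    if not union:
        return 0.0
    return 1.0 - len(c1 & c2) / len(union)

def suite_fitness(suite):
    """Fitness of a test suite: the sum of all pairwise distances, to be
    maximized so that configurations are as dissimilar as possible."""
    return sum(jaccard_distance(a, b)
               for i, a in enumerate(suite) for b in suite[i + 1:])
```

A (1+1) evolutionary algorithm would then repeatedly perturb the suite and keep the variant with the higher `suite_fitness`.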
Al-Hajjaji et al. systematically studied similarity-based prioritization in SPL testing in four related papers [7], [9], [43], [44]. Similarity-based prioritization, which aims at increasing interaction coverage as fast as possible over time, selects configurations based on their performance with respect to similarity. The similarity (i.e., distance) between configurations in [7], [9] is calculated by taking the deselected features into account, which differs from the calculation in [4]. Furthermore, the approach proposed by Al-Hajjaji et al. [7], [9] incrementally selects configurations with the maximum of the minimum distances to those already chosen, rather than with the maximum of the sum of distances as in [4]. According to Al-Hajjaji et al. [7], [9], the sum of distances as a selection criterion may be misleading in some cases. In [43], Al-Hajjaji et al. extended this prioritization approach by considering delta models, which can be used to reason about the similarity of products on the solution-space level. Results showed that prioritizing products based on delta modeling can improve the effectiveness of SPL testing. In a more recent work [44], they discussed how the similarity between products of an SPL can be measured considering different types of information. In particular, they distinguished the input information for similarity calculation into problem-space information (e.g., feature-selection similarity, attributes similarity, instance similarity) and solution-space information (e.g., family models similarity). They also discussed the possibility of combining different types of information to form an overall similarity between two products. Finally, the calculation of similarity in different scenarios was implemented in an industrial tool, called pure::variants.
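The two distinguishing design choices above — a distance that also accounts for deselected features, and max-min rather than max-sum selection — can be sketched as follows. This is an illustrative simplification, not the implementation of [7], [9].

```python
def distance(c1, c2, features):
    """Distance accounting for deselected features as well, in the spirit of
    [7], [9]: the fraction of features whose selection states differ."""
    return sum((f in c1) != (f in c2) for f in features) / len(features)

def prioritize(configs, features):
    """Sketch of Al-Hajjaji-style prioritization: repeatedly pick the
    configuration whose *minimum* distance to the already-chosen ones is
    largest (rather than the maximum sum of distances as in [4])."""
    remaining = list(configs)
    ordered = [remaining.pop(0)]  # seed with the first configuration
    while remaining:
        best = max(remaining,
                   key=lambda c: min(distance(c, o, features)
                                     for o in ordered))
        remaining.remove(best)
        ordered.append(best)
    return ordered
```

The max-min criterion avoids the pitfall of the sum of distances, where a configuration far from most chosen ones but nearly identical to one of them can still score highly.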
Devroey et al. [45] assessed the SPL behavioural coverage of configurations sampled by two structural testing approaches, i.e., t-wise testing and similarity testing. To this end, they modelled SPLs in terms of feature diagrams and associated featured transition systems (FTSs), and then computed state, action and transition coverage for a set of sampled configurations. To obtain samples, the tools ICPL (t-wise) [19] and PLEDGE (similarity-based) [61] were used. In a subsequent study [46], they provided a configurable search-based approach to support single- and bi-objective similarity-driven test case selection for behavioural SPLs. Empirical results on four case studies showed the relevance of similarity-based test case generation for behavioural SPL models. Moreover, they examined different types of distances for measuring similarity between test cases, finding that the Hamming and Jaccard distances are the most efficient.
Lity et al. [47] proposed a similarity-based approach to reorder products in the context of incremental product line analysis. Most of the aforementioned coverage-driven product selection/prioritization techniques [4], [7], [9], [43], [45] select the most dissimilar product as the next product to be tested/analyzed, in order to increase the coverage of feature interactions as early as possible. In contrast, incremental analysis requires a product order where subsequent products are more similar to each other, which can potentially reduce the overall analysis effort. Lachmann et al. [48] employed a dissimilarity measure (the Jaccard distance) to avoid clustered test case orders. This is motivated by the fact that, in single-system model-based testing, dissimilar test cases detect more faults than similar ones [37]. There have also been several approaches exploiting similarity at the source code level, e.g., [91], [92], [93]. As pointed out by Al-Hajjaji et al. [44], adapting these source-code-level approaches to the context of software product line engineering may enhance the analysis of SPLs, including SPL testing.
Similarity has been successfully applied to mimic coverage criteria (e.g., the t-wise coverage) in an effective and scalable manner in prior works [4], [7], [9], [45], [46]. The suitability of similarity as a surrogate metric for the t-wise coverage has mainly been demonstrated by empirical results in an intuitive way. No work, to the best of our knowledge, has been done to reveal the internal relationship between similarity and the t-wise coverage. In this paper, we perform a correlation analysis to explore this relationship, observing that similarity metrics have a significantly positive correlation with the t-wise coverage. This finding consolidates the foundation of search-based similarity-driven SPL testing because improving similarity metrics (e.g., the one used by [4]) consequently increases the t-wise coverage. Moreover, GA has been extensively used in previous works [4], [40], [46], [61]. In this work, we explore, for the first time, the use of NS in similarity-based SPL testing because NS perfectly matches the testing goal, i.e., searching for a set of diverse configurations. We demonstrate that the novelty score has a (much) stronger positive correlation with the t-wise coverage than the fitness function used in GA [4]. This is in line with our empirical results, showing that NS (significantly) outperforms GA in most cases. In addition, we systematically investigate how the performance of NS is affected by its key parameters. Finally, we offer an alternative approach to generating new configurations in search-based SPL testing. This approach exploits two types of SAT solvers, as in [59], [60], and is found to be more effective than the state-of-the-art strategy [4], where only one SAT solver is used.
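The novelty score at the heart of NS is commonly computed as the average distance of an individual to its k nearest neighbours among the current population and an archive of past individuals. The sketch below illustrates this generic scheme; the exact score used in our approach is defined in the earlier sections, and the distance function here is a caller-supplied assumption.

```python
def novelty_score(cfg, others, distance, k=3):
    """Generic novelty score (sketch): the average distance from `cfg` to
    its k nearest neighbours among `others` (population plus archive).
    A higher score means a more novel, i.e., more dissimilar, configuration."""
    dists = sorted(distance(cfg, other) for other in others)
    nearest = dists[:k]  # k nearest neighbours (fewer if `others` is small)
    return sum(nearest) / len(nearest) if nearest else 0.0
```

Rewarding individuals with high novelty scores thus directly pushes the search toward dissimilar configurations, which is why the score correlates so well with the t-wise coverage.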

CONCLUSION AND FUTURE WORK
Testing SPLs is crucial to avoid fault propagation across products, but it is challenging because of the huge number of possible products to be tested and the limited test budget available in practice. T-wise combinatorial interaction testing (CIT) has been widely used to drastically reduce the size of a test suite, i.e., a set of test cases (products). However, it can hardly scale to large SPLs and high interaction strengths. As an alternative, similarity-based SPL testing has shown great potential in dealing with these scalability issues. However, the rationale of this testing technique deserves a more rigorous exploration. Moreover, genetic algorithms have been extensively used in search-based similarity-driven testing. It is meaningful to exploit the potential benefits of more suitable/powerful search algorithms in this context.
In this paper, we focus on sampling a set of products as dissimilar as possible by using a search algorithm called NS. We show, after performing a correlation analysis, that similarity metrics (including the novelty score used in NS) have a significantly positive correlation with the t-wise coverage, which well explains the rationale of similarity testing. Moreover, we give detailed discussions on the suitability of NS for similarity-based SPL testing, since NS perfectly matches the goal, i.e., searching for a set of diverse products. The suitability of NS and its superiority over some state-of-the-art approaches are demonstrated through comprehensive experimental studies performed on 31 feature models, either realistic or artificially generated, considering t = 2, . . . , 6. In particular, it is found that NS significantly outperforms the state-of-the-art genetic algorithm [4] in 77% of all the test scenarios. Finally, concerning the way of generating new configurations, our results suggest that the use of two SAT solvers should be encouraged (while the use of genetic operators can be optional) in the context of similarity-based SPL testing.
Our results show that NS is promising for similarity-based testing of SPLs. In the future, it is possible to integrate objective functions [8], such as test suite cost and coefficient of connectivity-density, into the framework of NS to handle additional objectives. Currently, we focus on using NS to generate product samples, without explicitly considering prioritization. Developing NS-based techniques that take both aspects into consideration is one of our subsequent studies. Finally, we intend to investigate whether looking for novelty can bring benefits to other topics related to SPL testing, such as mutation-based test case generation [87] and behavioural SPL testing [46].

Miqing Li is an assistant professor in the School of Computer Science at the University of Birmingham. His research is principally on multi-objective optimisation, where he focuses on developing population-based randomised algorithms (mainly evolutionary algorithms) for both general challenging problems (e.g., many-objective optimisation, constrained optimisation, robust optimisation, expensive optimisation) and specific challenging problems (e.g., those in software engineering, system engineering, product disassembly, post-disaster response, neural architecture search, reinforcement learning for games).