BIAS: A Toolbox for Benchmarking Structural Bias in the Continuous Domain

Benchmarking heuristic algorithms is vital to understand under which conditions and on what kinds of problems certain algorithms perform well. Most benchmarks are performance based, testing algorithm performance under a wide set of conditions. There are also resource-based and behavior-based benchmarks, which test the resource consumption and the behavior of algorithms. In this article, we propose a novel behavior-based benchmark toolbox: BIAS (Bias in Algorithms, Structural). This toolbox can detect structural bias (SB) per dimension and across dimensions based on 39 statistical tests. Moreover, it predicts the type of SB using a random forest model. BIAS can be used to better understand and improve existing algorithms (removing bias) as well as to test novel algorithms for SB in an early phase of development. Experiments with a large set of generated SB scenarios show that BIAS is successful in identifying bias. In addition, we provide the results of BIAS on 432 existing state-of-the-art optimization algorithms, showing that different kinds of SB are present in these algorithms, mostly toward the center of the objective space or in the form of discretization behavior. The proposed toolbox is made available open-source, and recommendations are provided for the sample size and hyper-parameters to be used when applying the toolbox to other algorithms.


I. INTRODUCTION
THE MODERN world has an ever-growing need for good heuristic optimization algorithms due to the large amounts of data and increasingly difficult problems to be optimized. Since no overall best heuristic exists [1], we have to benchmark heuristics to understand which one is better under which conditions. Benchmarking can be performance based, resource based, or behavior based. Most benchmarks for continuous optimization are performance based. An example is the well-known black-box optimization benchmark (BBOB) [2] test suite. Performance-based benchmarks aid in learning about the performance of one algorithm and its comparison with other algorithms in different situations. For example, one algorithm could perform very well on separable functions and another on uni-modal, high-conditioned functions. Resource-based benchmarks show the amount of resources (computation power/memory/energy) required under certain conditions. However, these types of benchmarks do not easily allow the analysis of the "behavior" of these algorithms under different circumstances. Behavior is, for example, how a population of candidate solutions moves in a swarm optimization algorithm, either dependently or independently of the function to optimize. An example of a behavior-based benchmark is RobustBench [3], an adversarial attack benchmark designed to test the robustness of image-classification deep neural networks. Behavior-based benchmarks can be used to learn additional information about the behavior of an algorithm under different conditions. Here, we concentrate on a particular kind of behavior-based performance estimate: structural bias (SB). SB is a form of bias in the objective space, inherent to the iterative heuristic optimizer, that also affects the performance of the optimization algorithm.
Detecting whether, when, and what type of SB occurs in a heuristic optimization algorithm can provide guidance on what needs to be improved in these algorithms, besides helping to identify conditions under which such bias would not occur. In many cases, SB can be avoided by slightly redesigning an algorithm component or by using different hyper-parameters. However, SB is hard to detect when one is not specifically looking for this kind of issue. Therefore, in this article we propose a toolbox to automatically benchmark continuous optimization algorithms against a portfolio of eleven different SB scenarios. We aim to answer the following research questions.
RQ1: How to determine whether a heuristic continuous optimization algorithm suffers from SB?
RQ2: How to determine the type of SB suffered by a heuristic continuous optimization algorithm?
RQ3: Which heuristic continuous optimization algorithms suffer from what type of SB, and under what conditions?
To answer RQ1, we evaluate a large set of statistical tests based on 1500 repetitions of 194 different artificially generated parameterized distributions covering 11 different scenarios of SB. The proposed toolbox can be used to detect potential bias issues in newly developed algorithms as well as to benchmark already existing optimization algorithms. To answer RQ2, we propose a machine learning approach to identify which SB scenarios are most likely occurring in a given optimizer based on the resulting test statistics. The complete SB benchmark statistical test suite (BIAS), data generators for different SB scenarios, and the machine learning model for identifying different SB scenarios are provided open-source [4]. To answer RQ3, we adopt the proposed toolbox to detect SB in 432 different continuous optimization algorithms. We also discuss insights provided by the resulting benchmarking of these algorithms and potential future research directions.
This article is structured as follows: in Section II, SB is introduced and explained in detail. The main components of our proposed BIAS toolbox are also briefly introduced. In Section III, all considered statistical tests for uniformity are introduced, including several newly proposed tests specifically for SB. In Section IV, the SB scenarios and experiments are explained in detail. In Section V, an analysis of the statistical tests is conducted, answering RQ1. In Section VI, the machine learning approach used to identify the type of SB is introduced and evaluated, answering RQ2. In Section VII, existing heuristic continuous optimization algorithms are benchmarked, answering RQ3. Finally, we end with recommendations and conclusions in Section VIII.

II. STRUCTURAL BIAS
Put simply, algorithms are tools applied to particular problems. Just like a good hammer works well on many nails, we would ideally like algorithms to deliver good solutions for many problems. However, as postulated by the so-called No-Free-Lunch theorem, no best optimizer exists over all possible optimization problems [1]. What is possible, though, is algorithms specializing in certain kinds of problems. Identifying sets of such suitable problems for each algorithm is not a trivial problem [5], and many landscape features have been investigated to understand the success of some algorithms on some kinds of functions [6]. However, what cannot practically be a feature uniting such problems is the location of optima in the domain: having an algorithm that consistently finds optima only when they are located at the origin is of no use. Therefore, good algorithms should not be biased toward specific locations of the search space, e.g., toward solutions at the origin, the center, or on the borders of the search space. Extrapolating such reasoning, a good optimization algorithm should be able to find the optima regardless of where exactly they are located within the domain. Or, even stronger, a good algorithm should ideally locate solutions anywhere in its domain with equal "effort." In the case of iterative optimization algorithms, points sampled during the initialization "move" within the domain defined by its boundaries under the influence of the algorithm's operators and potentially improve the values of the objective function during the optimization run, following some kind of "survival-of-the-fittest" logic. In effect, such movement of the algorithm toward the optima is steered by the differences in the values of the objective function at the sampled points or their derivatives of some kind. Any feedback that is external to the objective function or domain knowledge might hinder such progression toward the optima.
Such external feedback stemming from the iterative nature of the algorithm is referred to here as SB.
Because of the high interdependence between the fitness landscape and the information on the fitness obtained from this cyclical application of the algorithm's operators, the SB contribution during the search for optima cannot easily be unveiled other than by means of a specific objective function capable of nullifying such interaction over multiple optimization runs. The f0 function, first introduced in [7], serves this purpose and can be used to decouple these effects, thus separating the SB component, arising from algorithmic design choices, from the main driving force represented by the sampled differences in the fitness landscape. Function f0 is a "truly" random problem, having the simple analytical representation

f0 : [0, 1]^n → [0, 1], x ↦ U(0, 1) (1)

that is, every evaluation of f0 returns a value sampled from U(0, 1), independently of the point x being evaluated.
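The behavior of f0 can be illustrated with a minimal sketch (an illustration, not the toolbox's implementation): every evaluation returns a fresh uniform random value, regardless of the candidate solution.

```python
import numpy as np

_rng = np.random.default_rng(0)

def f0(x):
    """A 'truly' random objective: the returned fitness is drawn from
    U(0, 1) independently of the candidate solution x, so no location
    in the domain is systematically favored by the function itself."""
    x = np.asarray(x)  # accepted but deliberately ignored
    return _rng.uniform(0.0, 1.0)
```

Because the fitness carries no information about x, any non-uniformity in the final best positions over repeated runs must come from the algorithm's own operators.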

A. Existing Methodology for Measuring SB
As explained in Section II, a heuristic optimization algorithm that does not exhibit SB should be able to find randomly placed (uniformly distributed) optima of the special objective function f0 with equal difficulty/ease. Therefore, the problem of determining whether a heuristic optimization algorithm suffers from SB can be reduced to the problem of checking whether the best solutions found by multiple runs of this algorithm when minimizing f0 are uniformly distributed. However, such a uniformity check is not a trivial task due to the many forms in which SB can manifest itself for different algorithms (see Section II-B) and under different algorithmic configurations/parameter settings. To complicate things further, the challenges encountered while testing for SB depend on the experimental setup used to execute the optimizers; e.g., the number of performed runs might be adequate for classic performance-based assessment but not for detecting SB. Moreover, testing for SB also depends on the computational budget allocated to the algorithm and/or the employed termination criterion. Indeed, the study in [8] has shown that when SB emerges during the evolutionary process (for several versions of the Differential Evolution algorithm), it only grows stronger during the remaining evaluations. This means that practitioners using different experimental settings might end up looking at biases of different strengths for the same algorithm, which makes direct comparisons unfair. These details can make it difficult to devise a practical and robust procedure for SB detection.
Up until now, methods to check the uniformity of the distribution of best solutions over multiple runs included visual or statistical inspections, briefly discussed in Sections II-A1 and II-A2, respectively.
1) Visual Test: Displaying locations of the best solutions collected from multiple runs in so-called "parallel coordinates" [9] appears to be the most effective way of visualizing SB on a multidimensional problem [7]. This approach is easily reproducible, graphically valid, and hence convenient. However, when a large number of images is generated [8], [10]-[12], visual inspection can become too laborious. Such an approach is also clearly subjective and therefore not reliable for cases where graphical artefacts or unclear patterns cannot be judged by the naked eye.
(Footnote 1: Throughout this article, we refer to 1-D uniform sampling within [a, b] as U(a, b) and to sampling from the normal distribution with mean μ and standard deviation σ as N(μ, σ). Footnote 2: Or maximize.)
2) Statistical Test: Let us consider a heuristic optimization algorithm that was run N times to solve the problem of minimizing f0. At the end of each run i, the best solution x^(i) found by the algorithm is recorded, where naturally x^(i) ∈ [0, 1]^n. The random sample {x^(1), x^(2), ..., x^(N)} represents the set of best solutions retrieved by the N runs of the algorithm. Assume that {x_d^(1), x_d^(2), ..., x_d^(N)} is the sample of dth components of these best solutions, drawn from some distribution F_d; testing for SB in dimension d then amounts to testing the null hypothesis H0 : F_d ∼ U(0, 1). The Kolmogorov-Smirnov test [13] was first employed for this purpose with a sample of size N = 50 and significance level α = 0.01 in [7]. Subsequently, following the "power analysis" performed in [14] across three common tests, namely, the Kolmogorov-Smirnov, Cramér-von Mises [15], and Anderson-Darling [16] tests, the latter was chosen and used in combination with the Benjamini-Hochberg [17] correction method for multiple comparisons to achieve higher statistical power. However, it was noted that the original sample size was not adequate for testing all algorithms under investigation. Hence, a higher number of runs, N = 100, had to be used for some algorithms in order to achieve a satisfactory level of statistical power. Similar problems were encountered in a further study on SB in a subclass of Estimation of Distribution Algorithms [18], even when using an aggregated measure of SB defined as the sum of the statistically significant (across all dimensions) test statistics of the Anderson-Darling test.
It was concluded that the described statistical approaches can effectively detect most cases of "strong" SB but are deficient in other scenarios, including cases of clear "mild" SB that can be identified with the visual approach. Reasons for these discrepancies between the two methodologies might be the conservative nature of the employed tests combined with the relatively low sample size. Indeed, more accurate SB detection results could be obtained with N = 600, as used in [8], instead of N = 100.
Using a large sample size (N = 600) indeed seems to catch bias more often, but still gives no guarantee of detecting all different kinds of SB, at any significance level [8]. Even larger sample sizes are necessary for smaller significance levels, higher desired power, and smaller sizes of the deviations to be identified [19]. However, given the limited computational resources available to run heuristic optimization algorithms, it is not always possible to obtain (very) large sample sizes. Therefore, tests better able to detect significant deviations from uniformity given limited sample sizes are desirable for detecting SB. We study this in more detail in Section V-B.
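The effect of sample size on detection power can be illustrated with a small simulation (a sketch using an artificial "mild centre-bias" generator, not one of the toolbox's scenarios): a Kolmogorov-Smirnov test rarely rejects a mildly center-biased sample at N = 50, but does so far more often at N = 600.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def mildly_center_biased(n):
    # Normal(0.5, 0.3) with rejection sampling into [0, 1]: a mild centre bias.
    x = rng.normal(0.5, 0.3, size=4 * n)
    return x[(x >= 0.0) & (x <= 1.0)][:n]

def rejection_rate(n, reps=200, alpha=0.05):
    # Fraction of repetitions in which the KS test rejects uniformity.
    pvals = [stats.kstest(mildly_center_biased(n), "uniform").pvalue
             for _ in range(reps)]
    return np.mean(np.array(pvals) < alpha)

rate_small, rate_large = rejection_rate(50), rejection_rate(600)
print(rate_small, rate_large)  # the larger sample detects the bias far more often
```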
Regardless of the above, there is a clear need for a better automated statistical testing procedure to detect SB readily available to the community as a software package.

B. Known Deviations From Uniformity Due to SB
Formulating a good statistical measure of SB has turned out to be difficult (see Section II-A2) due to the wide range of potential deficiencies of the distributions of locations of final best points produced by an algorithm in a series of independent runs, i.e., the many aspects in which samples from such distributions can deviate from uniformity.
The following deviations have been observed for different kinds of iterative heuristic optimization algorithms [7], [8], [10], [14], [18], [20]:
1) In terms of the proportion of the continuous domain occupied:
a) sample values do not span the whole domain, i.e., the sample is uniform on an interval smaller than [0, 1] ("centre-bias");
b) values in a sample cover a discrete set of regularly spaced values within [0, 1] ("discretization-bias").
2) In terms of locations of points, sample points exhibit:
a) local clustering within the domain or clustering around several clusters within the domain ("cluster-bias");
b) clustering in certain parts of each dimension: on the bounds, in one or both sides of each dimension ("bound-bias");
c) one or more large empty gaps consistently identified in all dimensions ("gap-bias").
3) In terms of correlations between different dimensions of the sample points: present or not.

C. BIAS Toolbox
Our proposed BIAS (Bias in Algorithms, Structural) toolbox fills the gap in terms of the need for a better-automated procedure to detect SB and identify its type. In this article, we discuss the different components which constitute the proposed BIAS toolbox. These components include 39 statistical tests and a procedure to aggregate them for detecting SB (Section III), a random forest (RF) model to identify the type of SB (Section VI), the code for the f0 function described in Section II, and the code for the generation of the statistical bias decisions and plots (of which an example is illustrated in Fig. 3).
The BIAS toolbox is available as a Python package at [4]. It provides a clear decision on whether or not SB is present in the sample of final positions provided by the user and, in case some SB is found, an assessment of the possible type of SB observed in the sample using an RF model. In addition to SB detection, the toolbox also contains the functionality needed to sample data from the scenarios in Table I, thus providing the means to benchmark other statistical tests for detecting SB.
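The overall procedure that the toolbox automates can be sketched as follows (a self-contained illustration using plain random search rather than the toolbox's own API): run an optimizer N times on f0, record the best solution of each run, and test every dimension for uniformity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def random_search_on_f0(dim, budget):
    # f0 assigns i.i.d. U(0,1) fitness values, so the best point of a run
    # is simply the candidate that happened to draw the lowest value.
    candidates = rng.uniform(size=(budget, dim))
    fitness = rng.uniform(size=budget)          # f0 evaluations
    return candidates[np.argmin(fitness)]

N, dim = 50, 5
final_bests = np.array([random_search_on_f0(dim, budget=100) for _ in range(N)])

# One uniformity test per dimension (KS here as a stand-in for the full battery).
pvals = [stats.kstest(final_bests[:, d], "uniform").pvalue for d in range(dim)]
print(pvals)  # random search is unbiased, so few (if any) dimensions are rejected
```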

III. TESTS
As explained in Section II-A2, the goodness-of-fit test previously adopted for detecting SB (Anderson-Darling) is deficient in several cases where SB is clearly visible even with a sample size of 100. More powerful tests are therefore desirable. This section introduces the existing and newly proposed goodness-of-fit tests evaluated in this article for the purpose of detecting SB in samples of limited size. This wide selection of tests includes both classical statistical tests and other tests that have demonstrated competitive power for testing uniformity in the statistical literature [21], [22]. All of these tests are used to test the null hypothesis H0 : F_d ∼ U(0, 1) introduced in Section II-A2.

A. Tests Per Dimension
The following tests are designed to work on an individual dimension. However, by aggregating all data we can run these tests on a sample that is effectively n times larger. If these tests are run on a per-dimension basis, correction strategies need to be applied to deal with the multiple-comparison problem. For this purpose, the Benjamini-Hochberg (BH) method was proposed originally, since it is less stringent than the standard Bonferroni method. However, in this work, we also investigate the effects of other multiple-comparison correction methods, which is discussed in Section V-D.
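The Benjamini-Hochberg step-up procedure used here can be sketched as follows (a minimal reimplementation for illustration; libraries such as statsmodels provide equivalent functionality):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up BH procedure: sort p-values, find the largest k with
    p_(k) <= alpha * k / m, and reject the k hypotheses with the
    smallest p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index meeting its threshold
        reject[order[:k + 1]] = True
    return reject

# Per-dimension p-values for one algorithm: only the clearly small ones survive.
print(benjamini_hochberg([0.001, 0.020, 0.040, 0.800]))  # [ True  True False False]
```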

1) 1-Spacing-Based Test (1-Spacing) [4], [23] 3 :
To detect cases where the bias takes the form of large spacing differences or clustering in each dimension, we can test the distribution of distances between consecutive points (or points with m neighbors between them, with m ∈ {1, 2, 3} here and in the two subsequent tests). This distribution can then be compared to the distribution obtained from a large sample of truly uniform random samples using a 2-sample KS test. This test and the two subsequent tests in the list belong to a wider class of tests called m-spacing tests.
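A sketch of this idea, assuming the comparison is made against pooled spacings from many uniform reference samples of the same size (the toolbox's exact reference construction may differ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def spacings(sample):
    # Distances between consecutive order statistics (1-spacing, m = 0 neighbors).
    return np.diff(np.sort(sample))

def one_spacing_test(sample, n_ref=2000):
    # Pool spacings from many truly uniform samples of the same size,
    # then compare with a 2-sample KS test.
    n = len(sample)
    ref = np.concatenate([spacings(rng.uniform(size=n)) for _ in range(n_ref)])
    return stats.ks_2samp(spacings(sample), ref).pvalue

clustered = np.clip(0.5 + 0.002 * rng.normal(size=50), 0, 1)  # tight cluster
p_clustered = one_spacing_test(clustered)
p_uniform = one_spacing_test(rng.uniform(size=50))
print(p_clustered, p_uniform)  # clustered spacings are far too small: tiny p-value
```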

4) Sample Range-Based Test (Range) [4] 3,4 :
The range taken up by the samples in each dimension can be useful in detecting SB where the points are far removed from the boundaries. The same can be done using the sample extrema (next two items).

8) Transformed AD Test (tAD):
To boost the power of the AD test in some of the most common cases of SB, where the bias occurs near either the boundaries or the center of the space, a transformation can be applied to the samples by taking their distance to the nearest boundary [20], [24]. This can then be tested using an AD test with domain [0, (1/2)]. 9) Shapiro Test (Shapiro): Instead of testing the samples for uniformity directly, we can transform them to their normal counterpart, and check for normality using tests like the Shapiro test [25], [26].
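The boundary-distance transformation behind the tAD test can be sketched as follows. Under H0, the distance to the nearest boundary, y = 0.5 − |x − 0.5|, is itself uniform on [0, 1/2], so 2y can be fed to any standard uniformity test (KS is used below as a stand-in for AD):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def transformed_uniformity_pvalue(sample):
    # Distance to the nearest boundary; uniform on [0, 1/2] under H0.
    y = 0.5 - np.abs(np.asarray(sample) - 0.5)
    return stats.kstest(2.0 * y, "uniform").pvalue  # rescale to [0, 1]

centered = np.clip(rng.normal(0.5, 0.05, size=100), 0, 1)  # strong centre bias
p_biased = transformed_uniformity_pvalue(centered)
print(p_biased)  # very small: the centre bias is detected
```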
10) Jarque-Bera Test (JB): This test, obtained through the use of the Lagrange multiplier test on the Pearson family of distributions [27], [28], is also applied to transformed samples to test for their normality.
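The transformation to a "normal counterpart" used by the Shapiro and JB tests can be sketched as follows: if x ~ U(0, 1), then the inverse normal CDF maps it to N(0, 1), so a normality test on the transformed sample doubles as a uniformity test on the original one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def jb_uniformity_pvalue(sample):
    # Map U(0,1) samples to their normal counterpart via the inverse normal CDF,
    # then test the result for normality with Jarque-Bera.
    eps = 1e-12                                  # guard against ppf(0) = -inf
    z = stats.norm.ppf(np.clip(sample, eps, 1 - eps))
    return stats.jarque_bera(z).pvalue

narrow = rng.uniform(0.4, 0.6, size=600)  # centre bias: uniform on a sub-interval
p_biased = jb_uniformity_pvalue(narrow)
print(p_biased)  # very small: the transformed sample is clearly not normal
```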

11) Minimum Linear Distance-Based Test (LD-Min) [4] 4 :
The minimum distance between a set of samples and a linearly uniform distribution using the same range and sample size.

12) Maximum Linear Distance-Based Test (LD-Max) [4] 4 :
The maximum distance between a set of samples and a linearly uniform distribution using the same range and sample size.

13) Kurtosis-Based Test (Kurt):
This test 3,4 measures how differently the tails of a distribution are shaped compared to the tails of the normal distribution. Kurtosis is defined as the fourth standardized central moment of the random variable of the probability distribution [26], [29]. Note that this test is applied after transforming the data to its normal counterpart.
14) Minimal Minimum Pairwise Distance-Based Test (MPD-Min) [4] 3,4 : The minimum of the pairwise distances between neighbors. If the minimum distance between neighbors is too small, this can indicate dense clusters.
15) Maximal Minimum Pairwise Distance-Based Test (MPD-Max) [4] 3,4 : Same as MPD-Min but now comparing the maximum instead. If the maximum distance between neighbors is too large, this indicates uncovered areas of the objective space.
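In one dimension, these statistics reduce to the extreme gaps between consecutive sorted points; a sketch:

```python
import numpy as np

def mpd_min_max(sample):
    """Minimum and maximum distance between neighboring (sorted) points."""
    gaps = np.diff(np.sort(np.asarray(sample)))
    return gaps.min(), gaps.max()

rng = np.random.default_rng(11)
# A sample avoiding (0.4, 0.8): the maximum gap exposes the uncovered region.
gapped = np.concatenate([rng.uniform(0.0, 0.4, 25), rng.uniform(0.8, 1.0, 25)])
min_gap, max_gap = mpd_min_max(gapped)
print(min_gap, max_gap)  # max gap >= 0.4, signalling an avoided region
```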

16) Wasserstein Distance-Based Test (Wasserstein) [4] 3 : Originally proposed for optimal allocation problems [30] and rigorously formulated by Kantorovitch in [31], this metric measures the distance between two probability distributions defined on a metric space. This result is exploited in this study to build the Wasserstein test for uniformity.
18) Kolmogorov-Smirnov Test (KS) [13], [24]: This is a nonparametric test based on the maximum distance between the empirical cumulative distribution function (ECDF) and the CDF of the null distribution. While this test is most commonly used to compare two samples, it can also be used for goodness-of-fit testing.
19) Cramér-von Mises Test (CvM) [22], [34]: This test is similarly based on the expected squared difference between the ECDF and the true CDF of the null distribution.

20) Durbin Test (Durbin):
To test for uniformity, the Durbin C n test [35] compares the cumulated periodogram graph of the residuals from a least-squares regression with the Kolmogorov-Smirnov limits. This is based on previous results showing that when the employed test statistic is calculated with a generic number 2m + 1 of samples then it is distributed as the mean of m − 1 independent uniform variables [22], [36].

21) Kuiper Test (Kuiper):
This is an extension of the standard Kuiper test obtained by replacing the original distribution with Pyke's modified empirical distribution function as proposed in [37] and implemented in [22]. This makes the test suitable for showing that a population has a prescribed continuous distribution function, including uniform.

22) 1st Hegazy-Green Test (HG1):
This test refers to the variant of the Hegazy-Green tests for uniformity proposed in [38] and implemented in [22] employing the T 1 test statistic defined as the average of the absolute differences between samples and their expected value.

23) 2nd Hegazy-Green Test (HG2):
This is the second variant of the Hegazy-Green test in [38] and [22] employing the T 2 test statistic defined as the average of the squared differences between samples and their expected value.

24) Greenwood Test (Greenwood):
This test is based on a spacing statistic that can be used to evaluate events in time or locations in space by testing how the intervals between them are distributed [22], [39].
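Greenwood's statistic is the sum of squared spacings (including the boundary gaps); it is minimized by perfectly even spacing and grows when points cluster. A sketch, under the convention that the domain endpoints 0 and 1 are included as artificial boundary points:

```python
import numpy as np

def greenwood_statistic(sample):
    """Sum of squared spacings D_i, with the domain boundaries 0 and 1
    included so that the n+1 spacings sum to 1."""
    s = np.concatenate(([0.0], np.sort(np.asarray(sample)), [1.0]))
    return float(np.sum(np.diff(s) ** 2))

even = np.arange(1, 10) / 10.0          # perfectly even: 10 spacings of 0.1
clustered = np.full(9, 0.5)             # all mass at one point
g_even = greenwood_statistic(even)
g_clustered = greenwood_statistic(clustered)
print(g_even, g_clustered)  # ~0.1 (the minimum, 1/(n+1)) vs 0.5 (two gaps of 0.5)
```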

26) Read-Cressie Test (RC):
This test [22], [41] belongs to a family of goodness-of-fit tests based on the power divergence statistic.

27) Moran Test (Moran):
This test [22], [42] operates via the distribution of the sum of squares of intervals into which the domain is divided, using a homogeneity of variances test.

29) 2nd Cressie Test (Cressie2):
This test [22], [44] operates via the family of statistics based on m-spacing gaps, obtained by summing a suitably regular function of each spacing gap.

30) Vasicek Test (Vasicek):
As entropy can be used to characterize distributions, this test uses an estimate of entropy based on higher order (m > 1) m-spacings to test for uniformity [22]. Entropy had been originally adopted by Vasicek [45] for testing the hypothesis of normality.

31) Swartz Test (Swartz):
Instead of using entropy, de Micheaux and Tran [22] and Swartz [46] use the Kullback-Leibler information to derive a statistic to test for uniformity. This statistic is also based on m-spacings.

32) Morales Test (Morales):
This test [22], [47] compares an empirical and a hypothetical distribution based on the limit laws for a statistic based on φ-disparity [48], [49] of m-spacings. 33 [13] and other new goodness-of-fit test statistics that were demonstrated to be more powerful [51]. Zhang1 is derived based on this parameterized test statistic using likelihood ratio statistics [22], [51].
36) 2nd Zhang Test (Zhang2): This test statistic [22], [51] is derived in a similar way to Zhang1, but is based on an approximation.

B. Tests Across Dimensions
In addition to the tests which work on a per-dimension basis, we can also perform tests on the full set of 30-D data at once. This can be done by grouping together the samples or distances and performing the same tests as in the per-dimension setting on the aggregated data. For this purpose, we use all tests discussed above, with the exception of the sample limits-based tests and the LD, MPD, and Wasserstein tests. It can also be done by using dedicated across-dimension tests. In this work, we use the following across-dimension tests.

1) Mutual Information-Based Test (MI):
This test 3,4 is based on the fact that the mutual information [52], [53] between variables of a uniform random distribution should be close to zero. If the MI between two variables is higher, this suggests bias toward specific values in the domain shared by these two variables (dimensions). The median mutual information is the median over all pairs of dimensions.
The mutual information between two random variables U and V is defined as follows:

I(U, V) = Σ_{u,v} p(u, v) log [ p(u, v) / (p(u) p(v)) ]

where p(u, v) is the joint distribution of U and V, and p(u) and p(v) are their marginal distributions.
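A simple histogram-based estimate of this quantity can be sketched as follows (an illustration; the toolbox's estimator may use a different discretization):

```python
import numpy as np

def mutual_information(u, v, bins=10):
    """Histogram-based MI estimate for samples u, v in [0, 1]."""
    joint, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
    p_uv = joint / joint.sum()
    p_u = p_uv.sum(axis=1, keepdims=True)  # marginal of u (column vector)
    p_v = p_uv.sum(axis=0, keepdims=True)  # marginal of v (row vector)
    mask = p_uv > 0
    return float(np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u @ p_v)[mask])))

rng = np.random.default_rng(9)
u = rng.uniform(size=10000)
mi_indep = mutual_information(u, rng.uniform(size=10000))
mi_dep = mutual_information(u, u)
print(mi_indep, mi_dep)  # near 0 for independent dims, near log(10) when v = u
```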

2) Maximum Minimum Pair-Wise Distance-Based Test (MMPD) [4] 3,4 :
This test is based on the maximum distance between two neighboring points in a (multi)dimensional space. This distance should not be significantly different from the same distance metric computed over a random uniform sampling. A significantly higher distance means that there is a region of the search space to which the algorithm is more attracted (and, consequently, a region that is avoided); in other words, bias.

3) Maximum Difference Per Dimension Between Linear Uniform Distribution-Based Test (MDDLUD) [4] 3,4 :
This test is the multidimensional equivalent of the LD-Max test, where we aggregate using either the maximum or the median across all 30 dimensions.

C. Critical Values for All Tests
To determine rejection based on the test statistic, we need to either calculate the corresponding p-values or check whether the test statistic exceeds the corresponding critical value. Several of the tests we include compute a p-value by default; for the others, we use critical values. To get accurate estimates of these values, we use a Monte Carlo simulation with 100 000 samples of the test statistic under the uniform distribution, from which we determine the α-quantiles for α ∈ {0.01, 0.05}. The Monte Carlo test is a well-known procedure for implementing hypothesis tests [54]. It enables calculating the critical values when the true (sampling) distribution of the test statistic is unknown. The resulting critical values calculated using this procedure are available at [55].
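The Monte Carlo procedure for critical values can be sketched as follows, using the sample range as an example statistic (a sketch with a reduced simulation count; the toolbox uses 100 000 samples). Center bias shrinks the range, so the relevant rejection region is the lower tail:

```python
import numpy as np

rng = np.random.default_rng(2024)

def mc_critical_value(stat_fn, n, alpha=0.05, n_sim=20000):
    """alpha-quantile of stat_fn over n_sim samples of size n from U(0, 1)."""
    sims = np.array([stat_fn(rng.uniform(size=n)) for _ in range(n_sim)])
    return np.quantile(sims, alpha)

sample_range = lambda x: x.max() - x.min()
crit = mc_critical_value(sample_range, n=50)
print(crit)  # roughly 0.91: ranges below this are suspiciously small at alpha = 0.05

biased = rng.uniform(0.2, 0.8, size=50)  # centre-biased sample, range <= 0.6
print(sample_range(biased) < crit)       # flagged as biased
```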

IV. METHODOLOGY
To effectively judge the performance of the proposed tests for different types of SB, we have defined a large portfolio of bias scenarios according to which we can generate an arbitrary number of samples. This set of scenarios is chosen in such a way that all common types of SB discussed in Section II-B are represented. Additionally, these scenarios are parameterized to control the level of bias, which enables us to better judge the robustness of tests.

A. Portfolio of Scenarios
We choose the scenarios to include based on the most common types of SB (see Section II-B) as observed in previous papers.
1) Bias toward the center of the search space.
2) Bias toward the bounds of the search space.
3) Bias toward certain parts of the search space, forming clusters.
4) Bias toward avoiding certain parts of the search space, creating gaps.
5) Strong discretization.
In total, this gives us 11 distinct scenarios (13 for the across-dimension case); the full list of these scenario definitions and the parameters used in this article is shown in Table I. An example of the density of several parameterizations of the Cut Normal scenario is shown in Fig. 1.
In Fig. 2, we show that there exists minimal overlap between the densities of the different parameterizations of these scenarios by collecting 10 000 samples and performing pairwise KS tests. Note that these samples are aggregations of multiple independent dimensions, so for Gaps this means that each set of 100 samples has its own gap-centers, which explains why it is seen as similar to Uniform.
In total, this gives us 194 scenarios to consider in the per-dimension case, and 249 scenarios in the across-dimension case. For each of these scenarios, we generate data with sample sizes {30, 50, 100, 600}. In the per-dimension case, we collect 1500 independent sets of samples per scenario and sample size, and in the across-dimension case 100 such sets. For each of the generated sets of samples, we apply the corresponding test battery from Section III with α ∈ {0.05, 0.01}: 36 tests for the per-dimension case and 32 for the across-dimension case. Using this setup, we thus collect 194 · 1500 · 4 · 36 ≈ 4.19 × 10^7 test statistics/p-values for the per-dimension tests, and 249 · 100 · 4 · 32 ≈ 3.19 × 10^6 for the across-dimension tests.
We show an example of the set of statistical tests applied to an instance of the Cut Normal scenario in Fig. 3. This figure shows the rejections for each dimension individually, as well as the corresponding sample on which this decision is based. This visualization is available as part of the toolbox as described in Section II-C, and provides a visual way to inspect the SB present in the scenario.
For the analysis of the per-dimension tests, we do not directly apply multiple-comparison correction methods. However, in Section V-D, we analyze the effect of different correction methods on the overall false positive rate and select which method to use in practice.

V. ANALYSIS OF STATISTICAL TESTS APPLIED TO SCENARIOS

This section analyses the statistical tests presented in Section III in terms of their suitability to be included in the BIAS toolbox for the purpose of detecting SB.

A. Robustness of Tests to Scenario Parameters
To analyze the effect of different parameterizations of the selected scenarios on the difficulty of detecting bias, we can consider the overall number of rejections for a single test across all parameter settings. An example for the Inverse Cauchy scenario is shown in Fig. 4. In this figure, we clearly see that extreme parameter settings (highly shifted mean and low variance) are always detectable by the AD test, while the tAD test performs much better when the distribution is centered.
While this granular view of results can give important insights, it is impractical to use this low-level view to investigate the different impacts of our experiment settings on the final rejections. For the sake of completeness, the figures for all scenarios and all experimental settings have been uploaded at [56].

B. Sample Size
To study the impact of the available sample size on the overall performance of different statistical tests, we can aggregate the number of rejections over all parameterizations of each scenario. This allows us to show the fraction of cases of a scenario which are rejected by each test given a certain sample size. Fig. 5 shows this for the Cut Normal scenario. From this figure, we can see that the effect of sample size is not the same across all tests. As an example, the AD test has a relatively high number of rejections at 30 samples, but does not reach the same precision as other tests when increasing sample size to 600. This indicates that analysis of the performance of the tests should take the number of available samples into account, as this will influence which tests are more distinguishing.
From Fig. 5, we can also see that the Moran test significantly outperforms any others on this scenario, but even this test does not reject all cases when the sample size is small. This reinforces the notion that if possible, increasing the sample size is beneficial to the ability to detect less clear cases of SB. However, we also note that for most scenarios, a sample size of 50 seems to be sufficient to detect the presence of SB. While increasing the sample size would increase the ability to detect less obvious cases of SB, N = 50 should be able to correctly identify the most blatant ones.

C. Overall Analysis
With the rejection data, we can investigate the interplay between the statistical tests and the scenarios, in order to find which set of tests is most suitable for each kind of SB. For this analysis, we make use of the concept of Shapley values [57] to assess the contribution of each test to a portfolio of tests for finding bias in each type of scenario. In particular, we define the marginal contribution of a test t to a portfolio of tests T (a subset of the full set of tests) on scenario S as

MC(t, T, S) = (1/n) · Σ_{i=1}^{n} [ 1_α(T ∪ {t}, s_i) − 1_α(T, s_i) ]    (2)

where n is the number of realizations s_i of scenario S, and the indicator function 1_α(T, s_i) equals 1 if at least one test in T rejects the null hypothesis with significance α on the data from realization s_i; the summand is thus 1 exactly when the test t rejects while no test in T does. Based on this definition of marginal contribution, we can compute approximate Shapley values by sampling random permutations and calculating the marginal contribution of each test at its position within the permutation [58], [59]. This can be written as

Sh(t, S) ≈ (1/r) · Σ_{j=1}^{r} MC(t, π_j^{<t}, S)

where r is the number of sampled permutations π_j (each truncated to length m), π_j^{<t} denotes the set of tests preceding t in π_j, and m is the maximum size of these permutations, which is introduced to ease computation and because the impact of larger permutations on the total sum is relatively minor; in this article, we set m = 10.
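This permutation-sampling procedure can be sketched as follows. The boolean rejection matrix, function names, and parameter defaults below are illustrative, not part of the toolbox; a portfolio is taken to reject a realization if any of its tests does.

```python
import random

def shapley_estimates(R, n_tests, r=200, m=10, seed=0):
    """Approximate Shapley values by sampling random test permutations.

    R is a boolean matrix: R[i][t] is True iff test t rejects realization i.
    Each permutation is truncated to length m to limit computation; tests
    beyond position m contribute nothing for that permutation.
    """
    rng = random.Random(seed)
    n = len(R)
    shap = [0.0] * n_tests
    for _ in range(r):
        perm = rng.sample(range(n_tests), n_tests)[:m]
        rej = [False] * n  # rejections of the (initially empty) portfolio
        for t in perm:
            new_rej = [rej[i] or R[i][t] for i in range(n)]
            # Marginal contribution: fraction of realizations newly rejected
            # by adding test t to the tests preceding it in the permutation.
            shap[t] += sum(int(new_rej[i]) - int(rej[i]) for i in range(n)) / n
            rej = new_rej
    return [s / r for s in shap]
```

In a toy setting with one test that rejects every realization and one that rejects none, the estimate assigns the full contribution to the first test and zero to the second, as expected.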
Since the used definition of marginal contribution is commutative, we can sum the individual Shapley values for each scenario to obtain the overall Shapley values across scenarios. These values are shown in Fig. 6, where we can see that most tests make a very comparable contribution to the overall rejections, with relatively few outliers in both the positive (e.g., Moran) and negative (e.g., MPD-min) sense. Because of this, we consider all tests to have their uses in the portfolio and refrain from removing any of them from consideration.
Additionally, we can consider the per-scenario Shapley values to find a relation between the statistical tests and the scenarios they can most effectively distinguish. This is visualized in Fig. 7. There are some clear patterns visible in this figure, e.g., for the scenarios which mimic poor discretization, where the Moran and Cressie1 tests have a very high contribution, while their impact on the other scenarios seems to be relatively minor. This highlights the benefit of having a large portfolio of different statistical tests to identify SB.
Overall, we have seen that no single test is clearly preferable over all others. Moreover, an analysis of the Kendall-Tau [60] correlations between the rejections of tests across all scenarios shows that very few tests are highly correlated. Fig. 8 shows the correlation heatmaps for two representative settings of sample size and α. We can observe relatively higher correlations among some of the tests listed from NS to Greenwood, and among some of the tests listed from QM to Pardo, in all settings. However, these higher correlations involve very few of the tests, and overall they tend to become slightly weaker as the sample size gets smaller. Moreover, the correlations among the other tests are very low for the largest sample size of 600. Therefore, it is likely that different tests are best suited to recognizing different types of deviation from uniformity. For this reason, we include all considered tests in the BIAS toolbox.

D. Multiple Comparison Correction Methods
For the per-dimension tests, we should take into account the fact that multiple tests are being performed, and thus the p-values should be adjusted using a correction procedure. For this purpose, we consider three methods: 1) BH [61]; 2) Benjamini-Yekutieli (BY) [62]; and 3) Holm-Bonferroni (Holm) [63]. For each of these methods, we consider their impact on 30-D samples of the uniform distribution to judge the false positive rate. On the left of Fig. 9, we see the false positive rate relative to the used α values (defined as "at least 1 rejection in the 30-D sample"). This figure clearly shows the need for a multiple comparison correction method, given that the false positive rate is well above α (shown as dotted lines) when no correction is used. We also note minor differences between the correction methods considered, with BY leading to the lowest false positive rate.
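To make the three correction procedures concrete, the sketch below gives a minimal pure-Python implementation following their standard definitions; it is an illustration, not the toolbox's implementation. On the same input p-values, BY is visibly the most conservative, consistent with it producing the lowest false positive rate in Fig. 9.

```python
def holm(pvals, alpha=0.05):
    """Holm-Bonferroni: step-down; reject while p_(k) <= alpha / (m - k)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject

def bh(pvals, alpha=0.05, by=False):
    """Benjamini-Hochberg step-up procedure. With by=True, the
    Benjamini-Yekutieli variant divides alpha by the harmonic
    number c(m) = sum_{i=1..m} 1/i."""
    m = len(pvals)
    c = sum(1.0 / i for i in range(1, m + 1)) if by else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest k with p_(k) <= k * alpha / (m * c).
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k * alpha / (m * c):
            k_max = k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

For example, with p = [0.01, 0.02, 0.03, 0.04] at α = 0.05, BH rejects all four hypotheses, Holm rejects only the smallest, and BY rejects none.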
Similarly, we can consider the overall false negative rate as an aggregation over all scenarios, as is done in the right part of Fig. 9, which shows the 15 tests with the most rejections (aggregated over the different sample sizes); note that the negative space above each bar (1 − x) is equivalent to the false negative rate of the test. This figure shows that the impact of the choice of multiple comparison correction method on the false negatives is relatively minor. Because of this, we set the BY method as the default, since it is better able to produce a false positive rate below α.

E. Results on Correlated Samples
Within the set of scenarios, there are two sets which are enabled only for the across-dimension tests. These scenarios are the Consistent versions of the existing scenarios Gaps and Clusters. In Fig. 10, we show the difference in test rejections between these two versions of the scenario. It is clear from the figure that aggregating samples from clusters with different instantiations on each dimension removes the ability of many tests to detect SB, while the across-dimension tests such as MDDLUD do not have this issue. Additionally, since the spacing tests aggregate spacings per dimension instead of samples, their effectiveness is not reduced when the clusters are inconsistent. For the sake of brevity, the other results on the across-dimension tests are not discussed here, but the relevant figures and data are made available at [56].

VI. ESTIMATION OF THE SB TYPE
Since we use the results of many statistical tests to find bias in artificially generated samples, and different tests may be better at capturing different deviations from uniformity, we can use these tests not only to check whether SB is present, but also to identify the most likely form of bias. This provides an answer to RQ2. To achieve this, we build an RF model that takes as input the test rejections from all per-dimension tests. This allows scaling to arbitrary dimensions while having a single model for all sample sizes: if we used the statistical test values directly, we would need one model per sample size and a way to aggregate the resulting predictions, whereas an RF based on rejections only needs to deal with the aggregation problem.
The data used to train and evaluate the RF model consists of the full set of scenario results (per-dimension version) on all tests, with the output being the scenario type it comes from. However, if for a specific sample no test rejects the null hypothesis, the sample is discarded, since we have no evidence of SB. This two-stage approach leaves us with 1 158 000 biased samples, on which we train the RF model with 100 trees and balanced class weights. For the sake of clarity and reproducibility, we make the training dataset available in the online repository [55]. A confusion matrix created from an 80-20 test split is shown in Fig. 11 (F1-score of 0.56). Similarly to Fig. 2, we see that the distinction between the Cut Uniform and the other scenarios can be challenging to detect accurately. However, this does not have to be an issue for the practical detection of SB, since the scenarios misidentified as Cut Uniform might show similar types of bias, even though their initial creation mechanisms are different.
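A minimal sketch of this training setup using scikit-learn is given below. The synthetic rejection vectors, array shapes, and class labels stand in for the real scenario data and are purely illustrative; only the two-stage filtering and the RF configuration (100 trees, balanced class weights, 80-20 split) follow the setup described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Illustrative stand-in for the real data: binary rejection vectors
# (one entry per per-dimension test) labeled with the generating scenario.
n_samples, n_tests = 2000, 36
X = rng.integers(0, 2, size=(n_samples, n_tests))
y = rng.integers(0, 5, size=n_samples)  # hypothetical bias-type classes

# Stage 1: discard samples in which no test rejected (no evidence of SB).
mask = X.sum(axis=1) > 0
X, y = X[mask], y[mask]

# Stage 2: train the RF on the remaining (biased) samples.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
score = f1_score(y_te, pred, average="macro")
```

On real scenario data, `score` and a confusion matrix over the held-out split correspond to the evaluation reported in Fig. 11; on the random placeholder data above, the score is of course near chance level.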
To provide a more practical estimation of SB in our toolbox, we create an additional model to predict the type of bias, as shown in the final column of Table I. These five categories are more distinct from each other, removing the overlap between similar classes, e.g., between Spikes and Noisy Spikes. Overall, this model gives us an improved F1-score of 0.79 on a similar 80-20 split.
To use these models in the BIAS toolbox to predict the bias of multidimensional samples, we need to perform some aggregation across dimensions to transform the per-dimension rejections into a binary vector. We calibrate this aggregation by checking the number of false positive tests in 30-D uniform samples. We run 10 000 simulations, in which we record the maximum number of rejections by each test. This gives us a total of 92 cases where a test gives two rejections, and two cases where a test gives three rejections. As such, we set the threshold for the aggregation of multidimensional data to 0.1 · n. If no test is rejected after this aggregation, we consider the samples to be nonbiased. This threshold value is then used to create the binary input vector for the RF model.
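The aggregation rule above can be sketched as follows; the function name and matrix layout are illustrative, but the 0.1 · n threshold follows the text.

```python
import numpy as np

def aggregate_rejections(rejections, threshold_factor=0.1):
    """Aggregate per-dimension test rejections into a binary vector.

    rejections: boolean array of shape (n_dims, n_tests), where entry
    (d, t) is True iff test t rejected uniformity on dimension d.
    A test counts as rejecting overall only if it rejects on more than
    threshold_factor * n_dims dimensions.
    """
    rejections = np.asarray(rejections, dtype=bool)
    n_dims = rejections.shape[0]
    counts = rejections.sum(axis=0)  # rejections per test across dimensions
    # An all-zero result means the samples are considered nonbiased.
    return (counts > threshold_factor * n_dims).astype(int)
```

With n = 30 dimensions, a test must reject on more than 3 dimensions to count, which filters out the occasional spurious rejections (at most two or three per test) observed in the 10 000 uniform simulations.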
To verify that this works for other dimensionalities as well, and to gauge the overall performance of the toolbox, we simulate the false positive and false negative rates. This is achieved by sampling (with replacement) from the set of test statistics on each of our used scenarios and applying this aggregation rule. For false positives, this is done 100 000 times on the (true) uniform data, while for false negatives it is done 10 000 times on every nonuniform scenario. The results, shown in Fig. 12, indicate that while the 0.1 · n threshold is rather conservative on higher dimensionalities, the FPR is well below the selected α = 0.01, while the FNR is not needlessly increased.

VII. BENCHMARKING SB OF REAL ALGORITHMIC DATA
This section benchmarks a large set of heuristic optimization algorithms by applying the BIAS toolbox, answering RQ3.

A. Data Collection Setup
We use data from a heterogeneous pool of heuristics executed on f 0 at dimensionality n = 30 for a maximum of 10 000 · n fitness function calls. In total, we consider 432 optimization heuristics, which fall into the following categories (all categories use N = 100 runs each, except the last, which uses N = 50 runs).
3) Single-solution algorithms (60 configurations). 4) Variants of Genetic Algorithms (96 configurations). For the sake of clarity and reproducibility, the exact composition and set-up of these algorithmic configurations is described fully in a dedicated document available from [64].

B. Results
For each of the considered algorithm configurations, we collect their final positions and feed these into the BIAS toolbox. In Fig. 13 (left side), we show the outcome of the RF model predicting the type of SB present in the different GA configurations (only the biased ones are shown). This shows that there are considerable differences in the detected bias, even within this limited algorithm design space. It is also interesting to note that the population size seems to have a relatively small impact on the type of predicted bias, which seems to be mostly determined by the operator configuration.
For the single-solution algorithms, we see in the right part of Fig. 13 that the strategy for dealing with infeasible solutions seems to drastically change the type of detected bias. For example, the Powell algorithm is classified as showing "discretization" bias when using the mirror strategy, while the classification changes completely with a COTN strategy. Such differences can give us useful insight into the effect of these SDIS methods on the optimization behavior of these algorithms.

Fig. 11. Confusion matrix for the RF model trained on test rejections, aggregated over all sample sizes. The true scenario is shown on the y-axis, while the predicted scenario is on the x-axis.

VIII. CONCLUSION
Behavior-based benchmarking is a great way to better understand heuristic algorithms. We have proposed a new behavior-based benchmark, BIAS, to detect different scenarios of SB in optimization algorithms. The BIAS toolbox consists of 36 per-dimension tests and 32 across-dimension tests to detect SB (RQ1). These statistical tests were selected based on an extensive literature review; some of them were specifically designed for this toolbox by the authors. The tests are compared and analyzed using different sample sizes and hyper-parameters to detect 11 different scenarios of SB. In addition to the tests, a generator of the SB scenarios is provided, as well as two machine learning models to predict the scenario of SB using the 11 scenarios mentioned above (RQ2).
From the results of the analysis, it is clear that the toolbox performs very well in detecting SB in the generated distributions. The machine learning models show some confusion between similar SB scenarios, such as Gaps versus Loose Clusters and Spikes versus Noisy Spikes, but overall they perform well enough to suggest the type of SB.

Fig. 13. Predicted SB class probabilities of the biased GA configurations (left, sorted alphabetically) and the biased single-solution algorithms (right), using the RF model. Names for the GAs are structured as mutation-crossover-selection-SDIS-population size. For the single-solution algorithms, the character in brackets refers to the used SDIS.
In addition, we provided the results of BIAS on a large set of optimization algorithms, including variants of Genetic Algorithms and Differential Evolution, as well as compact and single-solution algorithms (RQ3). The results show different kinds of SB in existing heuristic optimization algorithms. The BIAS toolbox, including the f 0 function, the statistical tests, the SB scenario generator, and the RF models, is provided open-source [4]. We recommend using the BY correction method when applying BIAS, since this correction method is fast to compute and more conservative than BH. We furthermore recommend a minimum sample size of 50, since in most cases it is sufficient to detect SB for all scenarios, as can be seen from the Genetic Algorithm results. However, some mild SB scenarios could still go undetected. If time and computation power permit, a sample size of 100 would be best.
For future research, it would be interesting to explore different machine learning models to see if the predictions of the type of SB can be improved. Additionally, decreasing the number of statistical tests could improve the execution time required to run the benchmark, and may not necessarily decrease accuracy of the BIAS toolbox.