Subset Sensor Selection Optimization: A Genetic Algorithm Approach With Innovative Set Encoding Methods

The sensor subset selection problem is a crucial task in the field of sensor systems, where the goal is to choose a subset of sensors from a larger pool such that the spatial estimation representation closely approximates the performance of using all the sensors while minimizing the number of sensors required. This is a challenging problem involving trade-offs between performance, cost, and complexity. In this work, a novel approach is proposed, based on set encoding and a genetic algorithm (GA), for addressing the sensor subset selection problem. Our method incorporates a spatial algorithm to minimize the mean squared error (mse) between the reference field and the selected subset of sensors. The efficiency of this approach is demonstrated by optimizing the sensor selection based on the water table (WT) variable using the dataset from the Savannah River Site (SRS) F-Area, a Department of Energy monitored location with groundwater contamination issues. The proposed solution introduces two new crossover methods [complementary set crossover (CSC) and uniform set crossover (USC)] and two new mutation methods [inclusive set mutation (ISM) and exclusive set mutation (ESM)] that work with the novel set encoding. The effectiveness of this method is compared with the previously used greedy strategy, showing considerable improvements. The implemented algorithm delivers substantial performance for a 33%, 50%, and 75% decrease in the number of sensors, achieving an $R^2$ higher than 0.98. This indicates an extremely accurate depiction of the ground truth field using the optimized sensor subset.


Aurelien Meray, Roger Boza, Masudur R. Siddiquee, Cesar Reyes, M. Hadi Amini, Senior Member, IEEE, and Nagarajan Prabakar

Index Terms—Genetic algorithm (GA), optimization, sensor subset selection, set encoding, spatial estimation.

I. INTRODUCTION
I N TODAY's data-driven world, effective and accurate data gathering is pivotal across a spectrum of sectors, particularly in areas like environmental monitoring, sensor networks, and remote sensing. Deploying the right sensors in the exact quantities needed is a key challenge that practitioners often grapple with. These sectors exhibit a high level of complexity and variability, and the sensors deployed play a critical role in the successful operation of these networks. In this modern landscape where efficiency, cost reduction, and accuracy are primary concerns, optimizing sensor deployment has become a priority.
The problem of subset sensor selection is a combinatorial optimization problem that arises in many fields, such as environmental monitoring, sensor networks, and remote sensing. In this problem, the aim is to select a subset of sensors from a larger set that can best represent the underlying field of interest. This problem has significant practical applications, as it allows us to reduce the number of required sensors while maintaining a high level of accuracy in estimating the field.
In previous work, the same problem was tackled using a greedy approach on the Savannah River Site (SRS) F-Area dataset [1]. That work focused on addressing both the number of sensors and the optimal set. In this work, however, the focus is on addressing the subset sensor selection problem strictly by solving for an optimal set of sensor locations rather than finding the number of sensors. Specifically, the goal is to find the optimal subset of sensors that can best represent the ground truth field by optimizing the selection of sensor locations.
The advancement of computational techniques has enabled the utilization of more elaborate optimization strategies to solve complex problems such as sensor selection. Genetic algorithms (GAs) are emerging as a powerful tool that offers an opportunity to tackle the intricate issue of the subset sensor selection problem more efficiently and potentially more effectively [2].
In this article, an approach is proposed to solve the subset selection problem using a GA with a novel set encoding method. By leveraging the strengths of GAs and set encoding, our approach addresses the challenges of subset sensor selection and offers a robust and efficient solution. The contributions to this field can be summarized as follows.
1) A novel encoding method is introduced for the GA to find the optimal subset of sensors.
2) The approach offers a robust and efficient solution for the subset sensor selection problem.
3) Through extensive simulations, improved performance over other methods documented in the literature is demonstrated.
The new method developed seeks to efficiently find not only the optimal subset of sensors but also to maintain high accuracy and performance in various applications.
The remainder of this article is organized as follows. Section II delves into an overview of related studies on sensor selection methods, while Section III offers a formal definition of the subset selection problem. In Section IV, the proposed approach—a GA with set encoding—is introduced to solve the subset sensor selection problem. Section V outlines the dataset employed in the research, and Section VI details the experimental setup, including the hardware and software utilized, as well as the parameters for the GA. Section VII presents the results and engages in a comprehensive discussion. Finally, Section VIII provides a conclusion that encapsulates the key findings of our study. These findings are especially important in today's high-tech environment, where the efficient deployment of sensor networks is of pivotal importance.

II. RELATED WORKS
Existing methods for sensor selection found in the literature can be broadly classified into two categories: heuristic methods and optimization-based methods.
Heuristic methods, like the one proposed by Meray et al. [1], are based on simple rules or heuristics that aim to select a subset of sensors that can best represent the field. Although heuristic methods are computationally efficient, they may not always provide the optimal subset of sensors due to their reliance on simple rules.
In contrast, optimization-based methods have become increasingly popular for providing more effective solutions by solving optimization problems. This exploration includes different optimization algorithms that have been applied to the sensor selection problem and how they relate to the proposed approach.
Qian et al. [3] used Pareto optimization to demonstrate the potential of multiobjective optimization for complex sensor selection problems. Similarly, Hu et al. [4] applied a GA to balance both sensor cost and monitoring performance, highlighting the importance of considering multiple objectives in sensor selection. Hojjati et al. [5] addressed the sensor selection problem for large-scale networks using convex optimization and GAs, showcasing the applicability of optimization-based methods for various scales of networks.
Energy constraints have also been a focus in sensor selection research. Sun et al. [6] proposed a cross-entropy optimization method for minimizing total energy consumption in large-scale wireless sensor networks (WSNs) while maintaining coverage and connectivity. Meanwhile, Zhou et al. [7] combined GA and local search for efficient sensor selection in structural health monitoring systems, emphasizing accurate and efficient sensor selection in critical applications.
In the context of WSNs, a critical factor alongside sensor selection is effectively maintaining coverage and connectivity across the network [8].Machine learning (ML) techniques address this challenge, optimizing the balance between fewer sensor nodes, coverage, and connectivity, which directly influences data accuracy, energy consumption, and overall network longevity [8], [9].
Despite our focus on single-objective optimization in this study, the burgeoning interest in multiobjective optimization within the broader context of sensor selection warrants further discussion, particularly as more complex multiobjective problems continue to garner the attention of researchers. One notable example is the multiobjective optimization tool chain for the 3-D indoor beacon placement problem [10]. This innovative approach explores indoor beacon placement from a new perspective by using 3-D point-cloud data and the NSGA-II algorithm to generate optimal beacon configurations considering cost, accuracy, and localization coverage. This further demonstrates the potential for expanding optimization methods in sensor selection, particularly for indoor localization applications.
The decision to utilize the GA emerged from its demonstrated success in a variety of sensor selection problems. GAs are well-suited for both single-objective and multiobjective optimization problems, and can efficiently search large solution spaces, making them suitable for sensor selection tasks. In the current study, we focus on a single objective of minimizing the mean squared error (mse). However, we chose the GA with an eye toward future work that may incorporate multiobjective optimization problems. Additionally, GAs are robust and can adapt to changes in the problem, allowing for the inclusion of new objectives or constraints as needed.
The studies presented demonstrate the versatility of optimization-based methods in addressing the sensor selection problem. Through analysis of these approaches, insights into the challenges and opportunities in sensor selection can be gained and applied to design better algorithms for the proposed approach, with a focus on the use of GAs.

III. SUBSET SENSOR SELECTION PROBLEM
In this section, the subset sensor selection problem is introduced, which is the main focus of our proposed approach. Consider a set of $N$ sensors placed in a 2-D domain $D$. Each sensor is characterized by a unique index $i \in \{0, 1, \ldots, N-1\}$ and a corresponding scalar measurement $y_i$. Let $E(\cdot)$ be a spatial estimation function that takes as input a set of sensor measurements and outputs an estimated field. Specifically, $E(\cdot)$ maps a set of $n$ measurements $\mathbf{y} = [y_0, y_1, \ldots, y_{n-1}]^T$, where $n \le N$, to a vector of length $|D|$, representing the estimated field at each point in the domain $D$. The estimated field is then represented as $p_n = E(\mathbf{y})$.
The goal is to optimize the selection of a subset of $X$ sensors from the total set of $N$ sensors. We want to find the subset $S \subset \{0, 1, \ldots, N-1\}$, where $|S| = X$, that best replicates the ground truth field. The ground truth field is represented by the estimated field $p_N = E(\mathbf{y}_N)$, where $\mathbf{y}_N = [y_0, y_1, \ldots, y_{N-1}]^T$.
The quality of the estimated field $p_S$ obtained from the subset $S$ is measured using the mse between $p_S$ and the ground truth field $p_N$. The objective function is given by

$$S^{*} = \arg\min_{S \subset \{0,\ldots,N-1\},\; |S|=X} \mathrm{mse}(p_N, p_S) \quad (1)$$

where the mse is defined as

$$\mathrm{mse}(p_N, p_S) = \frac{1}{|D|} \sum_{k=0}^{|D|-1} \left(p_{N,k} - p_{S,k}\right)^2 \quad (2)$$

and $p_{N,k}$ and $p_{S,k}$ denote the estimated values of the field at the $k$th point in the domain, obtained from $p_N$ and $p_S$, respectively. The solution to this optimization problem lies within the $\binom{N}{X}$ combinations of subsets of size $X$ of the $N$ sensors.
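As a concrete numerical illustration of this objective, the sketch below evaluates the mse of one candidate subset against the all-sensor field. The inverse-distance-weighting estimator, the synthetic measurements, and the example subset `S` are all illustrative assumptions; the paper's actual spatial estimator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the spatial estimator E(.): inverse-distance weighting.
# This is an assumption for illustration, not the paper's estimator.
def estimate_field(sensor_xy, sensor_vals, grid_xy, eps=1e-9):
    d = np.linalg.norm(grid_xy[:, None, :] - sensor_xy[None, :, :], axis=2)
    w = 1.0 / (d + eps)                     # closer sensors weigh more
    w /= w.sum(axis=1, keepdims=True)
    return w @ sensor_vals                  # estimated field, one value per grid point

def mse(p_ref, p_sub):
    return float(np.mean((p_ref - p_sub) ** 2))

N = 20                                      # total number of sensors
xy = rng.random((N, 2))                     # sensor locations in the domain D
vals = np.sin(xy[:, 0] * 3) + xy[:, 1]      # synthetic measurements y_i
grid = rng.random((100, 2))                 # evaluation points of D

p_N = estimate_field(xy, vals, grid)        # ground-truth field from all N sensors
S = [0, 3, 5, 7, 11, 13]                    # one candidate subset, |S| = X
p_S = estimate_field(xy[S], vals[S], grid)  # field from the subset only
err = mse(p_N, p_S)                         # the quantity minimized in (1)-(2)
```

The GA's job, sketched in Section IV, is to search over such subsets `S` for the one with the smallest `err`.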

IV. METHODOLOGY
This section presents the methodology for our GA-based optimization strategy to select a subset of sensors in a network. The approach utilizes a novel integer set encoding method that ensures no repetition and disregards order. A fitness function is proposed that minimizes the mse between the predictions made using the selected subset of sensors and using all sensors. Two novel crossover operations, complementary set crossover (CSC) and uniform set crossover (USC), are introduced that adhere to the set encoding scheme while respecting problem constraints. In addition, two mutation methods, inclusive set mutation (ISM) and exclusive set mutation (ESM), are incorporated to enhance the optimization process's diversity and exploration of the search space. The methodology is presented as a flowchart diagram in Fig. 1, which outlines the overall framework of the GA-based optimization strategy. The process consists of initializing a population represented as chromosomes, evaluating fitness using the mse-based fitness function, and iterating through the selection, crossover, and mutation operators until the stopping criterion is met. The optimal subset of sensors is represented by the best chromosome found during the optimization process. Each operation/component is explained in detail in Section IV-A of the methodology.

A. Genetic Algorithm-Based Optimization Strategy
In this section, a GA-based optimization strategy is presented that is specifically designed to tackle the sensor selection problem. Through a series of innovative techniques, including a novel chromosome representation and encoding method, as well as unique crossover and mutation operations, our approach aims to efficiently explore the search space and converge toward an optimal solution for selecting the best subset of sensors from a sensor network.
1) Chromosome Representation and Encoding: In the optimization problem of selecting X sensors from a network of N sensors, there are various encoding methods available, such as binary, gray, and integer encoding [11]. Our proposed method, integer set encoding, shares similarities with integer encoding but offers unique properties that enforce the constraints of the problem.
Sets naturally enforce no repetitions and disregard order, making them an ideal choice for our problem [12]. Each of the $N$ sensors is assigned an integer from 0 to $N-1$. This set of integers represents the universal set $U = \{0, 1, \ldots, N-1\}$. A chromosome $C$ is simply one combination (set) of $X$ sensors chosen from the $N$ available, of which there are $\binom{N}{X}$. This approach ensures that the same sensor cannot be selected twice.
The genes in chromosome $C$ are used to fit a spatial estimation matrix, which helps evaluate the performance of the selected subset $S$ of sensors. By using this chromosome representation and encoding method, it becomes possible to effectively apply a GA to solve the subset sensor selection problem while respecting the constraints and properties of the problem. This novel approach of integer set encoding in GAs demonstrates the potential for innovative solutions to optimization problems.
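A minimal sketch of the integer set encoding follows; the sizes `N` and `X` and the use of Python's standard `random` module are illustrative assumptions. A `frozenset` enforces the two properties the text names: no repetition and no order.

```python
import random

N, X = 50, 11            # illustrative sizes, not the paper's configuration
U = frozenset(range(N))  # universal set of sensor indices

def random_chromosome(universe, size, rng=random):
    # A chromosome is a set of sensor indices: repetition is impossible
    # and order is ignored by construction.
    return frozenset(rng.sample(sorted(universe), size))

C = random_chromosome(U, X)  # one combination of N choose X sensors
```

An initial population is then simply a list of such chromosomes drawn independently.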
2) Fitness Function: The fitness function for the GA evaluates the performance of each chromosome $C$, representing a subset of sensors $S$, in the context of the subset sensor selection problem. Our objective is to minimize the mse as defined in (2).
The fitness function guides the GA in exploring the search space of $\binom{N}{X}$ combinations of subsets of size $X$ of the $N$ sensors and aims to find the optimal subset $S$ that yields the lowest possible mse.
To evaluate the fitness of a given chromosome $C$, the measurements $p_S$ predicted by the regression model using only the sensors in $S$ are first calculated, along with the measurements $p_N$ predicted using all $N$ sensors. The mse between $p_S$ and $p_N$ is then calculated using (2). The resulting mse value represents the fitness of the chromosome, with lower values indicating a better fitness score.
3) Crossover Operations: Crossover operations play a critical role in generating new offspring in GAs. However, traditional strategies such as $N$-point crossover are not directly applicable to our problem since they tend to produce duplicate values, resulting in invalid chromosomes [13], [14]. To overcome this limitation, two novel crossover operations are proposed that adhere to the set encoding scheme while respecting the problem constraints.
Consider two parent chromosomes, $P_1$ and $P_2$, each of cardinality $X$ and with $H$ common elements. The following crossover methods are introduced.
Method 1 (CSC): The CSC combines the common elements of both parent chromosomes, supplementing them with randomly sampled elements from their symmetric difference. Let $\oplus$ denote the symmetric difference and let $\mathrm{Sample}(V, m)$ draw $m$ elements uniformly at random from a set $V$. The CSC can then be mathematically represented as

$$C_1 = (P_1 \cap P_2) \cup \mathrm{Sample}(P_1 \oplus P_2,\; X - H) \quad (3)$$

where $P_1 \oplus P_2$ contains all elements belonging to exactly one parent and therefore has cardinality $2(X - H)$. The child thus always inherits the $H$ common genes and completes its cardinality of $X$ with genes sampled from the remaining candidates in the current search space.

Method 2 (USC): The USC works similar to the bitwise uniform crossover described by Whitley [15] in his GA tutorial. This terminology emanates from the method's ability to sample elements in a uniform manner, displaying no favoritism toward either parent chromosome. The mathematical representation of the USC method is as follows.
Let $P_1$ and $P_2$ be two parent chromosomes of cardinality $X$. $X$ elements are randomly sampled from the set $P_1 \cup P_2$ to obtain the child chromosome $C_2$, such that

$$C_2 = \mathrm{Sample}(P_1 \cup P_2,\; X). \quad (4)$$

Fig. 2 illustrates the two proposed crossover methods. These innovative crossover methods contribute to the effective exploration of the search space while preserving the validity of the chromosomes, thus facilitating the GA's convergence toward an optimal solution for the subset sensor selection problem.
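The two crossover operators can be sketched directly from their set definitions. This is a sketch under the assumption that `Sample` is uniform sampling without replacement (Python's `random.sample`); the parents shown are toy sets, not data from the paper.

```python
import random

def csc(p1, p2, rng=random):
    """Complementary set crossover sketch: keep the H common genes, then
    fill the remaining X - H slots from the parents' symmetric difference."""
    x = len(p1)
    common = p1 & p2
    fill = rng.sample(sorted(p1 ^ p2), x - len(common))  # ^ is symmetric difference
    return frozenset(common) | frozenset(fill)

def usc(p1, p2, rng=random):
    """Uniform set crossover sketch: sample X genes uniformly from the union,
    with no favoritism toward either parent."""
    x = len(p1)
    return frozenset(rng.sample(sorted(p1 | p2), x))

p1 = frozenset({0, 1, 2, 3, 4})   # X = 5
p2 = frozenset({3, 4, 5, 6, 7})   # H = 2 common elements (3 and 4)
child1 = csc(p1, p2)              # always contains {3, 4}
child2 = usc(p1, p2)              # any 5 elements of p1 | p2
```

Both operators return valid chromosomes of cardinality $X$ by construction, which is exactly the property that standard $N$-point crossover fails to guarantee under this encoding.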
4) Mutation Operations: Mutation operations play a crucial role in GAs by introducing diversity into the population, allowing the search process to explore new solutions [16]. Mutation distance (MD) is defined as the cardinality of the symmetric difference between the original chromosome $C$ and the final mutated chromosome $C'$. Two mutation methods compatible with the set encoding are introduced below, and the underlying concepts and equations are further elaborated.
Method 1 (ISM): ISM allows a gene to be mutated from a gene pool with replacement. A mutation mask, $M$, is created, which is a binary string of the same length as $C$. Each element of $M$ has an $\alpha$ probability of being 1 (1 symbolizes a flag for mutation), where $\alpha$ is a hyperparameter of the optimization.
The process involves iterating over the genes flagged for mutation in $C$ and randomly selecting an element from $\bar{C}$, the complement of $C$ (i.e., $U \setminus C$). When a gene is mutated, the elements in $C$ and $\bar{C}$ are swapped, allowing a gene that left $C$ to reappear in $C$ in a future mutation, given that the same element is selected again. This is what is meant by replacement.
For ISM, the minimum MD is 2 (a single swap removes one gene and adds one), and the maximum MD is $2X$, reached when every gene in $C$ is exchanged. An example of ISM operation is shown in Fig. 3.

Method 2 (ESM): ESM allows a gene to be mutated from a gene pool that has no replacement. If an original gene is removed from the chromosome (since it was mutated for another gene), it cannot reappear in a future mutated gene spot. This method functions similar to Method 1, with a mutation mask $M$. The distinction lies in that the gene (element) of $C$ being mutated does not return to $\bar{C}$, as there is no replacement. For exclusive mutation, the maximal number of mutation operations equals the cardinality of $C$: elements to mutate become exhausted if the number of 1's in the mutation mask exceeds the cardinality of $C$. To enforce this condition, the number of 1's in the mutation mask is kept less than or equal to the cardinality of $C$.
The MD for exclusive mutation is therefore $\mathrm{MD} = 2m$, where $m$ is the number of genes flagged for mutation, so that $\mathrm{MD} \le 2X$. An example of ESM operation is shown in Fig. 4. These mutation methods contribute to the diversity of the population and facilitate the exploration of the search space while preserving the validity of the chromosomes.
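The two mutation operators can be sketched as follows. This is an illustrative implementation assuming the swap semantics described above; the universe, chromosome, and `alpha` value are toy assumptions.

```python
import random

def ism(c, universe, alpha=0.25, rng=random):
    """Inclusive set mutation sketch: each gene is flagged with probability
    alpha and swapped with a random element of the complement. The swapped-out
    gene returns to the pool (replacement), so it may re-enter C later."""
    c = set(c)
    comp = set(universe) - c            # the complement, the pool of new genes
    for g in list(c):                   # snapshot of the original genes
        if rng.random() < alpha:
            new = rng.choice(sorted(comp))
            comp.discard(new)
            comp.add(g)                 # with replacement: g goes back to the pool
            c.discard(g)
            c.add(new)
    return frozenset(c)

def esm(c, universe, alpha=0.25, rng=random):
    """Exclusive set mutation sketch: like ISM, but a removed gene never
    returns to the pool (no replacement)."""
    c = set(c)
    comp = set(universe) - c
    for g in list(c):
        if rng.random() < alpha and comp:
            new = rng.choice(sorted(comp))
            comp.discard(new)           # no replacement: g is gone for good
            c.discard(g)
            c.add(new)
    return frozenset(c)

U = set(range(20))
C = frozenset(range(8))
m1 = ism(C, U)
m2 = esm(C, U)
```

Both operators preserve the chromosome cardinality, so every mutated individual remains a valid subset of size $X$.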
5) Selection Mechanism: One of the main stages of any GA is the selection mechanism. During this stage, individual chromosomes are chosen from the population based on their weighted fitness value. The selected chromosomes become the parent chromosomes and are used to create new individuals, or offspring, for the next generation. There are many methods for the selection of individual chromosomes: roulette wheel selection, rank selection, and steady-state selection, among others [17]. For this problem, the roulette wheel selection mechanism was used. In roulette wheel selection, the idea is to create a circular wheel divided into subsections representing each individual chromosome of the population based on its fitness score. Chromosomes with higher fitness scores occupy a larger area of the wheel. By fixing a point and rotating the wheel, the chromosome that lands on the fixed point is chosen to be one of the parents. The benefit of this approach is that individuals with higher scores cover more of the wheel's area and therefore have a higher chance of landing on the fixed point [18]. Additionally, on different spins of the wheel, chromosomes that are not the fittest can still be chosen as parents for reproduction, which enables the algorithm to escape local minima or maxima in the search space. Conventionally, roulette wheel selection is implemented such that the best fit is the individual with the highest fitness value. In this implementation, the fitness value is the mse for a given chromosome, so the fittest individual is the one with the lowest fitness value [19]. To account for this during selection, the inverse of the fitness value is computed as $1/\mathrm{mse}(p_N, p_S)$. This adjustment facilitates the utilization of the original roulette wheel selection process.
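The inverse-fitness roulette wheel can be sketched as below. The population and mse scores are toy values; a cumulative-sum walk over the weights plays the role of spinning the wheel.

```python
import random

def roulette_select(population, mse_scores, rng=random):
    """Roulette-wheel selection sketch with inverted fitness: a lower mse
    yields a larger slice of the wheel, per the 1/mse weighting in the text."""
    weights = [1.0 / m for m in mse_scores]
    total = sum(weights)
    spin = rng.random() * total   # the "fixed point" on the spun wheel
    acc = 0.0
    for chrom, w in zip(population, weights):
        acc += w
        if spin <= acc:
            return chrom
    return population[-1]         # guard against floating-point rounding

pop = ["A", "B", "C"]
scores = [0.10, 0.50, 1.00]       # "A" has the lowest mse, so the biggest slice
parent = roulette_select(pop, scores)
```

Over many spins, "A" is selected far more often than "C", yet "C" retains a nonzero chance, which is the property that lets the GA escape local optima.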
6) Reinsertion: The process of reintroducing newly created offspring into the population is known as reinsertion. This step is crucial in the GA process as it involves replacing the chromosomes with the lowest fitness values. By doing so, the GA ensures that the fittest individuals stay in the population while allowing room for genetic diversity and exploration. After reinsertion, the next generation is produced through a process of selecting, crossing over, mutating, and reinserting new chromosomes with a reinsertion rate denoted by β. This iterative process continues for multiple generations until the stopping criterion is met. At each generation, the weakest individuals are replaced by fitter ones, leading to progressively better solutions [20]. Through the crucial step of reinsertion, the GA maintains a delicate balance between preserving the fittest individuals in the population and introducing genetic diversity, enabling it to efficiently converge toward optimal solutions over multiple generations.
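A reinsertion step of this kind can be sketched as follows. This is an assumed elitist variant (worst individuals replaced by the best offspring, with fitness = mse, lower is better); the exact replacement rule of the paper's implementation is not specified here.

```python
def reinsert(population, fitnesses, offspring, off_fits, beta=0.5):
    """Elitist reinsertion sketch: replace the worst beta-fraction slots
    with the best offspring. Fitness values are mse, so lower is better."""
    k = int(beta * len(offspring))                      # how many offspring enter
    ranked = sorted(zip(fitnesses, population), key=lambda t: t[0])
    new_ranked = sorted(zip(off_fits, offspring), key=lambda t: t[0])
    survivors = ranked[: len(population) - k] + new_ranked[:k]
    fits, pop = zip(*survivors)
    return list(pop), list(fits)

pop = ["a", "b", "c", "d"]
fits = [0.1, 0.4, 0.9, 1.2]       # mse per chromosome; "c" and "d" are weakest
kids = ["e", "f"]
kid_fits = [0.2, 0.05]
new_pop, new_fits = reinsert(pop, fits, kids, kid_fits, beta=1.0)
```

With `beta=1.0`, the two weakest individuals ("c" and "d") are dropped and both offspring enter, while the elite "a" and "b" survive unchanged.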

B. Time Complexity Comparison and Analysis
In this section, the time complexity of three methods for solving the problem of selecting the best subset of sensors is compared. These methods are brute force, greedy, and our proposed GA. Our analysis shows that while the brute force method has exponential time complexity and the greedy method has polynomial time complexity, our GA provides better solutions despite its higher time complexity.
1) Brute Force: The brute force method involves considering all possible subsets of size $X$ out of $N$ features and evaluating the performance of the model on each subset. Its worst-case time complexity is $O\big(\binom{N}{X}\big)$, because every combination of $X$ elements from a set of $N$ elements must be enumerated and evaluated. In practice, this method becomes infeasible for large values of $N$ and $X$.
2) Greedy: The greedy method starts with an empty set of features and iteratively adds the best feature until a subset of size $X$ is obtained. The performance of the model is evaluated after each addition. The number of evaluations can be expressed as a polynomial of degree 2 in $X$, i.e., $\sum_{i=1}^{X} i = X(X+1)/2$. Therefore, the worst-case time complexity of the greedy method is $O(X^2)$, since $X^2$ is the highest-order term of the polynomial. The greedy method is more efficient than the brute force method, but it still has limitations in terms of scalability.
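The greedy baseline can be sketched generically. The `cost` callable below is a hypothetical stand-in for the mse of a subset (here a toy arithmetic score), used only to show the one-sensor-at-a-time selection loop.

```python
def greedy_select(candidates, x, cost):
    """Greedy forward selection sketch: repeatedly add the single sensor
    whose inclusion yields the lowest cost for the growing subset."""
    chosen = []
    remaining = list(candidates)
    while len(chosen) < x:
        best = min(remaining, key=lambda s: cost(chosen + [s]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy cost standing in for mse: distance of the subset's mean index from 5.
cost = lambda subset: abs(sum(subset) / len(subset) - 5.0)
picked = greedy_select(range(10), 3, cost)
```

Each of the $X$ rounds scans the remaining candidates once, which is the source of the polynomial cost above; the loop, however, commits to each choice permanently, which is why greedy selection can lodge in local optima that the GA escapes.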
3) Genetic Algorithm: Our proposed GA involves evolving a population of solutions over multiple generations using mutation, crossover, and selection operators. The time complexity of a single generation can be broken down into the costs of the selection, crossover, mutation, and reinsertion steps described earlier, and the algorithm must be run for $G$ generations. The total time complexity of the GA can thus be simplified to $O(P \cdot G \cdot X \cdot \alpha)$. Here, $P$ is the population size, $X$ is the length of the chromosome, $k$ is the number of samples requested during the random sampling process of the crossover operation, $\alpha$ is the mutation probability, and $\beta$ is the reinsertion rate. The time complexity of the GA scales linearly with the population size and the number of generations, and it depends on the length of the chromosome, the mutation probability, the number of samples requested during crossover, and the reinsertion rate.
To determine which factor dominates the time complexity, the actual values of $P$, $G$, $\beta$, $\alpha$, and $X$ used in the experiment can be compared. For instance, when using $P = 2000$, $G = 1000$, $\beta = 1$, $\alpha = 0.25$, and $X = 30$, the simplified expression $O(P \cdot G \cdot X \cdot \alpha)$ yields on the order of $1.5 \times 10^{7}$ operations. In general, the simplified expression is a comprehensive and accurate representation of the time complexity of the GA, as it accounts for the relevant factors that affect the computational cost of the algorithm.
Although its time complexity may suggest slower performance, our proposed GA outperforms the other two algorithms in terms of solution quality. This can be attributed to the inherent advantages of the GA, such as its ability to explore a wider search space and avoid local optima. Furthermore, the GA implementation employs a spatial estimation function that reduces the search space and accelerates the convergence of the algorithm.
Although the time complexity of the GA is higher than that of the greedy algorithm, the actual running time of the GA is comparable or even faster in practice due to the faster convergence and better solution quality. This is especially true for large values of $N$ and $X$, where the brute force algorithm is impractical, and the greedy algorithm may get stuck in local optima. In contrast, the GA is able to find global optima with a high probability and can be parallelized to further improve its performance.

V. DATA DESCRIPTION
This study utilizes the F-Area historical dataset as a source to optimize the best locations at which to place a limited number of sensors at the Department of Energy's SRS F-Area. The F-Area is a groundwater-contaminated site with many wells that were drilled with the intention to both: 1) collect water samples to perform measurement testing in a laboratory and 2) place sensors for remote monitoring [21]. This dataset contains over 400 analytes (features) with measured readings over time at many well locations [22]. It is important to note that these recorded samples were measured in the laboratory and do not come from existing sensors; however, this article considers the dataset as ground truth sensor measurements from the F-Area and optimizes the selection of sensor locations strictly from this data source. The scope of the problem is limited to one sensor parameter, the water table (WT) averaged over 2015, both because it matches the setup of the greedy approach of Meray et al. [1] and because it has historically been used as an indicator for plume movement [23]. The same data is used to compare their results with the approach presented in this article.

VI. EXPERIMENTAL SETUP
In this section, the hardware and software utilized for the experiments are described, along with the specific parameters and settings of the GA employed in the analysis.

A. Hardware and Software
The experiments were conducted using a desktop computer with the specifications mentioned in Table I. Python, NumPy, Pandas, and Matplotlib were utilized for coding, analysis, data manipulation, and visualizations, respectively.

B. Genetic Algorithm Parameters and Settings
Experiments were conducted using various combinations of parameters, as detailed in Table II, encompassing population size, number of generations, reinsertion rate, crossover method, mutation method, and mutation probability. Initially, the product of the population size and number of generations (P · G) was set to 10 000. This means that if the population size is set to 100, then the number of generations would be set to 1000, for a total of 10 000 function evaluations. However, the actual number of generations used in our experiments was determined dynamically using our stopping algorithm.
Algorithm 1 outlines our stopping mechanism to determine convergence based on mse history. The mechanism relies on an mse convergence threshold, a user-defined value that indicates the acceptable level of error. The algorithm takes as input the population size, the maximum number of generations, the mse convergence threshold, and the history of mse by generation. It iterates through each generation, calculating the difference between the maximum and minimum mse of the previous $l$ generations, where $l$ is set to $\max(\lfloor 0.2P \rfloor, 20)$. If the difference is less than or equal to the mse convergence threshold, the algorithm stops, reports "MSE converged after i generations," and returns a Boolean flag indicating that the mse has converged. Otherwise, the algorithm continues to iterate through generations until the maximum number of generations is reached. Incorporating a stopping mechanism helps avoid squandering computational resources and time on redundant iterations when the solutions cease to show significant improvement.
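The windowed convergence test described above can be sketched as follows; the simulated mse history is an illustrative assumption, and the sketch implements only the convergence check, not the full Algorithm 1 loop.

```python
def mse_converged(history, threshold, pop_size):
    """Stop when the mse range over the last l generations is within the
    threshold, with l = max(floor(0.2 * P), 20) as described in the text."""
    l = max(int(0.2 * pop_size), 20)
    if len(history) < l:
        return False                      # not enough generations observed yet
    window = history[-l:]
    return (max(window) - min(window)) <= threshold

# Simulated mse history: steady improvement, then a long plateau.
hist = [1.0 - 0.01 * i for i in range(50)] + [0.5] * 30
still_improving = mse_converged(hist[:40], 1e-6, 100)  # False: mse still falling
converged = mse_converged(hist, 1e-6, 100)             # True: flat for 30 gens
```

In the full GA loop, this check would run once per generation and trigger an early exit before the maximum generation count $G$ is reached.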

VII. RESULTS AND DISCUSSION
In this section, the findings obtained from the comparative analysis of the greedy algorithm and the GA in optimizing subset sensor selection are presented and discussed. The performance of the proposed GA approach on the DOE dataset is also explored, evaluating the impact of various methods and parameters on the mse. The results are organized into two main parts: 1) comparing the performance of the greedy algorithm and the GA and 2) evaluating the proposed GA approach on the DOE dataset. Insights are provided into the significance of the mutation probability and the absence of significant differences between the crossover and mutation methods. Finally, the

TABLE III PERFORMANCE COMPARISON BETWEEN GREEDY AND GA
implications of the findings and their potential applications in real-world sensor selection scenarios are discussed.

A. Greedy Versus Genetic Algorithm
The performance of the greedy algorithm from Meray et al. [1] was compared to that of the GA in selecting an optimal set of sensors for the spatial algorithm. To make a fair comparison between the two approaches, the greedy algorithm was run 100 times, with each run generating a solution by randomly selecting the first five sensors (as required by the spatial algorithm). The GA, on the other hand, was run with a population of 100, with the initial population containing the same five randomly sampled sensors. The GA was then run for 100 generations, and the chromosomes from the last generation of this single run were examined. Thus, 100 runs of the greedy algorithm were compared to one run of the GA, yielding 100 individual solutions for the greedy algorithm and 100 chromosomes for the GA.
The comparison between the two approaches is visualized in Fig. 5 and summarized in Table III. The results are shown for three scenarios with 11, 23, and 30 sensors, respectively, including the mse values for each algorithm and the corresponding percentage reduction in the number of sensors. The results show that the GA outperforms the greedy algorithm in selecting the best subset of sensors in all three scenarios. The mean mse values for the GA are consistently lower than those for the greedy algorithm, with percentage reductions ranging from 55.73% to 74.59%. There is also large variability in the greedy performance, whereas the GA produced a much narrower distribution of solutions from the same five initial sensors. This is reflected in the standard deviation of the errors, which is consistently lower for the GA than for the greedy algorithm, indicating that the GA approach provides more consistent results.
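The summary statistics behind this comparison (mean, standard deviation, and percentage reduction in mean mse) can be computed as follows. This is a generic sketch, not the paper's code, and the inputs are placeholder lists standing in for the 100 greedy solutions and 100 GA chromosomes.

```python
# Summarize two samples of MSE values (e.g., 100 greedy runs vs. the final
# 100-chromosome GA population) as in a Table III-style comparison.
import statistics

def compare_mse(greedy_mses, ga_mses):
    g_mean = statistics.mean(greedy_mses)
    a_mean = statistics.mean(ga_mses)
    return {
        "greedy_mean": g_mean,
        "ga_mean": a_mean,
        "greedy_std": statistics.stdev(greedy_mses),
        "ga_std": statistics.stdev(ga_mses),
        # Percentage reduction of the GA's mean MSE relative to greedy.
        "pct_reduction": 100.0 * (g_mean - a_mean) / g_mean,
    }
```

A lower `ga_std` than `greedy_std` would correspond to the narrower GA solution distribution reported above.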

B. Evaluating the Performance of the Proposed Approach on the DOE Dataset
In this section, the performance of the proposed approach to the subset sensor selection problem is examined using the DOE dataset. The focus is on optimizing subsets of size X = 11, X = 23, and X = 30 out of 46 sensors, corresponding to reductions of approximately 3/4, 1/2, and 1/3, respectively. Additionally, the impact of various methods and parameters on the mse is explored.
To determine any statistically significant differences between the parameters from Table II (excluding population and generations), the correlation between these variables and the mse was evaluated. Their statistical significance was also assessed using p-values, with a result considered statistically significant if the p-value is less than 0.05. The primary objective was to ascertain whether there are specific advantages to using one crossover or mutation method over another. Table IV presents the correlation and statistical significance of the mse and the parameters. It emphasizes the strong correlation and statistical significance between the mutation probability α and the mse across all three optimization scenarios, indicating that optimizing the α parameter may lead to further reductions in the mse.
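This kind of analysis can be sketched with `scipy.stats.pearsonr`, which returns a correlation coefficient and its p-value. The function and column names below are assumptions for illustration; the paper does not specify its implementation.

```python
# Correlate each GA parameter with the final MSE and flag significance
# at the 0.05 level, as in the Table IV-style analysis.
from scipy.stats import pearsonr

def significant_params(runs, params, alpha_level=0.05):
    """runs: dict mapping column name -> list of values, including 'mse'."""
    results = {}
    for p in params:
        r, pval = pearsonr(runs[p], runs["mse"])
        results[p] = (r, pval, pval < alpha_level)
    return results
```

Categorical parameters such as the crossover or mutation method would need to be encoded numerically (e.g., as indicator variables) before this test applies.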
For each optimization case, the following holds.
1) X = 11: The minimum mse was 5.55e−05, with α displaying a statistically significant correlation with mse at the 0.05 level (0.339352).
2) X = 23: The minimum mse was 1.40e−05, with no statistically significant correlations observed between mse and any parameters at the 0.05 level.
3) X = 30: The minimum mse was 1.998e−05, with α demonstrating a statistically significant correlation with mse at the 0.05 level (0.689901).
Interestingly, our results did not reveal any significant differences between the crossover and mutation methods. This outcome could be attributed to the limited sample size of 46 sensors; consequently, further studies with larger datasets may be required to validate our findings. Nevertheless, the proposed approach demonstrates potential in addressing the challenges of subset sensor selection, and future studies could aid in refining the optimization parameters and enhancing its overall performance.
A heatmap (see Fig. 6) was generated to visually represent the correlation between the mse and the parameters of the GA. The heatmap confirms the robust correlation between the mse and the α parameter and reinforces the absence of significant differences between the crossover and mutation methods.
The GA's performance in subset sensor selection is thoroughly analyzed through the provided results, with the greedy approach used as a means for comparison. The ground truth estimation map, shown in Fig. 7(a), serves as a basis for evaluating the optimal sensor configurations identified by the GA for 11 [see Fig. 7(b)], 23 [see Fig. 7(c)], and 30 [see Fig. 7(d)] sensors.

Fig. 6. Heatmap illustrating the correlation between mse and the parameters in the proposed approach.

Upon closer examination of Table V and Fig. 8, it becomes evident that the top three GA-generated sensor configurations for 11, 23, and 30 sensors surpass the greedy approach in terms of mse and R² metrics. R² is calculated using the standard coefficient of determination

R² = 1 − [Σᵢ(yᵢ − ŷᵢ)²] / [Σᵢ(yᵢ − ȳ)²]

where yᵢ is the ground truth value, ŷᵢ is the corresponding prediction, and ȳ is the mean of the ground truth values. Interestingly, the mse for the 23-sensor case is lower than that for the 30-sensor case, which does not seem intuitive, since having more sensors should provide more data for the spatial estimation. One possible explanation is that the added sensors are redundant, meaning that they do not provide additional information beyond the sensors already included in the subset. Another possible explanation is that the added sensors introduce noise or measurement errors that outweigh their potential benefits. In this case, the presence of additional sensors might not lead to improved accuracy, as their drawbacks could counteract the benefits they bring to the system. Coincidentally, our GA approach demonstrated better performance using 23 sensors instead of 30, even though the primary objective was not to determine the ideal number of sensors to employ for monitoring, but rather to identify the best sensor subset for a given X.
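The two evaluation metrics used throughout this section, mse and the coefficient of determination R², can be computed over the estimated and ground-truth fields as follows. This is a generic sketch of the standard definitions; the field arrays and function name are illustrative, not the paper's implementation.

```python
# MSE and R^2 between a ground-truth field and a field estimated from a
# sensor subset (flattened or 2-D arrays both work via NumPy broadcasting).
import numpy as np

def field_metrics(truth, estimate):
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((truth - estimate) ** 2)
    ss_res = np.sum((truth - estimate) ** 2)          # residual sum of squares
    ss_tot = np.sum((truth - truth.mean()) ** 2)      # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return mse, r2
```

An R² above 0.98, as reported for the best GA configurations, means the subset-based estimate explains over 98% of the variance in the ground-truth field.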
Notably, the GA attains an mse of 5.55e−05 and an R² of 0.983 for the 11-sensor configuration, whereas the greedy method's best result yields a higher mse of 7.41e−05 and a lower R² of 0.977. This trend is consistent for the 23- and 30-sensor configurations, with the GA continuously achieving lower mse values and higher R² than the greedy approach.

TABLE V TOP THREE RESULTS FOR EACH OPTIMIZED NUMBER OF SENSORS WITH FINAL ENCODED SENSORS EVALUATED USING MSE AND R2 METRICS
Furthermore, the GA's superior performance over the greedy method across all three sensor configurations is emphasized in the comparative bar plot shown in Fig. 8. The GA's effectiveness in minimizing the mse demonstrates the robustness of the proposed algorithm in optimizing the subset sensor selection problem.
Our findings have significant implications for real-world sensor selection scenarios in various industries, including but not limited to environmental monitoring. By implementing the developed techniques, it is possible to optimize sensor networks to monitor spatial domains with fewer sensors, resulting in a more cost-effective, manageable, and sustainable system. In our particular case, this leads to timely decision-making and better environmental monitoring. However, the applicability of our research extends beyond this domain, with the potential to benefit diverse sectors such as healthcare, transportation, and manufacturing. Ultimately, these advancements not only contribute to the enhancement of sensor network management but also foster a greater understanding of potential applications across different industries, paving the way for more resource-efficient and innovative practices.

VIII. CONCLUSION
The sensor subset selection problem plays a vital role in the field of sensor systems, as it aims to maintain high accuracy and performance in a variety of applications by choosing an optimal subset of sensors from a larger set. In this study, a novel approach that combines set encoding and a GA to address this problem was introduced. This method integrates a spatial algorithm designed to minimize the mse between the reference field and the selected subset of sensors. Although generating a reference field relies on the deployment of sensors, the scalability of the method is not constrained by the need for a large number of deployed sensors. The effectiveness of this approach was demonstrated by optimizing sensor selection for the WT variable, utilizing a dataset from the SRS F-Area. The findings suggest that the proposed method holds a competitive edge over existing techniques in terms of both accuracy and efficiency.
Implementation of this method in real-time applications presents certain challenges. Computational complexity stemming from the GA may inhibit fast solutions. Dynamic environments, in which sensor relationships or the underlying fields may change, may necessitate system adaptation. And as the number of sensors and variables rises, the scalability of the approach can become a concern. Potential solutions could involve exploring alternative optimization algorithms for faster performance, incorporating adaptive mechanisms such as online learning or regular updates to the optimization parameters, and applying approximations, parallel processing, or distributed computing for scalability.
Looking forward, the intention is to expand this approach to more complex scenarios such as nonlinear models and multiobjective optimization problems. This extension will involve adjusting the objective function to optimize multiple variables and time steps concurrently, thus improving its multiobjective capabilities. There are also plans to explore additional optimization algorithms that can work in harmony with set encoding for sensor subset selection. Additionally, potential applications of this method across various fields, including environmental monitoring, industrial control systems, and healthcare, will be investigated. Ultimately, the proposed method provides a robust and efficient solution to the sensor subset selection problem, offering significant potential for further advancements and applications across diverse domains.

Fig. 1. Flowchart of the key operations in the GA implementation.

Fig. 5. Comparison of mse values for the greedy algorithm and the GA across different numbers of sensors.
Fig. 7 (bottom row) indicates the spatial difference between the ground truth matrix and the GA-generated prediction maps, illustrating the algorithm's capability to provide accurate estimations while maintaining low error rates.

Fig. 7. Estimation of optimal well configurations for 11, 23, and 30 sensors. (a) Ground truth estimation map. (b)-(d) Best sensor configurations selected with the lowest mse for 11, 23, and 30 sensors, respectively. The bottom row displays the spatial difference between the ground truth and the generated prediction map.

Fig. 8. Comparison of the performance of the greedy algorithm and the GA for 11-, 23-, and 30-sensor configurations shown in a bar plot.

TABLE IV CORRELATION AND STATISTICAL SIGNIFICANCE OF MSE AND PARAMETERS. NOTE THAT AN ASTERISK (*) INDICATES STATISTICAL SIGNIFICANCE (p-VALUE < 0.05)