Toward Improved Reliability of Deep Learning Based Systems Through Online Relabeling of Potential Adversarial Attacks

Deep neural networks have shown vulnerability to well-designed inputs called adversarial examples. Researchers in industry and academia have proposed many adversarial example defense techniques. However, they offer partial but not full robustness. Thus, complementing them with another layer of protection is a must, especially for mission-critical applications. This article proposes a novel online selection and relabeling algorithm (OSRA) that opportunistically utilizes a limited number of crowdsourced workers to maximize the machine learning (ML) system's robustness. The OSRA strives to use crowdsourced workers effectively by selecting the most suspicious inputs and moving them to the crowdsourced workers to be validated and corrected. As a result, the impact of adversarial examples gets reduced, and accordingly, the ML system becomes more robust. We also propose a heuristic threshold selection method that contributes to enhancing the prediction system's reliability. We empirically validated our proposed algorithm and found that it can efficiently and optimally utilize the allocated budget for crowdsourcing. It also integrates effectively with a state-of-the-art black box defense technique, resulting in a more robust system. Simulation results show that the OSRA can outperform a random selection algorithm by 60% and achieve comparable performance to an optimal offline selection benchmark. They also show that OSRA's performance has a positive correlation with system robustness.


I. INTRODUCTION
Deep neural networks (DNNs) have shown noticeable success in handling computer vision (CV) problems, such as image classification [1] and object detection [2]. They have also demonstrated great success in other complicated machine learning (ML) tasks, such as speech recognition [3] and natural language processing [4]. However, DNNs have not yet shown remarkable improvement in learning the true underlying concepts that lead to correct prediction labels, so the relationship between predictions and their reasons is not well established [5]. That is, in the image classification task, as an example, images and the corresponding labels are meaningless from the DNN perspective. There is a semantic gap (misalignment) between how computers and humans interpret the image representation. For that, DNNs suffer from a phenomenon called adversarial examples [6]. Szegedy et al. [7] observed the presence of adversarial examples (evasion attacks) in the image classification task, where it is possible to change the predicted label of an image by adding a well-designed small amount of perturbation. Since then, the proposed methods for crafting adversarial examples have never stopped [8], [9].
In attempting to make DNNs robust against adversarial examples, various solutions have been proposed in the literature [10], [11], [12], [13], [14], [15], [16]. However, these countermeasure techniques have managed to only partially strengthen the DNN models. They make the DNN models more, but not fully, robust [17]. For that, DNNs are not totally reliable, which restricts them from being used in safety- and mission-critical applications. To augment their robustness, they need to be supported by another layer of defense. In this article, we propose a two-stage solution where we utilize a state-of-the-art (SOTA) defense technique as a primary stage (Stage 1) and then complement it with an online selection and relabeling algorithm (OSRA) as an added validation layer (Stage 2). Specifically, we form the proposed framework in two stages to decouple suspicious and nonsuspicious elements, making the system more efficient. That is, only the suspicious elements are affected by the additional overhead of OSRA in Stage 2. The OSRA is a sliding-window-based algorithm that opportunistically utilizes the crowdsourced workers to maximize the ML system's robustness. It strives to use crowdsourced workers efficiently by selecting the most suspicious elements in Stage 1 (the potential adversarial examples) and moving them to the crowdsourced workers to get validated and relabeled. In other words, the OSRA aims to maximally identify and relabel adversarial examples (wrongly predicted samples) under a limited relabeling budget. Moving all the suspicious elements for relabeling is infeasible because we are restricted by a limited validation and relabeling budget. In an online selection algorithm, the inputs come in a random order, the selection is made on the spot (take it or leave it instantly), and, most importantly, we cannot revoke that selection decision [18].
Crowdsourcing is a costly process compared with the automatic prediction process that does not require human intervention. Hence, we consider budgeted crowdsourcing, where a predefined budget is allocated to the process. An opportunistic use of crowdsourced workers means striving to utilize the allocated budget efficiently. Thus, in this article, "opportunistic use of crowdsourced workers" and "utilizing the allocated budget efficiently" imply one another. In this article, we assume that one allocated budget unit is used for requesting a crowdsourced worker to validate one element. Therefore, "the allocated budget" and "the crowdsourced elements" are used interchangeably. Our suggested approach is motivated by mission-critical applications, such as X-ray inspection, where relabeling is expensive (expert doctors do it). For that, our approach is optimized to minimize the number of selected elements (allocated budget) while maximizing the success rate of relabeling. This is achieved by validating the most suspicious elements following our system model, as detailed in Section III.
We also consider utilizing crowdsourcing within our proposed algorithm because it works naturally well in handling image classification tasks [19]. In other words, a crowdsourced worker can easily validate a suspected image by comparing its predicted label (assigned in Stage 1) with the actual representation of that image. Section III-B2 presents a detailed illustrative example that shows how our proposed algorithm works. To further enhance the proposed algorithm OSRA, we propose a heuristic threshold selection method to filter the stream of output confidences from Stage 1 (see Section III-B3).
Our proposed work is generic and can be applied in many application domains. The work in this article is considered just an example of one use case, i.e., applying OSRA in the robust ML domain. The OSRA can be applied in other domains, in particular, the mission-critical domains that are not time sensitive, such as face recognition systems and X-ray inspection.
Our hypothesis is that opportunistically using a limited number of crowdsourced workers can enhance the robustness of a SOTA adversarial ML defense against black box adversarial attacks. Our proposed algorithm OSRA uses an online selection method that prioritizes the selection of the most suspicious elements to be moved to a budget-constrained crowdsourcing process to get verified and relabeled. We expect that the OSRA can give a better success rate of relabeling than that of the random selection method (as the worst case baseline) and achieve competitive performance to an optimal offline selection algorithm (as the best case benchmark). We also provide theoretical and empirical proof to show that our hypothesis holds. To the best of our knowledge, this is the first work that considers budgeted crowdsourcing as a complementary layer to adversarial ML defenses.
The main contributions of this article can be summarized as follows.
1) We propose an OSRA on top of an adversarial example defense method to further enhance the robustness of the DNN-based system.
2) Mathematically, we prove that the proposed scheme maximizes the model's robustness while keeping the allocated budget (the number of crowdsourced workers) low.
3) We propose a heuristic threshold selection method to filter the stream of output confidences that comes from the primary stage and, hence, enhance the performance of OSRA.
4) To validate the effectiveness of the proposed defense scheme, we conducted extensive experiments on an image classification task under evasion attacks, more specifically the black box (transfer-based) threat model attacks.
Through extensive experiments, we aim to answer the following research questions.
a) How does the performance (the success rate of relabeling) of the proposed OSRA compare with two baseline algorithms (the optimal offline selection algorithm and the random selection algorithm)?
b) How optimal is the OSRA's suggested sliding window size?
c) What is the impact of the size of the stream of inputs and the allocated budget on the performance of the proposed algorithm?
d) What is the impact of adding the proposed heuristic threshold selection method for filtering the stream of inputs on the OSRA's performance?
e) How does leveraging OSRA on top of a SOTA model enhance the robustness of a DNN-based system?
The answers to these questions are detailed in Section V-D and summarized in Section V-E.
The rest of this article is organized as follows. Section II presents an overview of the related work. Section III provides the details of the proposed scheme. Section IV describes the mathematical model of the proposed budgeted crowdsourcing defense scheme against adversarial attacks. Section V details the experimental setup, experimental results, discussion, and lessons learned. Finally, Section VI concludes this article.

II. RELATED WORK

A. Heuristic Adversarial ML Defenses
In the best-effort defenses, researchers have investigated various heuristics to improve the robustness of DNNs. Examples of these heuristic techniques include DNNs' distillation [12], DNNs' input transformations [20], generative models [21], adversarial training [24], ensemble training [14], [30], [31], and randomization [22]. However, most of these proposed defensive techniques have been circumvented soon after they were published [17], [32], [33]. They are not theory-backed, so the game between adversaries and defenders keeps going, and the adversary wins in the end. The most effective adversarial defenses in the heuristic category are the adversarial training and ensemble training methods [14]. They have withstood adversarial attacks and achieved partial robustness. Similarly, the adversarial defenses that rely on detection (they capture the potential adversarial examples and reject them), such as autodetection of adversarial examples [23], [34], and runtime detection, such as [35], are not fully robust detectors. Tramèr [36] demonstrated that any claimed robust detector can be converted to a robust classifier, which means that detecting adversarial examples is as difficult as classifying them; because adversarial defense classifiers are not yet totally robust, neither are detectors.
Adversarial training [24] has acquired significant attention from the adversarial ML research community for its reliability and effectiveness. It is the process of crafting adversarial examples while simultaneously training DNN models with these adversarial examples after assigning them correct labels. The literature demonstrates that adversarial training urges a model to obtain robust features within datasets [6]. This process is hard and expensive, and it substantially degrades the accuracy of the model on benign data [37]. Alternatively, ensemble training methods have been investigated.
Ensemble approaches are considered the SOTA technique for various ML problems [38]. They enhance the predictive power of a model by training several submodels and merging their predicted scores. Robust ensembles can be obtained by removing the shared adversarial vulnerabilities in the different submodels of the ensemble. For that, within the ensemble, adversarial examples cannot transfer from one submodel to another. Several works try to encourage submodels' diversity to mitigate adversarial example transferability. For instance, Pang et al. [30] introduced the adaptive diversity promoting regularizer, which promotes various submodels to gain high diversity in the nonmaximal predictions. Kariyappa and Qureshi [31] minimized the vulnerability overlap between submodels by maximizing the cosine distance between the gradients of the individual submodels with respect to the input. Yang et al. [14] proposed a vulnerability diversity metric to be utilized during the ensemble training process to ensure that the submodels have diverse vulnerabilities via minimizing the overlapped vulnerabilities within the ensemble. That is accomplished in a way that aligns well with diversifying the adversarial vulnerability shared by various submodels. Hence, that work achieves better robustness against the transferability of adversarial attacks between submodels. This approach is considered the SOTA in the category of ensemble training methods.
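As a rough illustration of the gradient-diversity idea in the spirit of [31], the following minimal PyTorch sketch (our own simplification, not the authors' code; the loss weighting knob lambda_div is an assumption) penalizes the pairwise cosine similarity between the submodels' input gradients:

import torch
import torch.nn.functional as F

def gradient_diversity_penalty(submodels, x, y):
    # Compute each submodel's loss gradient with respect to the input.
    grads = []
    for model in submodels:
        x_req = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_req), y)
        g, = torch.autograd.grad(loss, x_req, create_graph=True)
        grads.append(g.flatten(start_dim=1))
    # Penalize pairwise cosine similarity: aligned gradients suggest a
    # shared adversarial vulnerability across submodels.
    penalty = 0.0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            penalty = penalty + F.cosine_similarity(grads[i], grads[j], dim=1).mean()
    return penalty

# Training objective (sketch): average cross-entropy of the submodels
# plus a weighted diversity penalty.
# total_loss = mean_ce + lambda_div * gradient_diversity_penalty(submodels, x, y)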

B. Certified Adversarial ML Defenses
To end the arms race between attackers and defenders related to the best-effort defensive techniques, certified robustness has emerged as a new research direction. It is in its infancy, though. Certified defenses provide a quantifiable guarantee that describes the space of inputs that produce errors. Huang et al. [39] developed the first certified robustness system for showing that the output label is constant throughout a specific area. Certified defenses assume that adversaries craft attacks that do not exceed defined distance metrics that quantify the similarity between the original and adversarial examples. Lp norms (L0, L2, L∞) are the common distance metrics used in the literature to quantify similarity [17], that is, how adversarial examples differ from the original ones. This system suffers from a lack of scalability. Scaling to larger DNNs needs stricter assumptions (e.g., the prediction of a specific input point relates to a subgroup of DNN units). These kinds of assumptions imply that the system can no longer offer guaranteed robustness. That is, an adversarial example that breaches the assumptions could not be detected. Reluplex [11] is another certified robustness system that utilizes linear programming solvers to scale to much larger networks. The SOTA in this line of work is randomized smoothing [13], [25], [26], [27], [28]. Smoothing-based certified defenses work by taking the majority vote of the predictions over various randomly distorted copies of the input through the model being defended.
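For intuition, a minimal sketch of the smoothing-style majority vote follows (a simplified illustration of the voting mechanism only, not a certified implementation; the noise level sigma and sample count are assumptions, and a certified version such as [13] would add a statistical test):

import numpy as np

def smoothed_predict(model, x, sigma=0.25, n_samples=100, rng=None):
    # model: a callable mapping a batch of inputs to predicted class labels.
    # Returns the most frequent predicted class among Gaussian-perturbed
    # copies of the input x (a numpy array).
    rng = np.random.default_rng() if rng is None else rng
    noisy = x[None, ...] + sigma * rng.standard_normal((n_samples,) + x.shape)
    labels = model(noisy)                      # shape: (n_samples,)
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]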
The scope of these certified robustness systems is limited because they only ensure that the perturbed inputs are predicted correctly when the resulting adversarial examples are located in a restricted area around the original input. If the resulting adversarial examples are positioned far from that specific area surrounding the original input, they will not necessarily be predicted correctly. It is unattainable for a defender to fully anticipate all the potential attacks when specifying the area encompassing the original input x.

C. Stateful Detection of Adversarial Examples
Stateful adversarial defense differs from the earlier defenses in that it keeps the history of the preceding queries. Protecting Against DNN Model Stealing (PRADA) [29] is an example of stateful detection of adversarial examples. Utilizing the history of queries, PRADA detects black box model extraction adversarial attacks. It investigates only the distribution of distances. For that, its scheme is not robust to the insertion of dummy queries that make the distribution Gaussian. Also, it does not consider how to capture the generation of adversarial examples.
Carlini et al. [15] propose a stateful technique that detects adversarial examples by maintaining a history of the previous queries, so a defender can recognize when a series of queries is suspicious and could be used for creating an adversarial example. It detects query-based black box attacks but cannot detect transfer-based black box attacks. Based on [15], this stateful detection strategy can be applied on top of adversarial training defenses. However, for the sake of simplicity, it is implemented on top of a traditional (nonrobust) model. Studying adversarial examples from the perspective of stateful systems is more realistic. Furthermore, it puts the defenders in a better position when responding to the adversaries [15]. However, stateful defenses do not cover black box (transfer-based) attacks and can only be applied in specific situations, where the end user (a potential adversary) is forced to create an account on the system that hosts the model so that her/his queries are tracked and audited.

TABLE I
COMPARISONS OF DIFFERENT DEFENSE TECHNIQUES AGAINST EVASION ATTACKS
As shown above, the best-effort, certified, and stateful defenses have their limitations, and they are not effective at offering full robustness to DNN models against adversarial examples. In other words, they defend against some, but not all, adversarial examples. Enhancing DNNs' robustness by integrating the most effective and SOTA best-effort defense with an extra layer of security is a promising approach. This security layer could be a budgeted crowdsourcing technique (a human intervention layer) or another automatic layer (e.g., another ML model). While adversarial examples can fool model-based (automatic) adversarial defenses, they cannot fool humans, as adversarial example attacks do not change the actual representation of the image. For that, our crowdsourcing-based (manual) approach can effectively capture and relabel adversarial examples. Table I categorizes the explored adversarial ML defenses based on the adversarial defense approach, the threat model, and the availability of a complementary defense layer. To the best of our knowledge, leveraging budgeted crowdsourcing as an additional validation layer on top of a heuristic defensive technique is lacking. With the absence of automatic and fully robust adversarial defenses, and taking into consideration that adversarial examples fool CV but not human vision, a reliable human-based defensive layer is an important approach that should be adopted.

III. PROPOSED SYSTEM MODEL
Before illustrating our system model, we introduce the relevant key terms and concepts to facilitate understanding of the subsequent sections. In this article, robustness indicates the system's degree of resilience toward adversarial example attacks. The more resilient, the more robust the system is. Our work (OSRA) contributes to the overall robustness of the system through the added robustness (explained further in Section V-C5). On the other hand, the reliability of our DNN-based system implies the probability of correct outputs, i.e., the probability of correct predictions by the model in Stage 1 and the probability of correct relabeling in Stage 2. Thus, the robustness and reliability of our DNN-based system are correlated. In other words, enhancing the robustness of the system results in a system with more reliable outputs. The DNN-based system aims to correctly classify the inputs in settings where adversarial examples are a concern. We measure the reliability of the system by combining the performance of the model in Stage 1 and the performance of relabeling in Stage 2, as explained in Section V-C5. The system's vulnerability means the susceptibility of the system to adversarial example attacks. The success rate of relabeling is the number of correctly relabeled elements divided by the total number of crowdsourced elements. The terms "budget utilization" and "success rate of relabeling" are used interchangeably to indicate OSRA's performance (see Section V-A for further details).
Fig. 1 provides the block diagram of the proposed budgeted crowdsourcing defense scheme against adversarial attacks. The proposed system is composed of two main components. The first component is based on a SOTA adversarial defense algorithm, while the second component is the proposed budgeted crowdsourcing defense layer (OSRA), which complements the SOTA defense algorithm. The OSRA suggests a window size that slides over the stream of output confidences coming from the primary stage (Stage 1) and selects the minimum predicted score (a potential adversarial example) within each sliding window. Then, the OSRA moves the potential adversarial examples to a crowdsourcing process (Stage 2) to be validated and corrected. We control the dependence of OSRA on the output of the model in Stage 1 (to ensure better utilization of OSRA) by motivating the OSRA to work on top of SOTA models. In the next subsections, we provide details of each component.

A. Adversarial Defense Algorithm
For the first part (Stage 1), we consider choosing a SOTA adversarial defense technique that effectively mitigates adversarial examples and minimizes their transferability. An adversarial defense algorithm (an ensemble of DNNs, as an example of the SOTA defense techniques) is used as a primary layer that induces predictions with relatively reliable confidences (correlated with the correct labels). In other words, the relationship between the correct labels and the confidences of the predicted labels is maintained. The higher the output confidence, the higher the probability that the predicted label is correct. Hence, the model's outputs (i.e., confidence scores) will have a meaningful and valid order if they are arranged in ascending or descending order. The relative reliability of the output confidences that come from partially robust models is better than that of traditional models [24], where models are not incorporated with adversarial defense techniques. Thus, if an adversarial image is fed into a traditional model, it will be misclassified with a high output confidence for the wrong label (i.e., the correlation is not maintained).
On the journey of finding the most effective defense technique, we investigated several SOTA adversarial defense techniques [14], [30], [31] and found that the algorithm proposed in [14] is one of the best algorithms for our proposed framework, as it has the most reliable outputs and can maintain high accuracy on clean data. Moreover, it is the most effective defense against black box (transfer-based) attacks [14].
This algorithm is based on a vulnerability diversity metric that is utilized during the ensemble training process to ensure that the ensemble's submodels have diverse vulnerabilities. The diverse vulnerability is accomplished through minimizing the overlapped vulnerabilities in the submodels, which is done in a way that aligns well with diversifying the adversarial vulnerabilities of the ensemble's submodels. For that, that work achieves better robustness against the transferability of adversarial attacks between submodels. This approach can be considered the SOTA in the category of ensemble training methods.
The experiments in [14] show a correlation between the output confidence and the correct prediction. It is a good example of work that aligns well with our proposed technique. Thus, we run our proposed complementary defense technique on top of this work.

B. Budgeted Crowdsourcing: A Complementary Validation Layer
In the proposed budgeted crowdsourcing layer (Stage 2), the confidence of each element forwarded by the previous layer is compared against a set of previous elements depending on the sliding window size. The label of an element assigned by the SOTA algorithm is retained if its confidence is higher than that of at least one element in the window. Otherwise, the element is forwarded to the crowdsourcing process, where humans are asked to analyze the element and retain or change its label. The comparison window can be shifted forward by one or w steps. If the output confidence of the investigated element is less than all the output confidences of the elements in the previous window, the window shifts w steps, and one step otherwise. The intuition behind switching between step sizes is as follows. Since we have a limited crowdsourcing budget, we need to utilize it maximally. That can be achieved by exploring different regions in the stream and assigning at most one budget unit to each region to distribute the budget fairly. To give fair chances to all regions of the stream, the window slides by one step to give the explored region enough chance to find the most suspicious element in that region. As soon as a suspicious element is found, the window slides by w steps to give a chance to the subsequent regions.
In the next subsections, we provide a pseudocode of the proposed OSRA and explain it with a numerical example.
1) Proposed Algorithm OSRA: Our work (OSRA) is motivated by settings where: 1) security is a concern (the adversarial example setting) and 2) the utilized prediction model is SOTA. In the former setting, having a fully robust model (i.e., a model that achieves 100% prediction accuracy) against adversarial example attacks is impossible, as it is an open research problem. For that, we enhance the robustness by suggesting OSRA as an augmentation layer. In the latter setting, by utilizing a SOTA model, the accuracy should always be reasonable (it cannot be 0%). Without utilizing a SOTA model, there is a chance (an extreme case) of using a weak prediction model that results in a stream of false predictions (i.e., a model that achieves 0% prediction accuracy). In such a setting, the true selection is a trivial problem because any selection is correct. Thus, our work is feasible in complementing a SOTA model.
In a budget-constrained crowdsourcing situation, where we are restricted to a limited number of crowdsourced workers, our proposed algorithm (OSRA) opportunistically utilizes the crowdsourced workers to maximize the ML system's robustness. The OSRA strives to use crowdsourced workers efficiently. It selects the elements with minimal prediction scores in the stream and moves them to the crowdsourced workers to be validated and relabeled. In other words, the elements with the selected prediction scores are the potential adversarial inputs that the OSRA can afford to validate and relabel. By this, the number of adversarial inputs gets minimized, and accordingly, the ML system becomes more robust. The work in [40] measures the surprise in the inputs of DNNs. That surprise can be used as a proxy to identify and move suspicious inputs. However, this work requires modifying the architecture of the model, which is infeasible in applications that depend on pretrained models. In addition, unlike our approach, this approach is neither budget constrained nor theoretically proven.
The OSRA utilizes the output confidences streaming from the ensemble of models (Stage 1). It takes the stream of output confidences and does the following.
1) It suggests the optimal sliding window size based on the given allocated budget and the number of expected inputs to be predicted.
2) It compares each output confidence with the latest previous inputs found in the optimal sliding window.
3) If the confidence is found suspicious, the OSRA moves it to a budgeted crowdsourcing process to get validated and corrected (relabeled). Otherwise, the OSRA accepts the prediction scores that come from Stage 1 and sends them directly to the final output (without crowdsourcing).
The inputs of OSRA are: X (the elements, i.e., output confidences, coming from Stage 1), N (the number of elements), b (the allocated budget), and a sliding window of size w (proposed by OSRA based on N and b). Also, a ComparisonFlag (set to zero as a default value) is used to help in deciding whether the upcoming element is picked up (to go through a comparison process) or skipped (considered as a final output). The comparison process decides whether or not to move the selected element to the crowdsourced worker.
Upon the arrival of an output confidence (an element X[i]) at Stage 2, the OSRA checks two conditions: the eligibility of the element to be picked up for a comparison process (i.e., ComparisonFlag = 1) and the availability of the budget (b). If the eligibility condition is not satisfied (ComparisonFlag = 0, which is the default setting), a set of output confidences of size w is moved to the final output. These elements are also copied into a buffer containing a list of comparison windows. Then, the ComparisonFlag is set to 1. If the budget condition is not satisfied (the budget is exhausted), the OSRA stops working, and the remaining elements in the stream are moved to the final output. However, if both conditions are met (ComparisonFlag = 1 and AllocatedBudget > 0), the element under investigation (the picked-up element) is moved to the comparison process to be checked against the confidences in the previous window (the proposed window of size w). If the selected output confidence is found to be less than all the output confidences in that window, the OSRA moves it to the crowdsourcing process, shifts the comparison window w steps forward, and deactivates the ComparisonFlag (i.e., sets it to zero). A crowdsourced worker validates the corresponding class for the selected element and either relabels it (if Stage 1 has assigned it a wrong class) or confirms that the initially assigned class is correct. The OSRA converts the output confidence of the validated element to 1, as it is now validated and confirmed, then moves it as a final output and decreases the AllocatedBudget by 1. On the other hand, if the picked-up element is found to be greater than one of the output confidences in the previous window, the OSRA moves it as a final output, and the comparison window shifts only one step forward.

Algorithm 1: Proposed Algorithm OSRA.
Input:
  N: the number of elements to be investigated
  X: the elements to be investigated
  b: the allocated budget
Output:
  w: the suggested sliding window size (based on N and b)
  List of the selected elements to be crowdsourced
 1: ComparisonFlag ← 0
 2: i ← 0
 3: while i ≤ N do
 4:   if ComparisonFlag = 0 or AllocatedBudget ≤ 0 then
 5:     if ComparisonFlag = 0 then
 6:       A set of elements of size w are moved as final outputs
 7:       They are copied to a buffer containing a list of comparison windows
 8:       ComparisonFlag ← 1
 9:     else if AllocatedBudget = 0 then
10:       OSRA stops working
11:       The remaining elements are moved as final outputs
12:     end if
13:   else
14:     X[i] is picked up and moved to the comparison process
15:     if X[i] is lower than all the elements in the latest comparison window then
16:       X[i] is selected
17:       X[i] is moved to the crowdsourcing process
18:       The comparison window is shifted forward by w steps
19:       ComparisonFlag ← 0
20:       A crowdsourcing worker validates X[i]
21:       if X[i]'s initially assigned label ≠ TheGroundTruth then
22:         X[i] is relabeled (assigned the correct class)
23:       else
24:         X[i] is validated (confirmed that the initially assigned class was correct)
25:       end if
26:       The output confidence of X[i] becomes 1.0
27:       X[i] is moved as a final output
28:       The AllocatedBudget is decreased by 1
29:     else
30:       X[i] is moved as a final output
31:       i ← i + 1 (the comparison window is shifted forward by one step)
32:     end if
33:   end if
34: end while

The worst case time complexity of OSRA is O((N − w) × w). Since N and w are constants, the time complexity is O(1). To better understand how the OSRA works, we provide an illustrative example (see Section III-B2) that demonstrates the life cycle of a sample stream of output confidences.
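For concreteness, the following minimal Python sketch (our own illustration, not the authors' reference code; the crowdsource callback is an assumption, and the window size w is taken as given since its optimal value comes from Section IV) implements the selection loop of Algorithm 1:

def osra(confidences, w, budget, crowdsource):
    # confidences: stream of Stage 1 output confidences, in arrival order.
    # w: sliding window size; budget: number of affordable crowdsourced checks.
    # crowdsource: callback(index) simulating worker validation/relabeling.
    # Returns the indices of the elements sent to the crowdsourcing process.
    selected = []
    i, n = 0, len(confidences)
    while i + w < n and budget > 0:
        window = confidences[i:i + w]      # latest comparison window
        candidate = confidences[i + w]     # the just-arrived element
        if candidate < min(window):
            # Most suspicious element in this region: crowdsource it, then
            # jump past it so the next window covers fresh elements.
            crowdsource(i + w)
            selected.append(i + w)
            confidences[i + w] = 1.0       # validated: confidence becomes 1
            budget -= 1
            i += w + 1
        else:
            # Not suspicious: slide one step and keep exploring the region.
            i += 1
    return selected

# Example with the first confidences of the stream in Fig. 2 (w = 3):
stream = [1.0, 0.9, 0.6, 0.3, 0.7, 0.5, 1.0, 0.2, 0.3, 0.4, 0.5, 0.8]
picked = osra(stream, w=3, budget=2, crowdsource=lambda idx: None)
print(picked)  # -> [3, 7], matching the elements 0.3 and 0.2 selected in Section III-B2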

2) Numeric Illustration of the Algorithm: This numeric example simulates the role of OSRA in handling an image classification task.
As soon as the first element in the stream (with output confidence 1.0) reaches Stage 2, the OSRA checks the status of ComparisonFlag and the AllocatedBudget. It finds that ComparisonFlag = 0 (i.e., the eligibility condition is not satisfied). Thus, the OSRA moves three elements (corresponding to the size of the proposed sliding window) as final outputs (without going through the crowdsourcing process), copies them to the buffer, and sets ComparisonFlag = 1 (Algorithm 1, lines 6-8). That creates the first comparison window (the first row in the buffer) out of three output confidences, i.e., the first element (1.0) as well as the two subsequent output confidences (0.9 and 0.6).
When the fourth element (0.3) arrives at Stage 2, the OSRA checks the two conditions. Now, it finds that both conditions are satisfied (ComparisonFlag = 1 and AllocatedBudget > 0), so (0.3) is picked up and compared with the first comparison window in the buffer (1.0, 0.9, 0.6). The OSRA checks the element (0.3) and finds it less than all the elements in the comparison window. For that, the OSRA selects and moves (0.3) to the crowdsourcing process, shifts the comparison window w steps forward, and deactivates the ComparisonFlag (sets it to zero) (Algorithm 1, lines 17-19). A crowdsourced worker validates the element and notices that the initially assigned label mismatches the actual representation of the element (how the image looks), so he/she relabels it (lines 20-25). The OSRA then updates the confidence to 1, moves the element as a final output, and decreases the AllocatedBudget by 1 (lines 26-28).
Next, the fifth element in the stream (0.7) arrives at Stage 2, and since the ComparisonFlag is zero, the OSRA moves it and the two subsequent elements to the final outputs, copies them to the buffer, creating (0.7, 0.5, 1.0) as the second comparison window, and sets ComparisonFlag to 1. The eighth element in the stream (0.2) then reaches Stage 2, and since ComparisonFlag = 1, the OSRA picks it up and compares it with the latest comparison window (0.7, 0.5, 1.0). The OSRA finds that (0.2) is less than all the elements in the comparison window, so it sends it to the crowdsourcing process, deactivates the ComparisonFlag (sets it to zero), and decreases the AllocatedBudget by 1. A crowdsourced worker validates the element (0.2) and finds that the initially assigned label mismatches the actual representation of the element, so she assigns it a new label. The OSRA then updates the confidence to 1 and moves it as a final output.
After that, the ninth element (0.3) arrives at Stage 2, and since the ComparisonFlag is zero, the OSRA moves it and the two subsequent elements to the final outputs (without going through the crowdsourcing process), copies them to the buffer, creating (0.3, 0.4, 0.5) as the third comparison window, and sets ComparisonFlag to 1.
Then, the 12th element (0.8) in the stream reaches Stage 2, and since ComparisonFlag = 1, the OSRA picks it up and compares it with the latest comparison window (0.3, 0.4, 0.5). However, this time, the OSRA observes that the picked-up element (0.8) is larger than at least one of the elements in the window. Thus, it is not selected; rather, it is moved directly as a final output, and the comparison window shifts one step forward (lines 30 and 31).

Fig. 2. Illustrative example that shows how the OSRA works. This numeric example illustrates the role of OSRA in handling a stream of output confidences coming from Stage 1. After applying the OSRA to a stream of 20 output confidences, the achieved performance (the success rate of relabeling) is 66.7%.
A similar process continues until AllocatedBudget becomes zero, and then, the remaining elements in the stream are moved as final outputs.
In summary, as illustrated in Fig. 2, at the end of the process of applying OSRA (presented in Section III-B1) to a stream of 20 elements, the achieved performance (the success rate of relabeling) is 66.7%. The details of the performance evaluation are presented in Section V.
3) Enhancing the Proposed Algorithm With a Heuristic Threshold Selection Method: In the stream of output confidences coming from Stage 1, we noticed that the higher the prediction score, the more reliable the prediction is. For that, we propose a heuristic threshold selection method to be applied to the stream of prediction scores coming from Stage 1. Hence, the space of the elements being investigated becomes restricted to the most suspicious elements (the ones with lower prediction scores). To come up with the best threshold, we implemented the following reward-penalize policy (assuming that the distribution of the output confidences of the model at training time and the distribution of the output confidences at inference time are similar).
1) We explore choosing all the model prediction scores located within a relaxed uncertainty interval (the most suspect interval, which contains the wrong predictions) and consider them candidate thresholds. For example, considering (0.3-0.8) as a relaxed uncertainty interval is justified by the high confidence that predictions with scores above 0.8 or below 0.3 are correct.
2) For each threshold candidate, we check only the model prediction scores below this threshold (how many were predicted correctly? how many were predicted wrongly?). Then, we assign a weight to each prediction type (1 for a true prediction and −1 for a false prediction) and sum them all up, getting a value.
3) By iterating over all the threshold candidates (step 2), we come up with a vector of values (each corresponding to a candidate threshold). We select the threshold that corresponds to the largest value in that vector. A code sketch of this policy is given below.
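The following is a minimal sketch of the reward-penalize policy, assuming held-out (score, correctness) pairs from the Stage 1 model and an assumed candidate grid over the relaxed uncertainty interval:

import numpy as np

def select_threshold(scores, is_correct, low=0.3, high=0.8, step=0.01):
    # scores: Stage 1 confidences on held-out data (training-time proxy).
    # is_correct: boolean array, whether each prediction was correct.
    scores = np.asarray(scores)
    weights = np.where(np.asarray(is_correct), 1.0, -1.0)  # +1 true, -1 false
    candidates = np.arange(low, high + step, step)          # relaxed interval
    # For each candidate t, sum the weights of the predictions scoring below t.
    values = [weights[scores < t].sum() for t in candidates]
    # Keep the candidate with the largest summed value.
    return candidates[int(np.argmax(values))]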

IV. MATHEMATICAL MODELING
This section proves (mathematically) that our suggested algorithm (OSRA) can efficiently select the suspicious elements (potential adversarial examples) from a stream of output confidences that come in random order. Then, the OSRA moves them to reliable crowdsourced workers (assumed to be honest and highly skilled) to validate and/or correct them. To this end, the number of adversarial examples is reduced proportionally to the allocated budget, and consequently, the system becomes more robust. The OSRA solves an optimization problem formulated as follows. Let N denote the number of elements to be predicted, b denote the allocated budget for validating and relabeling the selected elements (the most suspicious ones), and C be the classification model. In a real-time setting, the output confidences corresponding to the N elements come sequentially and arbitrarily to OSRA, which selects among them using a sliding-window-based approach. At the same time, OSRA cannot revoke a selection. The goal is to find the optimal sliding window size (w) such that b is minimized and the success rate of relabeling (SR) is maximized.

A. Notations
1) The stream of elements coming from Stage 1 contains N pieces of information, (X_ℓ)_{1 ≤ ℓ ≤ N} [14]. This information is the output confidences corresponding to the elements in the stream. As these output confidences come in real time, they are considered independent and uniformly distributed random variables (RVs) in [0, 1]. No knowledge is provided on the positions of the lowest output confidences, which come on the spot and in random order.
2) Let w be the sliding window size, and let b (a positive integer) denote the allocated budget for validating the potential adversarial examples among (X_ℓ)_{1 ≤ ℓ ≤ N}.
3) Let Z_k refer to the minimum of the preceding w RVs (counted backward from index k − 1).
4) Let A_k denote the event of finding a just-arrived RV X_k smaller than Z_k, i.e., smaller than all of its w predecessors.
5) Let p be a positive integer such that w + p is the index of the first RV that is smaller than its w predecessors, and let E be the event "the number of elements among (X_ℓ)_{p+w ≤ ℓ ≤ N} that are smaller than their w predecessors is exactly equal to b," i.e., there exist exactly b occurred events A_ℓ. The probability of this event is denoted P_{w,b}. In particular, for p = 1, it reduces to the probability of having exactly b selections over the whole stream, denoted P^{(N)}_{w,b}.

B. Statistical Properties of X_i and Z_k

1) The probability density function of X_i is f(x) = 1 for x ∈ [0, 1].
2) The cumulative distribution function (CDF) of X_i is F(x) = x for x ∈ [0, 1].
3) The CDF of min(X_{i_1}, X_{i_2}, ..., X_{i_k}), with i_j ∈ {1, ..., N}, is 1 − (1 − x)^k for x ∈ [0, 1].

In what follows, a closed form of the probability defined in (9) is provided. As such, the optimum value w* of the window size for a given value of the budget b and the number of elements N is determined, i.e., w* = argmax_w P^{(N)}_{w,b}. It may be deduced that such a probability is necessary to compute w*. To this purpose, Theorem 1 yields a recursive formula that links P_{w,b} and P_{w,b−1}. Theorem 2, which closes the computing process, gives the same probability for b = 1. In other words, Theorem 2 is a base theorem that contributes to calculating the optimal window size for N elements when the allocated budget for crowdsourcing is b = 1, whereas Theorem 1 is a generalization of Theorem 2: it contributes to calculating the optimal window size for N elements when the allocated budget for crowdsourcing is more than 1 (i.e., b > 1).
Theorem 1: For an arbitrary positive number b, the induction formula linking P_{w,b} and P_{w,b−1} given in (14) holds, where the auxiliary terms B_{i,j} are given in (15) and the terms C_{ℓ,j} can be computed by induction as in (16). The event A_ℓ is equivalent to the event X_ℓ < min(X_{ℓ−1}, ..., X_{ℓ−w}) (17).

Proof: Two cases are distinguished.
1) If N = b(w + 1), then the probability can be written directly as in (18).
2) If N > b(w + 1), then such a probability can be evaluated by leveraging (17), as in (19)-(21).
As the RVs in the ranges [1, w + k − 1] and [w + k, N] are independent identically distributed (i.i.d.), and for the sake of notational simplicity, such intervals are represented by the number of their elements instead. That is, the probability can simply be denoted P_{w,b} for arbitrary values of w, b, and N. Therefore, (14) is attained.

To finalize the proof of Theorem 1, it is sufficient to evaluate B_{i,j}. Toward this end, let us distinguish two cases.
1) Case 1 (i = j): Using the definitions (1) and (2) alongside (10) and (12), one obtains (22).
2) Case 2 (j > i): The probability can be rewritten as in (23). By induction, (23) can be expressed as (24). Now, substituting (22) into (24), (15) is attained.

Given that A_ℓ is occurring, i.e., X_ℓ < min(X_{ℓ−1}, ..., X_{ℓ−w}), the joint event A_j, ..., A_{ℓ+1} | A_ℓ has two different forms depending on the value of w.
Subcase 1 (w = 1): The joint event in (26) can be written as the union of j − ℓ + 1 independent single events (27). Therefore, (26) can be written as a summation of those events' probabilities. Besides, as the X_i are i.i.d. RVs, these probabilities are equal. Thus, C_{ℓ,j} can be evaluated as in (28).

Subcase 2 (w ≥ 2): Under this subcase, and by assuming that X_ℓ < min(X_{ℓ−1}, ..., X_{ℓ−w}), the quantity Z_k defined in (22) is reduced for an index in the considered range, which yields the decomposition in (30). Consequently, two different subcases can be distinguished depending on the existence of the first event in (30) (i.e., the intersection of events).
Subcase 2.1 (ℓ + w ≥ j): In this case, such an intersection of events is not defined. As a result, (31) holds, where step (a) follows using (10) and (12).
Subcase 2.2 (ℓ + w < j): In this case, the intersection of events is not empty. Now, using (22), (24), and (31), (16) is attained, which concludes the proof of Theorem 1.

Owing to the above, it is sufficient to evaluate P_{w,1} with N > w + 1 so as to evaluate P^{(N)}_{w,m} for m > 1 according to (14).
Theorem 2: Suppose b = 1. Then, the equation in (33) holds.
Proof: First, consider the events {A_k}_{N−w ≤ k ≤ N}, which correspond to the last subinterval [N − w, N] of width less than w + 1. Therein, if A_k holds, then all the events (A_j)_{j ≥ k+w+1} are not defined within the range [1, N], which yields (34). It is worth mentioning that the third term in (34) equals 0 if N − w − 1 < w + 2, i.e., N < 2w + 3. Now, by noticing that the set of events {A_{k+w+1}, ..., A_N} is independent of {A_{w+1}, ..., A_{k−1}, A_k}, applying the Bayes rule, and performing some algebraic operations, (34) becomes (35).
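Since the closed-form recursion is cumbersome, the optimal window size can also be checked empirically. The following Monte Carlo sketch (our own verification aid, not part of the original derivation; the 10^4 trial count mirrors the experimental setup in Section V) estimates the probability of making exactly b selections for a given w and scans for the best w:

import random

def count_selections(n, w, budget):
    # Simulate one stream of i.i.d. uniform confidences and apply the
    # OSRA-style rule: select when the new element is below its window.
    x = [random.random() for _ in range(n)]
    selections, i = 0, 0
    while i + w < n and selections < budget:
        if x[i + w] < min(x[i:i + w]):
            selections += 1
            i += w + 1      # jump past the selected element
        else:
            i += 1
    return selections

def estimate_p(n, w, budget, trials=10_000):
    # Estimate P^{(N)}_{w,b} = Pr(exactly b selections) by simulation.
    hits = sum(count_selections(n, w, budget) == budget for _ in range(trials))
    return hits / trials

def best_window(n, budget, w_range):
    # Scan candidate window sizes and keep the one maximizing the estimate.
    return max(w_range, key=lambda w: estimate_p(n, w, budget))

# Example: best_window(1000, 100, range(2, 20))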

V. EXPERIMENTAL EVALUATION

A. Performance Evaluation
Given a stream of N elements and an allocated budget b, where b < N, let d be the portion of the budget used for validation without relabeling (i.e., when a crowdsourced worker finds that the initially assigned label is correct), and let c be the portion of the budget used for validating and relabeling (i.e., correcting an initially assigned wrong label). Therefore, the OSRA uses c budget units for validating and relabeling and d for validating without relabeling, such that c + d = b. To measure the performance of OSRA, we consider the budget units used for validating and relabeling (i.e., c) because they are associated with relabeling, which contributes directly to improving the system's accuracy. Therefore, the performance metric is the success rate of relabeling SR. Though d contributes to augmenting the reliability of the system in an indirect way (the validation makes us more confident about the output label), we do not consider it, as it has no direct effect on our performance metric (i.e., the success rate of relabeling). The success rate of relabeling SR is defined as the number of corrected (relabeled) elements divided by the overall number of crowdsourced elements. In this article, we interchangeably use the terms "OSRA's budget utilization" and "OSRA's success rate of relabeling." The set of successfully relabeled elements (SRE) contributes to the overall system robustness as an added robustness (AR) as follows: SRE = SR × b and AR = SRE/N. Therefore, SR, SRE, and AR are positively correlated.
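In code form, a trivial helper mirroring the definitions above (the numbers in the comment reuse the worked case of Section V-C5):

def osra_metrics(relabeled, crowdsourced, n):
    # SR: success rate of relabeling; SRE: successfully relabeled elements;
    # AR: added robustness that OSRA contributes to the overall system.
    sr = relabeled / crowdsourced
    sre = sr * crowdsourced          # equals the number of relabeled elements
    ar = sre / n
    return sr, sre, ar

# Example from Section V-C5: 70 of 100 crowdsourced elements relabeled in a
# stream of 1000 -> SR = 0.70, SRE = 70, AR = 0.07 (i.e., 7%).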

B. Experimental Setup
Our work aims at enhancing the robustness of cutting-edge DNNs against black box transfer-based attacks. It can complement any relevant SOTA adversarial defense approach. For that, as an example, we built our experiments on top of an ensemble-based adversarial defense technique [14]. In particular, we used a pretrained ensemble model trained on the CIFAR-10 dataset in a novel ensemble training process that maximizes the diversity between each pair of submodels and minimizes the transferability of adversarial examples. The selected ensemble model comprises three submodels; each is based on ResNet-20 [41]. The final prediction of the ensemble is the average of the output probabilities of the submodels.
The model in Stage 1 was trained for 200 epochs using stochastic gradient descent with momentum 0.9, weight decay of 0.0001, and an initial learning rate of 0.1, decayed by 10x at the 100th and 150th epochs [14]. We applied the ensemble model to a test dataset composed of 1000 adversarial examples corresponding to the original samples of CIFAR-10.
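A PyTorch rendering of the stated training configuration (our own sketch of the hyperparameters above; the stand-in model and the train_loader are assumptions, since [14] trains ResNet-20 submodels on CIFAR-10):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for a ResNet-20 submodel
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by 10x at the 100th and 150th epochs (200 total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150],
                                                 gamma=0.1)

for epoch in range(200):
    for x, y in train_loader:   # train_loader: CIFAR-10 batches (assumed)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    scheduler.step()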
The adversarial examples are generated from a surrogate ensemble, which was configured with three submodels, the skip gradient method (SGM) [42] as the attack methodology, Carlini-Wagner (CW) [17] as the loss function, and attack strength (epsilon) of 0.01. The accuracy of the ensemble (trained submodels) under this setting is 83.2%. The stream of results generated by the ensemble model in the aforementioned experimental setting is used as a base for performing our experiments. We assume that the crowdsourced workers utilized in Stage 2 are reliable (honest and highly skilled).

Fig. 3. We compare the success rate of relabeling of our proposed algorithm (OSRA) against the offline and random selection algorithms. It can be seen that the OSRA achieves higher performance than the random selection algorithm and comparable performance to the offline selection algorithm (the offline benchmark).
In all our conducted experiments, to be more confident about the obtained results, we repeated each experiment 10^4 times and averaged the results. We also considered the default setting to be a dataset of size 1000 and an allocated budget of 10% (i.e., a budget that is enough for crowdsourcing 10% of N).
Table II summarizes the investigated factors in the conducted experiments.

C. Experimental Results
In this section, we present the results of the different experiments conducted in this article. We dedicate Section V-D to going beyond the presented results and showing intuitions and interpretations.
1) OSRA Performance: We compare OSRA against two baselines: an offline and a random selection algorithm. The offline selection algorithm achieves the best performance because all inputs are assumed to be available before the selection process. Hence, it only needs to sort the elements (e.g., in ascending order) and select the first portion of the elements based on the available budget. As shown in Fig. 3, our proposed algorithm achieves a higher performance (success rate of relabeling) than that of the random selection algorithm. It also achieves competitive performance to the offline algorithm. The higher the success rate of relabeling, the more adversarial examples are captured and fixed (labels corrected), and thus, the more added robustness we can achieve. That is, added robustness and budget utilization are linked with each other (see Section V-C5 for further details).
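For reference, the two baselines can be sketched as follows (our own simulation stubs, not the paper's evaluation code; suspicious is an assumed set marking the elements whose Stage 1 label is actually wrong):

import random

def offline_select(confidences, budget):
    # Offline benchmark: sees the whole stream, sorts ascending, and
    # crowdsources the budget-many lowest-confidence elements.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:budget]

def random_select(confidences, budget):
    # Worst-case baseline: crowdsources a uniformly random subset.
    return random.sample(range(len(confidences)), budget)

def success_rate(selected, suspicious):
    # Fraction of crowdsourced elements that actually needed relabeling.
    return sum(i in suspicious for i in selected) / len(selected)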
2) OSRA's Optimal Sliding Window Size: To empirically investigate the optimality of the window size suggested by our proposed algorithm (the theoretical proof is given in Section IV), we performed a set of experiments using various window sizes: 1) two window sizes lower than the suggested window size w (lower by 20% and 40%, respectively) and 2) two window sizes larger than the suggested window size (higher by 20% and 40%, respectively). We observed that the OSRA performance SR is higher when we conducted the experiments using the window of size w (the OSRA-suggested window size), as illustrated in Fig. 4. At the core of OSRA is suggesting the optimal sliding window size; hence, the factors that have an impact on the window size necessarily affect OSRA.

Fig. 4. Investigating the optimality of the suggested sliding window by checking various window sizes. We can see that the window size suggested by the OSRA is optimal. It efficiently leverages the allocated budget to crowdsource and validate the most suspicious elements (the potential adversarial examples).

Fig. 5. Impact of the allocated budget on the success rate of relabeling (considering our algorithm's suggested window and dataset size 1000). We observe a strong negative correlation between the allocated budget and the success rate of relabeling. The more the allocated budget, the less the success rate of relabeling. The OSRA is optimized to encourage using less budget.
3) Impact of Data Size and Allocated Budget on OSRA: The suggested sliding window size depends on N and b (i.e., the size of the stream of output confidences coming from Stage 1 and the allocated budget). To inspect the impact of each on the OSRA performance, we investigated various dataset sizes: 1000, 800, 600, 400, and 200 (fixing the allocated budget to 10% and the window size to w), and different allocated budgets b: 3%, 6%, 9%, 12%, and 15% (fixing the dataset size to 1000 and the window size to w).
We observe that the change in data size N has a trivial impact on the success rate of relabeling SR, i.e., there is no correlation between N and SR. On the other hand, there is a strong negative correlation between the allocated budget b and SR. That is, the more the allocated budget, the less the success rate of relabeling (see Fig. 5). One interpretation is that our proposed algorithm is optimized for reducing the allocated budget (it encourages using a smaller budget). Further related discussion is given in Section V-D. For more convenience, and to easily observe the impact of the allocated budget b, the sliding window size w, and the size of the dataset on the success rate of relabeling, we illustrate them together in Fig. 6. We conducted experiments on various allocated crowdsourcing budgets (5%, 10%, and 15%) and various sliding window sizes (bigger and smaller than w by 20% and 40%, respectively) on datasets of sizes 1000, 800, and 600. We found that the larger the allocated budget b, the lower the achieved performance (lower success rate of relabeling). We also found that the larger the sliding window size, the higher the performance. That is, there is a negative correlation between the allocated budget and the success rate of relabeling (b and SR) and a positive correlation between the sliding window size and the success rate of relabeling (w and SR). A similar trend was shown when explored with different dataset sizes. That means there is no strong correlation between the dataset size and OSRA performance.

Fig. 6. Investigating the success rate of relabeling for various allocated budgets and window sizes (a) when the dataset size is 1000, (b) when the dataset size is 800, and (c) when the dataset size is 600.

4) Enhancing OSRA With a Stream Heuristic Threshold Selection Method:
To enhance the performance of OSRA, we proposed a heuristic threshold selection method to filter the most suspicious elements in the stream of output confidences. We conducted a set of experiments that show the difference in the success rate of relabeling when we consider the stream with and without applying the threshold. We repeated the experiments conducted in Section V-C1 and illustrated in Fig. 3 but with a threshold. We noticed that the negative correlation between the allocated budget and the success rate of relabeling (b and SR) and the positive correlation between the sliding window size and the success rate of relabeling (w and SR) became stronger when we applied the proposed threshold. As illustrated in Fig. 7, the OSRA utilizes almost 70% of the allocated budget compared with 85% for the benchmark (the offline algorithm). This shows that our proposed heuristic threshold scheme can significantly enhance the performance of OSRA.

Fig. 7. We compare the success rate of relabeling of our proposed algorithm (OSRA) against the offline and random selection algorithms using the proposed stream heuristic threshold selection method. It can be seen that the proposed heuristic threshold can significantly enhance OSRA performance. When we apply the proposed heuristic threshold, the negative correlation between the allocated budget and the success rate of relabeling becomes stronger, as does the positive correlation between the window size and the success rate of relabeling.

Fig. 8. Investigating the impact of the allocated budget (as an example of the factors that contribute to the success rate of relabeling) on the system robustness. There is a positive correlation between the allocated budget and the added robustness of the system.
5) Impact of OSRA's Performance on System Robustness: In our work, the system's robustness is determined by the prediction accuracy of the SOTA model (Stage 1) and the added robustness (Stage 2). The added robustness (AR) is what our proposed algorithm OSRA contributes to the overall robustness of the system. In the case shown in Fig. 7, with a stream of 1000 elements, an allocated budget of 100 budget units (i.e., a budget that is enough for crowdsourcing 10% of the elements in the stream), and a success rate of relabeling SR of 70%, we calculate the added robustness as follows: SRE = 70% × 100, which is 70 elements, and AR = 70/1000, which is 7% (as shown in Fig. 8). Considering that the accuracy of the utilized SOTA model is 83.2%, the overall robustness becomes 90.2% (83.2% + 7%). The added robustness can be higher with a larger allocated budget.

D. Discussion
In this section, we go beyond the results reported in Section V and provide intuitions and interpretations of them. An online selection algorithm cannot meet the performance of an offline selection algorithm. An online algorithm is called a competitive algorithm if the ratio between the performance of the online and offline algorithms is bounded [18]. In Section V-C1, we noticed that the performance of OSRA is competitive with the performance of the offline selection algorithm. By repeating the experiments of Section V-C1 10^4 times, we found that the ratio between our proposed algorithm (OSRA) and an optimal offline algorithm is 0.67.
The results in Section V-C2 show that the success rate of relabeling is higher when we conducted the experiments using the optimal window size w. Increasing the window size by a certain percentage lowers the potential of selecting enough elements to use up the allocated budget b. In other words, the allocated budget may not be used completely, as the total number of selected elements might be lower than the available budget. Likewise, the added robustness is lower when we decrease the window size. A lower sliding window size w means a lower potential of finding a suspicious element. Thus, the optimal window size should be large enough to select a suspicious element and small enough to allow for selecting enough elements to meet the allocated budget b.
The results in Section V-C3 demonstrate a negative correlation between the allocated budget b and the success rate of relabeling SR. The results also demonstrate a positive correlation between w and SR. This occurs because a higher b means a smaller suggested window w (that is how our proposed algorithm adapts to select a suspected element from each window), as explained in Section IV. Therefore, a smaller window size means that the potential of finding a valid suspected element is lower, and vice versa.
The experimental results in Section V-C3 show that there is a trivial impact of the dataset size N on the success rate of relabeling SR. This trivial impact results from the fact that the dataset size proportionally relates to the allocated budget (we consider it as a percentage of N). Therefore, whenever we increase or decrease the dataset size, the allocated budget changes accordingly.
Comparing the results in Section V-C1 with the results in Section V-C4, where a heuristic threshold selection method was applied, we observe a remarkable improvement in the success rate of relabeling. The threshold narrows down the space of the suspicious elements, which increases the probability of selecting the most suspicious elements. Without filtering the suspected elements, there will be many cases where our proposed algorithm selects less suspected elements (elements that have the lowest output confidence within the sliding window but that, in reality, have relatively high output confidences and are not suspected). Hence, without the threshold, the elements corresponding to these locally lowest output confidences get selected to be crowdsourced even though they are likely correct.

Fig. 1. Block diagram of the proposed budgeted defense scheme against evasion attacks. The proposed scheme comprises two stages: a SOTA adversarial defense algorithm (Stage 1) and the proposed budgeted crowdsourcing layer (Stage 2), which complements Stage 1 to form a more robust system against evasion attacks (also known as adversarial examples).

TABLE II
SET OF EXPERIMENTS AND THEIR CORRESPONDING INVESTIGATED FACTORS