Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression

Abstract— Deep learning (DL) models have significantly improved the performance of text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have aimed to develop adversarial attacks against DL models in realistic black-box settings (i.e., assuming no model knowledge is accessible to attackers). These attacks typically operate with a two-phase framework: (1) sensitivity estimation through gradient-based or deletion-based methods to evaluate the sensitivity of each token to the prediction of the target model, and (2) perturbation execution to craft adversarial examples based on the estimated token sensitivity. However, gradient-based and deletion-based methods used to estimate sensitivity often face issues of capturing token directionality and overlapping token sensitivities, respectively. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME or SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA's attack performance on DL models executing text classification on the IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity datasets and DL models conducting text regression on the My Personality, Drug Review, and CommonLit Readability datasets. The proposed XATA outperformed the existing gradient-based and deletion-based adversarial attack baselines in both tasks. These findings indicate that the ever-growing research focused on improving the explainability of DL models with additive feature attribution explainable methods can provide attackers with weapons to launch targeted adversarial attacks.

The overall accuracy of the DL-based toxic content detector dropped from 92% to 22% when legitimate examples were slightly perturbed [9]. The significant impact that perturbations have on a DL model's behavior has motivated researchers to explore adversarial attacks that generate adversarial examples for a myriad of downstream tasks, such as model robustness assessment (e.g., TextFlint) [10], [11] and dataset augmentation (e.g., CoDA) [12]. Increasingly, a greater impetus has been placed on assessing the effects of adversarial attacks on DL-based models performing critical text classification tasks (i.e., categorizing input text into a discrete label) or text regression tasks (i.e., assigning a continuous value to a text input) in high-impact e-commerce, medical, social media, and business intelligence applications. For simplicity, we use the term "adversarial text attack" to refer to adversarial attacks on text classification and text regression.
Adversarial text attacks can be categorized into two types based on how much information an attacker has about the target model [13], [14]. The first type is white-box attacks (worst-case scenarios), where attackers know all model details, such as training data and parameters [6], [15]. However, it is often unrealistic to assume that model details are accessible to attackers. The second type is black-box attacks, which assume attackers have no access to model details and can only query the target model to infer its information [16], [17]. Black-box attacks have attracted significant attention from researchers due to their realism compared to white-box attacks.
Prevailing adversarial text attacks (e.g., DeepWordBug [7], TextFooler [18], and Probability Weighted Word Saliency (PWWS) [19]) operate with a two-phase framework: (1) sensitivity estimation, which measures the sensitivity of the model's prediction to each input token (e.g., a word or character), and (2) perturbation execution, which crafts perturbed adversarial examples based on token sensitivity. Accurately identifying sensitive tokens in the first phase is the foundation for successfully executing the perturbation in the second phase [18]. Existing methods for token sensitivity estimation include gradient-based and deletion-based methods. Gradient-based methods use the gradients of the model prediction with respect to the input tokens to compute the sensitivity. However, these techniques ignore the direction of sensitivity (i.e., positive or negative). As a result, the effects of perturbing different words may cancel each other out. Deletion-based methods remove a token from the example and compute the sensitivity of the token based on the difference in model outcome due to the deletion. However, the estimated token sensitivities may overlap (i.e., the sensitivity of a token combination is no greater than the sum of the individual sensitivities), resulting in a suboptimal selection of sensitive tokens to be perturbed.
At their core, black-box adversarial attack methods and local post-hoc explainable methods, which are often used to explain the outputs of DL models, both aim to estimate token sensitivity, operate at the individual-example level, and are model agnostic [3], [20]. Hence, local post-hoc explainable methods could be used to estimate sensitivity for crafting adversarial attacks. In particular, additive feature attribution local post-hoc explainable methods, namely Local Interpretable Model-agnostic Explanations (LIME) [21] and SHapley Additive exPlanations (SHAP) [22], train a linear explanation model to approximate the target DL model's behavior locally. These techniques could overcome the drawbacks of existing token sensitivity estimation methods for two key reasons. First, they capture each token's sensitivity through the parameters of the trained linear explanation model [21], [22], [23]. Hence, the direction of sensitivity (negative or positive) can be captured, as the parameters of the linear model are unbounded. Second, the additive attribution design of these techniques requires that the total impact of a token set equal the sum of the individual sensitivity scores. This requirement helps address the issue of overlapping token sensitivities.
In this study, we propose an eXplanation-based method for crafting Adversarial Text Attacks (XATA) on classification and regression. Specifically, we use the additive feature attribution explainable models LIME and SHAP to measure the sensitivity of each token to the target model prediction. We then perturb the tokens according to the sensitivity scores provided by the explanation model, adopting the commonly used visually-similar-character replacement perturbation strategy proposed by [7] in the second phase. For simplicity, we name the LIME-based and SHAP-based attack methods XATA-LIME and XATA-SHAP, respectively. We executed experiments to test XATA's performance on text classification and text regression. Consistent with previous studies [3], [24], the classification task was based on the IMDB Movie Reviews, Yelp Reviews-Polarity, and Amazon Reviews-Polarity datasets; the regression task was based on the My Personality, Drug Review, and CommonLit Readability datasets. We found that XATA-LIME and XATA-SHAP crafted more effective adversarial examples than the baseline techniques. Since SHAP often produces better explanations than LIME [22], these results indicate that methods with more substantial explanatory power can execute sensitivity estimation with higher accuracy and mount more impactful adversarial attacks on DL. These findings suggest a contradiction (trade-off) between explainability and adversarial robustness: approaches providing more substantial explanatory power (e.g., SHAP over LIME) can lead to increased adversarial attack effectiveness against DL. Our contributions are three-fold. First, we propose a new method to craft adversarial text examples for black-box attacks on two common NLP tasks (text classification and text regression). The proposed attack method outperformed existing baselines (i.e., gradient-based and deletion-based attack methods) on multiple datasets.
Second, this study helps reveal connections between black-box adversarial attack methods and local post-hoc explainable DL methods. This study also identifies the advantages of using additive feature attribution explainable methods for adversarial text attacks.
Finally, we empirically demonstrate the contradiction (trade-off) between explainability and adversarial robustness in DL models. As researchers continually advance the explainability of DL models, they also provide attackers with tools to launch targeted and effective adversarial attacks.

The rest of the paper is organized as follows. Section II reviews the related work. Section III details the proposed XATA. Section IV describes the evaluation results. We discuss XATA's application in white-box attack scenarios and the trade-offs between explanation and adversarial robustness in Section V. Section VI concludes this research.

II. RELATED WORK

A. Adversarial Attacks on Text Classification and Text Regression
Adversarial attacks on text classification and text regression (adversarial text attacks) can be divided into white-box attacks and black-box attacks [13]. Compared with white-box attacks, attackers conducting black-box attacks are unaware of a model's details. Still, they can infer the target model's information with specialized queries [16]. Most studies have created adversarial attack methods for black-box scenarios as they are more realistic [25], [26].
Mathematically, we denote a legitimate example w consisting of a sequence of N tokens as w = [w_1, ..., w_n, ..., w_N]. Attackers can access the model's prediction y = F(w) as well as the model output's continuous scalar value F_y(w) that is used to reach the prediction y. For classification, F_y(w) is the predicted probability for class y; for regression, F_y(w) is the same as F(w). A black-box adversarial text attack aims to craft an adversarial example w^A that misleads the model to a preferred prediction ŷ without accessing model details (i.e., it is model agnostic). The difference between ŷ and the initial prediction y should exceed a certain degree c, i.e., D(y, ŷ) ≥ c. D(y, ŷ) is often simplified to y ≠ ŷ for classification and to the L_1 norm (i.e., |y − ŷ| ≥ c) for regression. Due to the discrete nature of text data, text examples are restricted to a space W. The crafted example w^A should also be similar to the original example w. We use S(w, w') to denote a domain-specific similarity function S : W × W → R+, and the required minimum similarity between w^A and w is a threshold ε. The attack task can then be formally stated as in (1).
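Equation (1) is not reproduced verbatim above, so the following LaTeX restatement is a reconstruction from the surrounding definitions rather than a copy of the paper's formula (the similarity threshold is written as ε):

```latex
% Reconstruction of the attack objective (1) from the surrounding definitions:
\begin{equation}
\text{find } \mathbf{w}^{A} \in \mathcal{W}
\quad \text{s.t.} \quad
D\!\left(y, \hat{y}\right) \ge c,
\qquad
S\!\left(\mathbf{w}, \mathbf{w}^{A}\right) \ge \epsilon,
\qquad
\hat{y} = F\!\left(\mathbf{w}^{A}\right).
\end{equation}
```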

B. Methods of Crafting Adversarial Examples
Since the optimization in (1) is hard to solve, past studies adopted a two-phase heuristic solution to craft the adversarial examples. Phase 1 is sensitivity estimation, which measures the sensitivity of the prediction to each input token (e.g., word or character). Phase 2 is perturbation execution. As perturbing a highly sensitive token yields a more significant prediction change than perturbing a slightly sensitive one, the most sensitive tokens are perturbed to craft adversarial examples.
Formally, s^y_{w_n} denotes the sensitivity score of w_n towards the prediction y. Phase 1 aims to estimate the sensitivity score s^y_{w_n}, ∀n, with a mapping function, as shown in (2). We use s^y to denote the collection of s^y_{w_n}, ∀n, i.e., s^y = [s^y_{w_1}, s^y_{w_2}, ..., s^y_{w_N}]. Phase 2 aims to craft an adversarial example w^A that can mislead the model. We denote the perturbation function as f_P(·), and w^A is crafted by (3). Identifying sensitive tokens in (2) lays the foundation for executing effective perturbation in (3). Two types of methods have been proposed to solve (2): gradient-based and deletion-based. We describe each in the following sub-sections.
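For reference, a schematic restatement of the two phases is given below; the mapping g is a placeholder name (the paper's own notation for (2) is not reproduced here), while f_P is the perturbation function defined above:

```latex
% Schematic restatement of (2) and (3); g is a placeholder for the
% sensitivity-estimation mapping, f_P is the perturbation function.
\begin{align}
s^{y}_{w_n} &= g\!\left(w_n, \mathbf{w}, F_y\right), \quad \forall n,
  \qquad \mathbf{s}^{y} = \left[s^{y}_{w_1}, \ldots, s^{y}_{w_N}\right]
  && \text{(Phase 1)} \\
\mathbf{w}^{A} &= f_{P}\!\left(\mathbf{w}, \mathbf{s}^{y}\right)
  && \text{(Phase 2)}
\end{align}
```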
1) Gradient-Based Methods: Proposed by Papernot et al. [27], gradient-based methods compute the gradients of the model's outcome with respect to the input tokens. Since a higher gradient means that changing the token leads to a larger change in the outcome, the gradient is used as the sensitivity score for the subsequent phase [8], [9], [28]. However, attackers cannot access the details of the target model. Therefore, a surrogate model is trained to approximate the target, and the token sensitivity is then estimated from the surrogate model [29]. Formally, F̃ and F̃_y denote the surrogate counterparts of F and F_y, and w^e_n denotes the embedding vector of w_n. The sensitivity score s^y_{w_n} is calculated with (4), where ∇_{w^e_n} is the gradient with respect to w^e_n and ||·||_2 is the L_2 norm. As the actual value of a token indicates how strongly it is expressed, (4) is extended by multiplying the gradient with the token embedding [30], as given by (5).

2) Deletion-Based Methods: Deletion-based methods drop a token from the example and query the target model with the new example. The difference in the model outcome before and after the deletion reflects the sensitivity of that token; [8] computes the sensitivity of w_n from this difference according to (6) (a small illustrative sketch of this scoring appears at the end of this subsection). DeepWordBug [7] extends the computation by considering the sequentiality of the input. For each token w_n, the Temporal Head Score (THS) is defined as the outcome difference between two heading parts of an example, while the Temporal Tail Score (TTS) quantifies the difference between two tailing parts of the example; a hyperparameter λ controls the weight of TTS, and the calculation is given in (7). PWWS [19] combines token saliency and the continuous scalar F_y(w). The saliency of a token is defined as in (6), except that w_n is replaced with an unknown token instead of being deleted. With Δp* representing the maximum change of F_y(w) after w_n is replaced under different strategies, the sensitivity is calculated by (8). TextFooler [18] further extends the computation by considering the prediction, as given by (9).

3) Limitations of Existing Methods: Despite their value, existing methods have several limitations. Since gradient-based methods encode discrete tokens (e.g., words) as real-valued embedding vectors, the gradient ∇_{w^e_n} F̃_y(w) corresponds to the embedding dimensions and does not directly reflect the overall influence at the token level [3]. Consequently, the gradient needs to be aggregated to the token level (e.g., word level) through the L_2 norm ((4) and (5)). This computation misses the direction of sensitivity (i.e., positive or negative). Therefore, the effects of different words may cancel each other out when the attack is gradient-based.
Although deletion-based methods can distinguish tokens that contribute negatively or positively, they only compute the sensitivity score one token at a time. In reality, multiple tokens may need to be perturbed to craft an adversarial example. As a result, sequentially perturbing the tokens based on deletion-based sensitivity scores may not be effective, as the token sensitivities may overlap. For example, assume that the scores of three tokens w_1, w_2, and w_3 are s_1, s_2, and s_3 (s_1 > s_2 > s_3). Further, assume w_1 is semantically close to w_2, while w_3 is semantically distant from w_1 and w_2. The total effect of perturbing w_1 and w_2 may not be more significant than perturbing w_1 and w_3, because the sensitivity of w_2 may overlap with that of w_1, while the sensitivities of w_1 and w_3 overlap less.
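To make the deletion-based scoring in (6) concrete, the sketch below scores each token by the drop in the target-class probability when that token is removed. predict_proba is an assumed black-box query interface (text in, class probabilities out), not part of the cited methods' released code:

```python
def deletion_sensitivity(tokens, predict_proba, y):
    """Score each token by how much removing it lowers the probability of class y.

    tokens:        list of word tokens, e.g. ["the", "plot", "was", "flat"]
    predict_proba: assumed black-box query, text -> list of class probabilities
    y:             index of the originally predicted class
    """
    base = predict_proba(" ".join(tokens))[y]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]          # example with token i deleted
        scores.append(base - predict_proba(" ".join(reduced))[y])
    return scores
```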

III. EXPLANATION-BASED ADVERSARIAL ATTACKS
This section first describes explainable DL, namely additive feature attribution explainable methods, to identify why they can help craft adversarial text examples. We then describe and formalize the proposed eXplanation-based Adversarial Text Attacks (XATA) method.

A. Feasibility of Explainable DL Methods for Adversarial Attacks
DL models are often criticized for being black boxes, where the results of the model lack interpretability [31], [32]. Lack of interpretability can prevent developers from improving the models and can cause end-users to distrust model decisions [33]. These concerns have given rise to research focused on enhancing the explainability of DL models [20]. Explainable DL methods can be broadly grouped into global and local methods [3]. Global explainable methods enable end-users to inspect and visualize the model structures and parameters. Local explainable methods focus on the prediction rationale for a particular example to reveal the role of each token in the example (represented by a score that reflects its sensitivity). Since the mathematical description is the same as (2), local explainable methods can execute the same task as Phase 1 in adversarial text attacks, and their outcome satisfies the requirement of Phase 2. Consequently, local explainable methods could be leveraged for adversarial text attacks.
Local explainable methods can be further divided into intrinsic and post-hoc methods [34]. Compared with intrinsic methods that design self-explainable models to offer explanations, post-hoc methods introduce an explanation model (e.g., linear regression) after the target model is trained to analyze (e.g., locally approximate) the target model's behavior [33]. Post-hoc methods can produce an explanation for a model without accessing the details (e.g., structure, parameters) of the model, and therefore they can be model agnostic. Hence, post-hoc methods could be used for black-box attacks. Fig. 2 illustrates the connections between local post-hoc methods and black-box adversarial text attacks.
In summary, local post-hoc methods (1) aim to estimate the token sensitivity, (2) operate at the individual example level, and (3) can be model-agnostic, satisfying the requirements of black-box attacks. Since local post-hoc methods are suitable for black-box attacks, we review them further in the following sub-section.

B. Advantages of Additive Feature Attribution Explainable Methods for Adversarial Attacks
To date, two types of local post-hoc methods have prevailed: rule-based explainable methods (e.g., Anchors [35]) and additive feature attribution explainable methods (e.g., LIME [21]). Rule-based explainable methods operate by finding a set of rules that capture how the tokens covered by those rules contribute to the model's output. Since estimating sensitivities requires correlating each token with the model output, rule-based methods are not the most suitable, as they can only estimate sensitivity for tokens that appear within the rule set. In contrast, additive feature attribution explainable methods correlate all input tokens by defining a linear explanation model and assuming that the contributions of the tokens within an input are additive [22]. Particularly, given an example w, the explanation model F_E is defined by (10), where φ = {φ_n}_{n=1}^{N} and b are the parameters of the linear model, and E(w_n, w) is a binary variable that indicates the existence of token w_n in w, as given by (11). The parameter φ_n reflects the change in the model output without token w_n and is therefore equivalent to the sensitivity of token w_n. LIME (the most fundamental) and SHAP (more advanced than LIME; it satisfies the properties of local accuracy, missingness, and consistency) are the two prevailing and representative additive feature attribution explainable methods [22].
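Written out, the additive explanation model just described takes the following form; this mirrors the standard additive feature attribution definition and stands in for (10) and (11) rather than reproducing them verbatim:

```latex
% Additive feature attribution explanation model, standing in for (10) and (11):
\begin{equation}
F_{E}(\mathbf{w}) = b + \sum_{n=1}^{N} \phi_{n}\, E(w_n, \mathbf{w}),
\qquad
E(w_n, \mathbf{w}) =
\begin{cases}
1, & \text{if } w_n \text{ appears in } \mathbf{w},\\
0, & \text{otherwise}.
\end{cases}
\end{equation}
```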
Compared with gradient-based methods, the direction of each token's sensitivity (i.e., positive or negative) can be captured, since the parameters φ in the linear explanation model are unbounded (i.e., can be negative or positive). Moreover, as the contributions of the tokens are additive, the total impact of a set of tokens is equal to the sum of the sensitivity scores of the individual tokens. Hence, this design also addresses the overlapping-token-sensitivity issue that deletion-based methods face. Such advantages can help precisely estimate token sensitivity and craft effective adversarial examples.

C. Proposed Explanation-Based Adversarial Text Attacks (XATA) Method
Consistent with prior studies, the proposed XATA has two phases. Phase 1 is sensitivity estimation, which leverages additive feature attribution explainable methods to compute the sensitivity of each input token. Phase 2 is perturbation execution, which conducts a perturbation according to the token sensitivities from the explainable methods. We first demonstrate the attack method on classification and then show how the method can be applied to regression. The framework of the method is shown in Fig. 3.

D. Adversarial Attack on Text Classification
Consistent with prior black-box attack settings for classification [3], [18], [19], [36], [37], the attacker can query the target model with a text example w consisting of a sequence of N tokens (w = [w_1, w_2, ..., w_N]). The model returns the predicted class y, i.e., F(w) = y, as well as the predicted probability for class y, F_y(w).

1) Phase 1: Additive Feature Attribution Explainable Methods for Sensitivity Estimation: Phase 1 aims to compute the sensitivity of each token, s^y_{w_n}, ∀n, via an explanation model. The explanation model therefore acts as the surrogate model in the black-box attack, i.e., F̃_y is represented by F_E. As φ_n in (10) is the same as the token sensitivity s^y_{w_n}, we use s^y_{w_n} for consistency. We use w̃ to denote a perturbed example of w (e.g., removing w_n yields w̃ = [w_1, ..., w_{n−1}, w_{n+1}, ..., w_N]). Similar to (10), the surrogate model is defined over such perturbed examples. The surrogate model's parameters θ, including the bias b and the sensitivity scores s^y, are optimized to approximate the target model's behavior around w. With the optimized θ, the tokens are sorted by their sensitivity scores s^y, where I^y is the resulting index vector and f_S is the function that sorts the tokens in descending order of sensitivity.
2) Phase 2: Perturbation Execution: As mentioned above, the only information Phase 2 requires from Phase 1 is the token sensitivity scores. Hence, most perturbation strategies [7], [8], [18], [19], [28], [37] and common perturbations (e.g., insertion, removal, or replacement) from prior studies apply to our method. Furthermore, additive feature attribution explainable methods can explain models at either the word or the character level, depending on the perturbation granularity of w. We select the visually similar character replacement strategy proposed by [7], where characters in word tokens are replaced with visually similar characters. For instance, "o" can be changed to "0" and "1" can be changed to "l" when crafting adversarial examples. Such a perturbation is often human-imperceptible and therefore retains the semantics of the legitimate example. It is also widely adopted by previous studies [9], [38].
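A minimal sketch of the visually-similar-character replacement follows. The substitution table is illustrative; apart from the "o"→"0" style swaps mentioned above, its entries are assumptions rather than the exact table used by [7]:

```python
# Illustrative homoglyph table; only a few entries, not the exact table from [7].
HOMOGLYPHS = {"o": "0", "O": "0", "l": "1", "i": "1", "a": "\u0251"}  # '\u0251' is 'ɑ'

def visually_similar_replace(word):
    """Return a visually similar variant of `word` by swapping the first character
    that has a homoglyph; return the word unchanged if none of its characters do."""
    for pos, ch in enumerate(word):
        if ch in HOMOGLYPHS:
            return word[:pos] + HOMOGLYPHS[ch] + word[pos + 1:]
    return word
```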
The perturbation is conducted according to the sensitivity scores. We first generate a replacement w'_{I^y_1} for the most sensitive word w_{I^y_1} with the perturbation function f_P. Then, we replace w_{I^y_1} with w'_{I^y_1} to obtain the crafted example w'. If the crafted example cannot mislead the target model, we repeat the same perturbation on the second most sensitive word, the third most sensitive word, and so on, until the crafted example successfully misleads the target model into making a prediction ŷ (ŷ ≠ y). In this way, we obtain the adversarial text example w^A. Since LIME and SHAP are the two prevailing additive feature attribution explainable methods (for the reasons described in Section III.B), we elaborate on how XATA operates with LIME and SHAP.
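Putting the two phases together for classification, a sketch of the greedy perturbation loop is shown below; predict, estimate_sensitivity, and perturb_token are assumed helpers standing in for the target-model query, the Phase-1 explanation step, and a token-level perturbation such as the homoglyph sketch above, and the loop mirrors the description rather than a released implementation:

```python
def xata_attack(tokens, predict, estimate_sensitivity, perturb_token, max_perturb):
    """Greedily perturb the most sensitive tokens until the predicted label flips.

    tokens:               word tokens of the legitimate example
    predict:              assumed black-box query, text -> predicted class label
    estimate_sensitivity: assumed Phase-1 routine (e.g., LIME/SHAP), tokens -> scores
    perturb_token:        Phase-2 token perturbation, e.g. visually_similar_replace
    max_perturb:          cap on the number of perturbed tokens
    """
    y = predict(" ".join(tokens))
    scores = estimate_sensitivity(tokens)
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    adv = list(tokens)
    for i in order[:max_perturb]:
        adv[i] = perturb_token(adv[i])          # perturb the next most sensitive token
        if predict(" ".join(adv)) != y:         # stop as soon as the model is misled
            return adv                          # adversarial example w^A
    return None                                 # attack failed within the budget
```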
3) LIME-Based Attack: XATA-LIME: According to LIME's algorithm, we use five steps to estimate the token sensitivities for attacks.

Step 1: Generate a collection W̃ = [w̃_1, w̃_2, ..., w̃_M] by randomly removing some tokens from w.

Step 2: Feed W̃ to F_y to obtain the predicted probabilities Ỹ = [F_y(w̃_1), F_y(w̃_2), ..., F_y(w̃_M)].

Step 3: Transform each w̃_m into a mask vector ṽ_m, ∀m. If a word is removed in w̃_m, its corresponding dimension in ṽ_m becomes 0, and 1 otherwise; ṽ_m therefore has one dimension per token of w. The same transformation is applied to w to obtain v.

Step 4: Measure the proximity between v and ṽ_m to obtain the weights π^LIME_m = π_LIME(v, ṽ_m), where π_LIME(v, ṽ_m) is an exponential kernel with width σ defined in (16), and ||v||_2 denotes the L_2 norm, i.e., ||v||_2 = (Σ_{n=1}^{N} v_n^2)^{1/2}.

Step 5: Train the explanation model on the pairs (ṽ_m, F_y(w̃_m)) weighted by π^LIME_m, where Ω(θ), the number of non-zero weights, measures the complexity of the explanation model and thus also needs to be minimized to ensure explanation quality. Optimization techniques such as gradient descent can determine the model parameters, including the sensitivity score for each token. Phase 2 is then executed as described in Section III.D.2.

4) SHAP-Based Attack: XATA-SHAP: XATA-SHAP leverages the Kernel SHAP explainable method to estimate the token sensitivity scores s^y_{w_n}, ∀n. The process of XATA-SHAP is similar to that of XATA-LIME, except that the Shapley kernel is adopted in Step 4 to calculate the proximity. In particular, the Shapley kernel weight is computed from |v|, the number of non-zero dimensions, i.e., |v| = Σ_{n=1}^{N} I[v_n = 1]. Accordingly, the optimal parameters of the surrogate model are obtained by (19). The Shapley kernel guarantees desirable properties for sensitivity estimation (e.g., local accuracy, missingness, and consistency [22]); thus, it estimates token sensitivity more accurately than LIME. Moreover, the sensitivity scores estimated by SHAP, called Shapley values, explain each token's contribution toward the model's outcome. A Shapley value is highly informative, as it predicts how the model would behave without each token.
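The proximity kernels used in Step 4 and its SHAP counterpart are not reproduced above; the forms below are the standard LIME exponential kernel [21] and Kernel SHAP kernel [22], offered as a reconstruction consistent with the descriptions in the text rather than verbatim copies of the paper's equations:

```latex
% Standard kernel forms from LIME [21] and Kernel SHAP [22].
% (For the all-absent and all-present coalitions, the Shapley weight is
%  conventionally treated as arbitrarily large.)
\begin{align}
\pi_{\mathrm{LIME}}(\mathbf{v}, \tilde{\mathbf{v}}_m)
  &= \exp\!\left(-\,\|\mathbf{v} - \tilde{\mathbf{v}}_m\|_2^{2} \,/\, \sigma^{2}\right), \\
\pi_{\mathrm{SHAP}}(\tilde{\mathbf{v}}_m)
  &= \frac{N-1}{\dbinom{N}{|\tilde{\mathbf{v}}_m|}\;
      |\tilde{\mathbf{v}}_m|\,\bigl(N-|\tilde{\mathbf{v}}_m|\bigr)}.
\end{align}
```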

E. Adversarial Attack on Text Regression
The proposed attack method can be extended to text regression. In text classification, F_y(w̃_m) ranges from 0 to 1, but in text regression F_y(w̃_m) is a continuous, unbounded, real-valued scalar. However, an unbounded F_y(w̃_m) does not affect the application of XATA, as additive feature attribution explainable methods like LIME and SHAP can also estimate sensitivity scores for text regression. Therefore, the sensitivity estimation procedure for text regression is the same as for text classification. XATA stops when the crafted example misleads the target model to a certain degree, e.g., |F_y(w') − F_y(w)| > c.
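As one way to obtain per-token Shapley values for a scalar-output (regression) model, the sketch below queries Kernel SHAP through the binary token-presence encoding described above; masked_predict, the all-zeros background, and the nsamples value are illustrative choices rather than the paper's configuration, and return shapes can differ slightly across shap versions:

```python
import numpy as np
import shap  # assumed available; return shapes vary slightly across shap versions

def token_shapley_values(tokens, predict_scalar, nsamples=200):
    """Estimate one Shapley value per token for a scalar-output (regression) model.

    tokens:         list of word tokens of the example
    predict_scalar: assumed black-box query, text -> real-valued prediction
    """
    def masked_predict(masks):
        # Map binary mask rows (1 = keep token, 0 = drop it) back to text queries.
        texts = [" ".join(t for t, keep in zip(tokens, row) if keep) for row in masks]
        return np.array([predict_scalar(s) for s in texts])

    background = np.zeros((1, len(tokens)))            # baseline: all tokens removed
    explainer = shap.KernelExplainer(masked_predict, background)
    values = explainer.shap_values(np.ones((1, len(tokens))), nsamples=nsamples)
    return np.asarray(values).reshape(-1)              # one sensitivity score per token
```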

IV. EVALUATIONS
The proposed XATA is evaluated with experiments conducted on text classification and text regression. We summarize the datasets, attacked DL models, evaluation metrics, and results for each task in the ensuing sub-sections.

A. XATA on Text Classification

2) Baseline Attack Methods:
We compared the proposed XATA against the adversarial text attack baselines described in Section II: TextBugger [9], Gradient*Input [30], DeepFool [8], DeepWordBug [7], PWWS [19], and TextFooler [18]. We used the visually similar character replacement strategy introduced in Section III for all baselines to ensure a fair comparison. In this way, the performance differences are attributed entirely to the sensitivity estimation. The token sensitivity was computed according to (4)-(9). Note that the gradient-based methods require training a surrogate model.
However, the training process may introduce extra biases, making a fair comparison difficult. Therefore, we used the target model as the surrogate model, i.e., the gradient-based methods were conducted in a white-box setting. Consistent with prior studies [7], [9], [18], [19], [40], we also included a method that randomly assigned sensitivity scores to the words in the example and repeatedly perturbed the word with the highest score until the target model was fooled.

• RNN (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU)): The first layer is an embedding layer with an embedding matrix. We used the pre-trained 100-dimension GloVe word embeddings to transform the discrete word inputs into dense vectors. The second layer is the RNN layer (LSTM or GRU) with 128 hidden neurons. The final layer is a fully connected layer for classification.
• CNN: The first layer is the same as in the RNN-based classifiers, mapping discrete textual examples to vectors. In the second layer, filters of different sizes are used for the convolution operation. Each filter has an adjustable width and a fixed length equal to the embedding dimension. We used three filter widths (3, 4, and 5), and the channel number for each filter is 100. We then applied max-pooling to the feature vectors obtained by the convolution operation and used a fully connected layer for classification (a PyTorch sketch of this architecture appears after this list).
• BERT: We used the pre-trained BERT-base-uncased version, a 12-layer BERT with 768 hidden units and 12 heads. We added a fully connected layer for classification and then fine-tuned BERT on our datasets.

We adopted the Adam optimizer [41] to optimize the parameters. The learning rates were 5e-3 for the RNN and CNN and 5e-5 for BERT. All models were trained with a holdout strategy on the original training data (i.e., 80% for training and 20% for validation) and tested on the testing data of each dataset. Each model was implemented with PyTorch. The experiments were conducted on a server with an Intel Xeon Gold 6226R CPU @ 2.90 GHz and two NVIDIA GeForce RTX 3090 GPUs with 24 GB GDDR6X.
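For concreteness, a PyTorch sketch of the CNN classifier described above is given below (filter widths 3, 4, and 5 with 100 channels each, max-pooling, and a fully connected output layer); the class name, argument defaults, and the absence of dropout are assumptions rather than the exact training configuration:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Sketch of the attacked CNN classifier: embedding -> parallel convolutions
    with widths 3/4/5 (100 channels each) -> max-pooling -> fully connected layer."""
    def __init__(self, vocab_size, num_classes, embed_dim=100, widths=(3, 4, 5), channels=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # GloVe weights can be loaded here
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, kernel_size=w) for w in widths
        )
        self.fc = nn.Linear(channels * len(widths), num_classes)

    def forward(self, token_ids):                              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)          # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))               # class logits
```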

TABLE I DATASET SUMMARY FOR CLASSIFICATION TASK
4) Evaluation Metrics: We crafted adversarial examples for the examples in the test set to verify the effectiveness of the proposed XATA. We adopted three evaluation metrics commonly used in previous adversarial attack studies; each was averaged over the crafted examples [9], [18], [37].
• Success Rate@N is calculated as the success rate of adversarial examples when up to N% of the tokens in the legitimate examples are allowed to be perturbed. This metric evaluates the effectiveness of an attack method given a perturbation upper bound (i.e., N%).
• Perturbation Rate is calculated by counting the minimum number of words perturbed in the legitimate example when the target model is attacked successfully. Since a lower perturbation rate implies higher imperceptibility, this metric reflects the quality and threat level of the crafted adversarial examples.
• Perturbation Impact@N is calculated as the deviation between the predicted probability of the adversarial example and that of the legitimate example, where up to N% of the tokens in the adversarial example are allowed to be perturbed. This metric evaluates the impact of adversarial examples on the target model. (A sketch of how these three metrics can be computed appears after this list.)
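The three metrics can be computed along the following lines; the per-example record fields (success, perturbed, length, prob_shift) are assumed bookkeeping, not structures defined in the paper:

```python
def success_rate_at_n(results, n_percent):
    """Fraction of examples fooled while perturbing at most n_percent of their tokens."""
    ok = [r for r in results
          if r["success"] and r["perturbed"] / r["length"] <= n_percent / 100]
    return len(ok) / len(results)

def perturbation_rate(results):
    """Average fraction of perturbed tokens; failed attacks count the whole example."""
    rates = [(r["perturbed"] if r["success"] else r["length"]) / r["length"] for r in results]
    return sum(rates) / len(rates)

def perturbation_impact_at_n(results, n_percent):
    """Average |F_y(adversarial) - F_y(legitimate)| within the perturbation budget."""
    shifts = [r["prob_shift"] for r in results
              if r["perturbed"] / r["length"] <= n_percent / 100]
    return sum(shifts) / len(shifts) if shifts else 0.0
```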

5) Adversarial Text Examples: Three randomly selected adversarial text examples produced by XATA-SHAP for BERT appear in Table II. BERT made correct decisions for all legitimate examples with a predicted probability of more than 90% in all cases. However, all decisions were altered by the adversarial examples. In the IMDB dataset, the critical review was correctly classified as negative with a probability of 99.9%. However, an incorrect decision was made after perturbing the three most sensitive words, i.e., changing "AVOID" to "AV0ID," "fails" to "fɑils," and "FLAT" to "F1AT." Such perturbations were visually similar to the original characters, making them human-imperceptible. For the example in the Yelp dataset, the adversarial example was crafted by perturbing only the most sensitive word, "quality," to "quɑlity." Similarly, after perturbing the two most sensitive words "Painful" and "how" to "Painfu1" and "h0w," the adversarial example misled the BERT classifier into an incorrect prediction. These examples demonstrate the effectiveness of the adversarial examples crafted by the proposed XATA.
6) Comparison Results: a) Success Rate@N: The higher the success rate, the stronger the attack method. The results of the different methods are summarized in Fig. 4. We set the perturbation upper bound to 2%-10%. Generally, both XATA-SHAP and XATA-LIME outperformed the baselines. As the same perturbation strategy was adopted for all methods, this indicates that the sensitivity obtained by SHAP or LIME enabled more effective attacks.

TABLE II ADVERSARIAL EXAMPLES FOR CLASSIFICATION CRAFTED BY XATA-SHAP
Fig. 4. Success rate results of different methods on three datasets; each row shows the results on one dataset. The IMDB dataset (a1-a4), Yelp dataset (b1-b4), and Amazon dataset (c1-c4) follow from the top; the ordinate of each sub-figure is the success rate.
Note that although gradient-based methods like TextBugger and Gradient*Input were conducted in a white-box setting, our methods were still more effective. As the perturbation upper bound increases, the success rate of XATA increases faster than that of the baselines. For example, in Fig. 4(a-1), with an upper bound of 2%, XATA-SHAP achieved a success rate of 52% for an LSTM classifier on IMDB, while the best-performing baseline, PWWS, achieved 47.8%. When the upper bound expanded to 10%, both XATA-SHAP (89.0%) and XATA-LIME (78.8%) were much higher than PWWS (70%). The advantage of XATA became more salient as more words were perturbed. This is likely because our method avoids the issue of overlapping token sensitivities that deletion-based methods face. Alternatively stated, the additive feature attribution design resulted in a more impactful set of words to be perturbed.
b) Perturbation Rate: To further compare XATA with the baselines, we compared the number of words each attack method had to perturb to fool the classifier successfully. For each example, if the attack was successful, the number of perturbed words was recorded; if the attack failed, the number of words in the whole example was recorded. The perturbation rates of the different methods on the three datasets are summarized in Fig. 5. We found that XATA attained lower perturbation rates than the baselines, indicating superior performance. For example, to fool an LSTM classifier on IMDB (Fig. 5(a)), XATA-SHAP only needed to perturb 4.871% of the words, and XATA-LIME needed to perturb 7.628%. Their required perturbation rates were significantly lower than that of the best-performing baseline (DeepWordBug), whose perturbation rate was 12.906%. The superiority of XATA is also evident for BERT (Fig. 5). For example, the perturbation rate of XATA-SHAP was 15.375% to fool BERT on Amazon in Fig. 5(c), and that of XATA-LIME was 19.752%, about half of the perturbation rate required by DeepWordBug. This indicates that the perturbations suggested by the proposed XATA were more effective in attacking target DL classifiers.

c) Perturbation Impact@N: For a more fine-grained comparison, we calculated the perturbation impact of the different methods under different perturbation upper bounds. A higher perturbation impact means a more significant effect on the prediction results of the target model. As shown in Fig. 6, XATA-SHAP generally achieved the maximum perturbation impact, and XATA-LIME was the second largest. Therefore, our methods have a more significant effect on the target model than the baselines. For example, the perturbation impact of XATA-SHAP for BERT on the IMDB dataset was 0.627 in Fig. 6(a-4) with a perturbation upper bound of 10%; XATA-LIME's impact was 0.608, and that of the best-performing baseline (PWWS) was 0.498 in the same case. Similar to the Success Rate@N comparison above, the perturbation impact of XATA (SHAP and LIME) increased faster than that of the baseline methods. For example, PWWS achieved the highest perturbation impact with an upper bound of 2% (0.312) for BERT on the IMDB dataset in Fig. 6(a-4). However, XATA-SHAP and XATA-LIME overtook PWWS with an upper bound of 4%. When the upper bound was set to 10%, the perturbation impact of XATA-SHAP (0.627) and XATA-LIME (0.608) significantly outperformed PWWS (0.498).
The above observations further indicate that the overlapping effect hinders the word sensitivity obtained through deletion-based methods such as PWWS. Though the white-box setting benefited the gradient-based methods (TextBugger and Gradient*Input), our methods still attained stronger performance. To summarize, by comparing perturbation impacts, we found that XATA had a more significant effect on the target model than the benchmark methods and was thus more threatening to DL security.
7) Case Study: We further examined the advantage of XATA over gradient-based and deletion-based methods by detailing the attack procedures with examples. We selected PWWS and Gradient*Input to represent deletion-based and gradient-based methods, respectively, as they performed better than the other methods in the same category. Fig. 7 illustrates how the probability of the positive class was affected by sequentially perturbing the sensitive tokens identified by each attack method for a data instance from the Yelp dataset. The example was predicted as positive by the LSTM with a probability of 0.922. The top-5 sensitive words to perturb suggested by the different attack methods are displayed in red and in descending order.
For Gradient*Input, the probability of being positive increased from 0.586 to 0.873 when the fourth most sensitive word, "no," was perturbed. This canceled out the effects of the initial perturbations of "awesome," "yes," and "with." Consequently, the example was still positive (i.e., probability > 0.5) when all five words were perturbed. PWWS suggested that the two most sensitive words were "awesome" and "login." However, perturbing them was less effective than perturbing "awesome" and "and," the latter being the fifth most sensitive word. This might be because the sensitivity of "awesome" overlapped more with "login" than with "and," and hence perturbing "login" next was not as effective as perturbing "and."

B. XATA on Text Regression
1) Datasets: In text regression, the experiments were carried out on three datasets: My Personality [42], Drug Review, and CommonLit Readability. The first dataset is widely used to predict the Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism. It is based on user posts on social media platforms such as Facebook and microblogs. There are five output values for each post, one per personality trait; each trait is represented by a continuous value ranging from 0 to 5, and a higher value indicates that the trait is more prominent. The Drug Review dataset is obtained from online pharmaceutical review sites. It includes text reviews and corresponding satisfaction scores ranging from 1 to 10. The CommonLit Readability dataset consists of text passages and corresponding reading level scores ranging from −3.68 to 1.71; the higher the score, the more challenging the passage is to comprehend. The statistics of these datasets are summarized in Table III.

Fig. 6. The perturbation impact results of XATA and baselines on three datasets. The IMDB (a1-a4), Yelp (b1-b4), and Amazon (c1-c4) datasets follow from the top; the ordinate of each sub-figure is Perturbation Impact. A higher Perturbation Impact indicates a stronger attack method.

2) DL-Based Regression Model to Attack:
The state-of-the-art BERT-based Linear Regression (BLR) model was selected as the attacked model. Specifically, the pre-trained BERT-base-uncased model was used to obtain the latent 768-dimension feature from the pooled output of the model. Then, a multiple linear regression with five output dimensions was used for personality prediction on the My Personality dataset, and a linear regression with one output dimension was used for rating prediction on the Drug Review dataset and reading level prediction on the CommonLit Readability dataset. The experimental platform and attack methods were the same as those in Section IV.A.2.
3) Evaluation Metrics: Consistent with previous regression studies [24], [42], we used mean square error (MSE) and mean absolute error (MAE) as metrics to evaluate model performance. We compared the MSE and MAE before and after the attack. Both were calculated by comparing the model's prediction on the adversarial example, F_y(w'), with the original prediction F_y(w), averaged over the examples in the test set. We perturbed N% of the words to craft adversarial examples, and the corresponding MSE and MAE are denoted as MSE@N and MAE@N, respectively. The higher the values, the more effective the attack method.
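A small sketch of how MSE@N and MAE@N can be computed from paired original and post-attack predictions follows; the function and variable names are illustrative:

```python
def regression_attack_metrics(original_preds, adversarial_preds):
    """MSE@N and MAE@N between the model's predictions on legitimate examples
    and on adversarial examples crafted with up to N% of tokens perturbed."""
    diffs = [a - o for o, a in zip(original_preds, adversarial_preds)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae
```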

4) Adversarial Text Examples:
We randomly selected several adversarial examples and present them in Table IV. Though the original predictions were close to the ground truths, the adversarial examples led to deviated predictions. For instance, the original prediction for the legitimate example in the My Personality dataset was 3.78 in conscientiousness, but it dramatically changed to 2.90 when three words ("business," "beautiful," and "business") were perturbed. For the example from the CommonLit Readability dataset, the absolute error increased from 0.0299 (|−0.5036 + 0.4737|) to 0.1714 (|−0.6451 + 0.4737|) when one comma and two periods were replaced with the unusual punctuation mark "¸".

5) Comparison Results:
The comparison results are shown in Figs. 8 and 9. Consistent with the comparison in the classification task, the upper bound of the perturbation rate was set from 2% to 10%. The MSE@N and MAE@N are reported separately for each dimension; on the My Personality dataset, they were computed for each personality trait.
The superiority of XATA is evident from two key observations. First, XATA-SHAP and XATA-LIME outperformed the other attack methods in each personality prediction task, in satisfaction prediction, and in reading level prediction. For example, in extraversion prediction on the My Personality dataset, the MSE@10 and MAE@10 under the attack of XATA-SHAP were 0.986 and 0.831, while those of the best-performing baseline (DeepWordBug) were only around 0.795 and 0.739, respectively.
Second, the advantages of XATA-SHAP and XATA-LIME became more significant when a greater number of words could be perturbed; this was the case, for example, in satisfaction prediction on the Drug Review dataset (Fig. 9).

Fig. 8. The MSE@N and MAE@N of XATA and baselines on the My Personality dataset; the perturbation upper bound ranges from 2% to 10%, and each personality trait is shown in a single sub-figure. The observations are two-fold: 1) the proposed XATA-SHAP (red line) and XATA-LIME (blue line) outperformed the other baselines, and 2) XATA-SHAP outperformed XATA-LIME.

Fig. 9. MSE@N and MAE@N on the Drug Review (a1-a2) and the CommonLit Readability (b1-b2) datasets. The observations are similar to Fig. 8.
We also found that XATA-SHAP outperformed XATA-LIME in yielding more effective adversarial attacks on text regression. For instance, in reading level prediction, the MSE@10 and MAE@10 were 1.174 and 0.929 for XATA-SHAP, significantly higher than XATA-LIME's MSE@10 of 1.115 and MAE@10 of 0.908, meaning that XATA-SHAP misled the DL-based regression model to more deviated predictions than XATA-LIME.
6) Case Study: Fig. 10 illustrates how the predicted value varies when the sensitive tokens identified by each attack method are sequentially perturbed, for a data instance from the Drug Review dataset. The main observations were similar to those in the classification task. Gradient*Input failed to consider the direction of sensitivity; hence, the prediction decreased when "was" and "post" were perturbed. For PWWS, the two most sensitive words were "not" and "repeated," but their sensitivities overlapped more than those of "not" and "reaction." Thus, perturbing "not" and "reaction" had a more significant impact. These observations further demonstrate the superiority of our methods.

V. DISCUSSION

A. Application in White-Box Attack Scenarios
Though we only demonstrated how LIME- or SHAP-based adversarial text attacks can be conducted in black-box scenarios, they can also operate in white-box attacks, since access to model details does not affect the process or the results of the explainable methods. Interestingly, we found that the proposed XATA in fact outperformed white-box attack baselines like TextBugger and Gradient*Input without using any model details. This is understandable because explainable methods were originally proposed to give developers insights into trained DL models. Even developers who are fully aware of a model's details still need explainable methods to understand the trained model's rationale (e.g., token sensitivity). Additive feature attribution explainable methods can provide attackers with a more accurate sensitivity score for each token without relying on model details.

B. Contradiction Between Explanation and Adversarial Robustness
The experiments showed that XATA-SHAP outperformed XATA-LIME. The results indicate that more advanced explainable methods (e.g., SHAP over LIME) can lead to more effective adversarial attacks. While more effective adversarial attacks can assess model robustness and security with greater accuracy, attackers can also leverage such methods to pose threats to the target models.
Increasingly, researchers are providing numerous ways to address the lack of explanation that DL models often display. Local post-hoc methods are one of the streams that have attracted the most effort, and additive feature attribution explainable methods are a prevailing type of local post-hoc method. Improvements in explanation can provide attackers with approaches to estimate token sensitivity more accurately and effectively, thus enabling them to craft more targeted and threatening adversarial examples. As a result, such efforts will increase the risk of a model being attacked. The improvement in model explanation seems to lead to a decrease in adversarial robustness. Hence, there exists a contradiction (trade-off) between DL's explanation and adversarial robustness. This phenomenon requires additional attention and consideration.

VI. CONCLUSION
DL models have achieved tremendous success in text classification and text regression. However, they are also strikingly vulnerable to adversarial attacks. In this paper, we proposed a novel adversarial text attack method, XATA. The proposed XATA operates by (1) replacing the conventional gradient-based and deletion-based approaches for sensitivity estimation with additive feature attribution local post-hoc explainable methods (LIME and SHAP) to measure each word's sensitivity and (2) perturbing the words according to the sensitivity scores provided by the additive feature attribution explainable methods. Through a series of experiments, we demonstrated XATA's advantages over state-of-the-art attack baselines on DL models executing text classification or text regression. We discovered that more advanced additive feature attribution explainable methods (e.g., SHAP) could enable more effective adversarial attacks. These results suggest a contradiction (trade-off): models with more substantial explanatory power could lead to more impactful adversarial attacks.
There are several promising directions for future research. First, future studies can analyze other explainable methods for adversarial attacks beyond additive feature attribution explainable methods. Second, we adopted only the visually similar character replacement perturbation strategy; different word- or character-level perturbation strategies (e.g., insertion, flipping, removal) could be introduced. Third, this study focused on adversarial text attacks, but future studies can apply the methods to other domains, such as computer vision or speech recognition, to test their generalizability. Fourth, future studies could use the proposed XATA to generate adversarial examples that could be fed into downstream tasks such as assessing model robustness and augmenting datasets.


Fig. 1. An example adversarial attack on a DL-based toxic content text classifier. The adversarial example evades the classifier when an attacker swaps a few characters in the originally detected content [9].


Fig. 2. Connections between local post-hoc methods and black-box adversarial text attacks.



Fig. 5. Perturbation rate results of different methods on three datasets; the lower the Perturbation Rate, the stronger the attack method.

Fig. 7. A case study example showing how the predicted probability was affected by sequentially perturbing the sensitive tokens identified by gradient-based and deletion-based methods compared to our proposed XATA attack method for text classification.

Fig. 10. A case study example showing how the predicted value was affected by sequentially perturbing the sensitive tokens identified by gradient-based and deletion-based methods compared to our proposed XATA attack method for text regression.

TABLE III DATASET SUMMARY FOR THE TEXT REGRESSION TASK

TABLE IV ADVERSARIAL EXAMPLES FOR REGRESSION CRAFTED BY XATA-SHAP