Local Post-hoc Explainable Methods for Adversarial Text Attacks

Deep learning models have significantly advanced various natural
language processing tasks. However, they are strikingly vulnerable
to adversarial text attacks, even in the black-box setting where no model
knowledge is accessible to hackers. Such attacks are conducted with a two-phase
framework: 1) a sensitivity estimation phase to evaluate each element’s
sensitivity to the target model’s prediction, and 2) a
perturbation execution phase to craft the adversarial examples based on estimated
element sensitivity. This study explored the connections between the local
post-hoc explainable methods for deep learning and black-box adversarial text
attacks and proposed a novel eXplanation-based method for crafting
Adversarial Text Attacks (XATA). XATA leverages local post-hoc explainable
methods (e.g., LIME or SHAP) to measure input elements’ sensitivity and adopts the word replacement perturbation strategy to
craft adversarial examples. We evaluated the attack performance of the proposed
XATA on three commonly used text-based datasets: IMDB Movie Review, Yelp Reviews-Polarity,
and Amazon Reviews-Polarity. The proposed XATA outperformed existing baselines in
various target models, including LSTM, GRU, CNN, and BERT. Moreover, we found
that improved local post-hoc explainable methods (e.g., SHAP) lead to more
effective adversarial attacks. These findings showed that when researchers constantly
advance the explainability of deep learning models with local post-hoc
methods, they also provide hackers with weapons to craft more targeted and dangerous adversarial attacks.


INTRODUCTION
Deep learning (DL) models have achieved tremendous success in various natural language processing (NLP) tasks like sentiment analysis [1], misinformation detection [2], and question answering [3]. They regularly achieve new benchmark scores or even outperform human experts. These incredible achievements motivate researchers and practitioners to deploy deep learning models in the real world.
However, like in other domains (e.g., computer vision), deep learning models are strikingly vulnerable to adversarial attacks in text applications [4]-[6]. These attacks trick the models into producing attacker-preferred outcomes by manipulating data examples with human-imperceptible perturbations. The original data example is called a legitimate example, while the example crafted from manipulation is called an adversarial example [7]. Li et al. [7] found that though a DNN-based classifier can achieve a 92% accuracy in detecting toxic content, its accuracy drops to only 22% when faced with adversarial toxic content examples (Fig. 1). Therefore, adversarial attacks have received increasing attention from researchers as they can help assess model robustness and security against potential attacks in the real world [8]-[10]. (Fig. 1: an example of an adversarial text attack; when a hacker swaps a few characters in the original legitimate example, previously detected toxic content fools the well-performing toxic content detector [7].)
Adversarial text attacks can be categorized into two types based on how much a hacker can access the target model's details (e.g., structure and parameters) [6], [11]. The first type is white-box attacks, where hackers know the model details. The adversarial examples crafted in the white-box scenario demonstrate the worst-case attacks against a DL model [4]. However, it is often unrealistic to assume that the model knowledge is accessible to hackers. Therefore, black-box attacks have attracted significant attention. Black-box attacks assume hackers have no access to model details and can only query the target model to infer model information [12].
While adversarial text attacks are implemented differently (e.g., DeepWordBug [5], TextFooler [13], and PWWS [14]), the attacks can be generally characterized by a two-phase framework. The first phase measures the sensitivity
of the prediction change to each input token (e.g., word or character), while the second phase crafts effective perturbed adversarial examples based on token sensitivity. Understandably, the effectiveness of adversarial examples largely depends on the accuracy of the sensitivity estimation. Identifying sensitive tokens accurately is the foundation for a successful perturbation execution in the second phase [13].
Existing methods for sensitive token estimation include gradient-based and deletion-based methods. The former type uses the gradients of the model prediction with respect to the tokens to compute sensitivity. These methods usually fail to highlight the tokens that negatively contribute to the model prediction [15], leading to a compromised sensitivity estimation. The latter type deletes a token from the example and computes token sensitivity based on the difference in model outcome before and after deleting the token. However, this type of method fails to consider overlapping effects between tokens. Consequently, the set of identified sensitive tokens that will be perturbed in Phase 2 may not be optimal. Token sensitivity can also be computed by explainable DL methods [8], [16]. Explainable DL is an emerging branch of machine learning that aims to make DL models' decision processes accessible. Notably, local post-hoc explainable methods reveal each token's role (i.e., sensitivity) in a model's predicted outcomes [17]-[20]. The prevailing local post-hoc explainable methods include LIME (Local Interpretable Model-agnostic Explanations) [17] and SHAP (SHapley Additive exPlanations) [18]. LIME explains the target DL model by training an explainable model (e.g., a linear model) as a local approximation to discover tokens' importance (i.e., sensitivity) for interpretation. SHAP computes Shapley values, with insights from cooperative games, to show each token's contribution (i.e., sensitivity). Hence, local post-hoc explainable methods like LIME and SHAP share the same goal with the sensitivity estimation task in adversarial attacks. Moreover, these local post-hoc explainable methods are model-agnostic (i.e., the explained model can be of any type, e.g., LSTM, GRU, CNN, or BERT). Thus, they can be used as weapons to craft adversarial examples in black-box attacks.
In light of this, this study proposes an eXplanation-based method for crafting Adversarial Text Attacks (XATA). Specifically, we use local post-hoc explainable methods (i.e., LIME and SHAP) to measure the sensitivity of each word token to the target model prediction. We perturb the tokens according to the sensitivity scores provided by the explanation methods. We adopt the commonly used visually similar character replacement perturbation strategy proposed by [5] in the second phase to compare with other attack methods. For simplicity, we name the LIME-based and SHAP-based attack methods XATA-LIME and XATA-SHAP, respectively. Compared with the gradient-based techniques, the proposed XATA can highlight tokens that contribute negatively. Meanwhile, the sensitivities of all tokens are simultaneously determined with one model (i.e., the explanation model), thus addressing the overlapping effects confronted by the deletion-based methods.
We performed experiments on three widely used text-based datasets: IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity. The attacked models include LSTM [21], GRU [22], CNN [23], and BERT [24]. We found that XATA-LIME and XATA-SHAP craft more effective adversarial examples than baseline attack methods. On top of that, XATA-SHAP is more effective than XATA-LIME. As an extension of LIME, SHAP is more accurate in explanation [18]. Hence, the advantage of XATA-SHAP over XATA-LIME can be attributed to a more accurate sensitivity estimation for each word token. This implies that an improved explainable method can lead to more threatening adversarial attacks. These findings confirm a contradiction between explanation and adversarial robustness, and we briefly discuss this phenomenon in the Discussion section of this paper.
The contribution of this study is three-fold:
• First, we propose a new method to craft adversarial text examples for black-box attacks. The proposed attack method outperforms existing baselines against various target models on multiple datasets. Hence, it can be used to better assess the security and vulnerability of deep learning models.
• Second, this study reveals the connections between black-box adversarial attack methods and local post-hoc explainable DL methods. Both streams aim to estimate token sensitivity, operate on the individual example level, and are model-agnostic.
• Last but not least, we empirically demonstrate the contradiction (trade-off) between explainability and adversarial robustness in DL models. When researchers constantly advance the explainability of DL models, they also provide hackers with tools to craft more targeted and effective adversarial attacks. This phenomenon necessitates attention from our community.
The remainder of the paper is organized as follows. Section 2 reviews the related work on adversarial text attacks. Section 3 discusses the motivations and elaborates on the details of the proposed attack method. The following section describes the evaluation, including the experiment design and results. We discuss and conclude the paper in Sections 5 and 6, respectively.

RELATED WORK ON ADVERSARIAL TEXT ATTACKS
Adversarial attacks were discovered initially in the computer vision domain. Goodfellow et al. [4] found that small human-imperceptible perturbations to natural images will cause DL classifiers to miscategorize a panda image as a gibbon. Adversarial attacks have also become a hot research area in text-based applications such as misinformation detection [7], [25], [26] and sentiment analysis [27]-[29]. Adversarial attacks can be divided into white-box attacks and black-box ones [30]. White-box attacks are performed when model details like architectures and parameters are accessible to hackers [31]. In contrast, hackers conducting black-box attacks are unaware of model details. Still, they can infer the target model information by querying it [12]. Many studies have created different black-box attack methods, as black-box settings are pervasive and more realistic [32], [33].
Mathematically, given a legitimate example consisting of a sequence of tokens $x = [x_1, \ldots, x_i, \ldots, x_n]$, hackers can access the model's prediction, denoted as $y$, i.e., $\mathcal{F}(x) = y$. Hackers can also access the predicted probability for class $y$, i.e., $\mathcal{F}_y(x)$. A black-box adversarial text attack aims to craft an adversarial example $x^A$ that misleads the model to a preferred prediction $\hat{y}$, different from the initial prediction $y$. Due to the discrete nature of text data, text examples are restricted to a space $\mathcal{X}$. The crafted example $x^A$ should be similar to the original example $x$. We use $S(x, x')$ to denote a domain-specific similarity function $S: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$. The required minimum similarity between $x^A$ and $x$ is $\epsilon$. The attack task can then be formally denoted as:

$$\text{find } x^A \in \mathcal{X} \;\text{ such that }\; \mathcal{F}(x^A) = \hat{y}, \quad \text{s.t. } S(x, x^A) \geq \epsilon \quad (1)$$

Since this optimization is hard to solve, past studies adopted a two-phase solution to craft adversarial examples heuristically (Fig. 2). The first phase is sensitivity estimation, which measures the sensitivity of the prediction change to each input token (e.g., word or character). According to the token sensitivity computed in the first phase, the second phase is perturbation execution. The perturbation of a highly sensitive token will bring about a more significant outcome change than that of a slightly sensitive one. Hence, the most sensitive tokens are perturbed to craft adversarial examples. Formally, $s_i$ denotes the sensitivity score of $x_i$ towards the predicted class $y$. Phase 1 aims to estimate the sensitivity score for each word token $x_i, \forall i$, with a mapping function $\mathcal{M}$, as shown in Equation (2):

$$s_i = \mathcal{M}(\mathcal{F}, x, y, x_i) \quad (2)$$
We use $s$ to denote the collection of $s_i, \forall i$, i.e., $s = [s_1, s_2, \ldots, s_n]$. Phase 2 aims to craft an adversarial example $x^A$ that can mislead the model. We denote the perturbation function as $\mathcal{P}(\cdot)$, and $x^A$ is crafted by Equation (3):

$$x^A = \mathcal{P}(\mathcal{F}, x, \hat{y}, s) \quad (3)$$
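The two-phase framework can be sketched end-to-end on a toy example. The model, vocabulary, deletion-based scorer, and homoglyph swap below are illustrative assumptions, not the paper's actual setup:

```python
# Sketch of the generic two-phase black-box attack loop (Eqs. 2-3).
# The toy target model and sensitivity function are illustrative assumptions.

def toy_model(tokens):
    """Toy 'positive'-probability model: fraction of known positive words."""
    positive = {"great", "good", "excellent"}
    return sum(t in positive for t in tokens) / max(len(tokens), 1)

def sensitivity(model, tokens):
    """Phase 1 (Eq. 2): score each token; here, by deletion for brevity."""
    base = model(tokens)
    return [base - model(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def attack(model, tokens, threshold=0.5):
    """Phase 2 (Eq. 3): perturb tokens in descending sensitivity order."""
    scores = sensitivity(model, tokens)
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    adv = list(tokens)
    for i in order:
        adv[i] = adv[i].replace("o", "0").replace("l", "1")  # homoglyph swap
        if (model(adv) >= threshold) != (model(tokens) >= threshold):
            return adv  # prediction flipped: adversarial example found
    return None  # attack failed within the perturbation budget

print(attack(toy_model, ["good", "good", "bad", "film"]))  # -> ['g00d', 'good', 'bad', 'film']
```

Perturbing the single most sensitive word ("good" → "g00d") already flips the toy prediction, mirroring how real attacks stop as soon as the target model is fooled.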
Identifying sensitive tokens in Equation (2) lays the foundation for executing effective perturbation in Equation (3). Two types of methods have been proposed to solve Equation (2): gradient-based and deletion-based.
Proposed by Papernot et al. [31], gradient-based methods compute the gradients of the model outcome with respect to the input tokens. As a larger gradient score means the token is more sensitive (and vice versa), the gradient score is used as the sensitivity score for the subsequent phase [7], [34], [35].
However, as hackers cannot access target model details, a surrogate model is trained to approximate the target, and token sensitivity is then estimated from the surrogate model [36]. Formally, let $\mathcal{F}'$ and $\mathcal{F}'_y$ denote the surrogate counterparts of $\mathcal{F}$ and $\mathcal{F}_y$, and let $e_i$ denote the embedding vector of $x_i$. The sensitivity score is calculated with Equation (4):

$$s_i = \left\| \nabla_{e_i} \mathcal{F}'_y(x) \right\|_2 \quad (4)$$

where $\nabla_{e_i}$ is the gradient with respect to $e_i$ and $\|\cdot\|_2$ is the $L_2$ norm.
As the actual token value indicates how strongly a token is expressed, Equation (4) is extended by multiplying the gradient with the token embedding [37], given by Equation (5):

$$s_i = \nabla_{e_i} \mathcal{F}'_y(x) \cdot e_i \quad (5)$$

Gradients can be obtained via backpropagation. However, as the widely used ReLU activation function zeroes out negative signals during backpropagation, this method usually fails to highlight tokens that negatively contribute to the outcome [15]. Consequently, the estimated score may not precisely measure token sensitivity.
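For intuition, Equations (4) and (5) can be evaluated in closed form on a toy linear surrogate, where the gradient with respect to every token embedding is simply the weight vector. The weights and embeddings below are assumptions for illustration only:

```python
import math

# For a linear surrogate F'(x) = sum_i w . e_i, the gradient w.r.t. each
# token embedding e_i is just w, so Eqs. (4) and (5) have closed forms.
# The weight vector and embeddings below are illustrative assumptions.

w = [0.5, -0.2]                                   # surrogate weight vector
embeddings = {"good": [1.0, 0.0], "bad": [-1.0, 0.5]}

def grad_sensitivity(token):
    """Eq. (4): L2 norm of the gradient w.r.t. the token embedding.
    For a linear model this is the same for every token (degenerate)."""
    return math.sqrt(sum(g * g for g in w))

def grad_x_input_sensitivity(token):
    """Eq. (5): gradient multiplied (dot product) with the token embedding."""
    return sum(g * e for g, e in zip(w, embeddings[token]))

print(grad_sensitivity("good"))                   # identical for every token
print(grad_x_input_sensitivity("good"))           # -> 0.5
print(grad_x_input_sensitivity("bad"))            # -> -0.6
```

The sketch shows why gradient*input is the richer signal: the pure gradient norm of Eq. (4) collapses to one value per linear model, while Eq. (5) differs per token and can be negative.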
Deletion-based methods delete a token from the example and query the target model with the new data example. The model outcome difference before and after deletion reflects the sensitivity of the token. Formally, the example after deleting token $x_i$ is denoted as $x_{\setminus i} = [x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$. DeepFool [6] computes the difference of the predicted probability for class $y$ to represent sensitivity, as shown in Equation (6):

$$s_i = \mathcal{F}_y(x) - \mathcal{F}_y(x_{\setminus i}) \quad (6)$$

DeepWordBug [5] extends the computation by considering the sequentiality of the input. For each token $x_i$, the Temporal Head Score (THS) is defined as the outcome difference between the two heading parts of the example ending at and just before $x_i$, while the Temporal Tail Score (TTS) quantifies the difference between the two tailing parts starting at and just after $x_i$. The weight of TTS is controlled by a hyperparameter $\lambda$. The calculation is given by Equation (7):

$$s_i = \big[\mathcal{F}_y([x_1, \ldots, x_i]) - \mathcal{F}_y([x_1, \ldots, x_{i-1}])\big] + \lambda \big[\mathcal{F}_y([x_i, \ldots, x_n]) - \mathcal{F}_y([x_{i+1}, \ldots, x_n])\big] \quad (7)$$

PWWS [14] combines token saliency and predicted probability. Token saliency $\sigma_i$ is defined the same as in Equation (6), except that $x_i$ is replaced by an unknown token instead of being deleted. $\Delta\mathcal{F}_y^*(x_i)$ represents the maximum change of probability after $x_i$ is replaced with different strategies; the sensitivity is calculated by Equation (8):

$$s_i = \operatorname{softmax}(\sigma_1, \ldots, \sigma_n)_i \cdot \Delta\mathcal{F}_y^*(x_i) \quad (8)$$

TextFooler [13] further extends the computation by considering the predicted class, given by Equation (9):

$$s_i = \big[\mathcal{F}_y(x) - \mathcal{F}_y(x_{\setminus i})\big] + \big[\mathcal{F}_{\hat{y}}(x_{\setminus i}) - \mathcal{F}_{\hat{y}}(x)\big] \quad (9)$$

if $\mathcal{F}(x) = y$, $\mathcal{F}(x_{\setminus i}) = \hat{y}$, and $y \neq \hat{y}$; otherwise $s_i = \mathcal{F}_y(x) - \mathcal{F}_y(x_{\setminus i})$.
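A minimal sketch of the deletion-based scores of Equations (6) and (9), using an assumed toy two-class probability model rather than a trained classifier:

```python
# Deletion-based sensitivity scores (Eqs. 6 and 9) on a toy model.
# The toy probability function is an illustrative assumption.

def toy_prob(tokens, cls):
    """Toy two-class model: P(pos) grows with the count of positive words."""
    positive = {"great", "good"}
    p_pos = (1 + sum(t in positive for t in tokens)) / (2 + len(tokens))
    return p_pos if cls == "pos" else 1.0 - p_pos

def delete(tokens, i):
    return tokens[:i] + tokens[i + 1:]

def deepfool_score(tokens, i, y="pos"):
    """Eq. (6): probability drop for class y after deleting token i."""
    return toy_prob(tokens, y) - toy_prob(delete(tokens, i), y)

def textfooler_score(tokens, i, y="pos", y_hat="neg"):
    """Eq. (9): adds the gain of the new class when the prediction flips."""
    base = deepfool_score(tokens, i, y)
    reduced = delete(tokens, i)
    if toy_prob(reduced, y_hat) > toy_prob(reduced, y):   # prediction flipped
        base += toy_prob(reduced, y_hat) - toy_prob(tokens, y_hat)
    return base

tokens = ["good", "plot", "bad", "acting"]
print([round(deepfool_score(tokens, i), 3) for i in range(len(tokens))])
```

Note that each call scores one token in isolation, which is exactly the independence that causes the overlapping-effects problem discussed next.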
While valuable, deletion-based methods compute the sensitivity score of only one token at a time. However, multiple tokens may need perturbation to craft an adversarial example. Although one can perturb the tokens sequentially based on their sensitivity scores, this may not be effective due to the overlapping effects between token sensitivities. For instance, assume the scores of three tokens $x_1$, $x_2$, $x_3$ are 0.2, 0.15, and 0.1, respectively. The total effect of perturbing $x_1$ and $x_2$ may not be more significant than that of perturbing $x_2$ and $x_3$, because the effects of $x_1$ may overlap with those of $x_2$, while the effects of $x_2$ and $x_3$ are less overlapped. This problem occurs because the independent estimation fails to consider the joint effects between tokens. One solution is to estimate the sensitivity of all tokens simultaneously with one model.

EXPLANATION-BASED ADVERSARIAL ATTACKS
We first review explainable DL studies to explain why they can be leveraged to craft adversarial text examples. Then, we describe the procedure for crafting adversarial text attacks with explainable methods. For simplicity, we name this type of attack method eXplanation-based Adversarial Text Attacks (XATA).

Insights from Explainable DL Studies
DL models are criticized for lacking interpretability [38], [39]. This lack prevents developers from making informed improvements to the models, reduces people's trust in them, and ultimately hinders DL development. Hence, many studies have been conducted to improve DL explanations, and this stream of research is called explainable DL [16], [40], [41].
Explainable DL methods can be broadly grouped into two categories depending on the scope of the explanation: global and local explainable methods [8]. Global explainable methods enable people to inspect and visualize the model structures and parameters. In contrast, local explainable methods focus on the prediction rationale for an individual example. They try to figure out the role of each token in the example. The role of each token is represented by a score that can be used to reflect sensitivity. The mathematical description is the same as Equation (2). Hence, local explainable methods share the same task with Phase 1 in adversarial attacks. From Equation (3), the only information required from Phase 1 is the token sensitivity $s_i, \forall i = 1, \ldots, n$. Hence, the outcome of the local explanation completely satisfies the requirement of Phase 2. Consequently, local explainable methods can be leveraged for adversarial attacks.
Based on when the explanation is obtained, local explainable methods can be further divided into intrinsic and post-hoc methods [40]. Compared with intrinsic methods, which design self-explainable models to offer explanations, post-hoc methods introduce an explanation model (e.g., linear regression) as a second model to locally approximate the target model. The post-hoc methods require no access to the model knowledge for an explanation. Thus, local post-hoc methods apply to any DL model, satisfying the model-agnostic requirement of black-box attacks (i.e., the attacks can be successfully conducted against any type of target model). Hence, local post-hoc methods are suitable for black-box attacks. Fig. 3 shows the connections between local post-hoc methods and black-box adversarial attacks.

Proposed Adversarial Text Attack Method
Consistent with prior studies, the proposed explanation-based adversarial text attack method also has two phases. Phase 1 is sensitivity estimation, which leverages post-hoc explainable techniques to compute the sensitivity of each input token. Phase 2 is perturbation execution, which conducts a perturbation according to the token sensitivities from the post-hoc explainable methods. The details of the method are shown in Fig. 4. Consistent with prior black-box attack settings [8], [13], [14], [28], [42], the hacker can query the target model with a text example consisting of a sequence of tokens $x = [x_1, x_2, \ldots, x_n]$. The model will return the predicted class $y$, i.e., $\mathcal{F}(x) = y$, as well as the predicted probability for class $y$, $\mathcal{F}_y(x)$.

Phase 1: Sensitivity Estimation
Phase 1 aims to compute the sensitivity $s_i$ for each token $x_i, \forall i$, via an explanation model. Hence, the explanation model acts as the surrogate model in the black-box attack. As one type of explanation model, additive feature attribution models define the explanation model as a linear model and assume the contributions of the tokens are additive. We use additive feature attribution models as the explanation model because they can explain complex DL models well, as shown in [18]. We use $\tilde{x}$ to denote a perturbed example of $x$, e.g., $\tilde{x} = [x_1, x_2, x_4, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$. Consistent with prior notations, we use $\mathcal{F}'(\tilde{x})$ and $\mathcal{F}'_y(\tilde{x})$ to denote the predicted class and the predicted probability for class $y$ of the explanation (surrogate) model, and $s = [s_1, s_2, \ldots, s_n]$. Then,

$$\mathcal{F}'_y(\tilde{x}) = s_0 + \sum_{i=1}^{n} s_i \, z_i(x, \tilde{x}) \quad (10)$$

where $s_0$ is the bias of the linear model and $z_i(x, \tilde{x})$ indicates the existence of token $x_i$ in $\tilde{x}$, given by

$$z_i(x, \tilde{x}) = \begin{cases} 1, & x_i \in \tilde{x} \\ 0, & x_i \notin \tilde{x} \end{cases} \quad (11)$$

In order to obtain a reliable explanation model with local fidelity, optimization is involved in computing the model parameters, which include $s_0$ and $s_i, \forall i$. The sensitivity scores are determined simultaneously, and the sensitivity determination of any token considers the sensitivities of all other tokens. Hence, the sensitivity computed by our method can better reflect the impact of each token than the deletion-based approach. Moreover, the sensitivity is additive, meaning the total impact of a set of tokens equals the sum of the sensitivity scores of the individual tokens. This alleviates the overlapping effects confronted by the deletion-based methods. Meanwhile, our method also accounts for the fact that a token may negatively contribute to the outcome, because the sensitivity score can be negative. This enables our method to provide a more accurate estimation than the gradient-based methods.
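Equations (10)-(11) can be sketched directly: with assumed sensitivity scores, the surrogate's prediction for a perturbed example is the bias plus the scores of the surviving tokens. The tokens and scores below are illustrative assumptions:

```python
# Evaluating the additive explanation model of Eqs. (10)-(11).
# The token sensitivities below are illustrative assumptions.

def z(x, x_tilde):
    """Eq. (11): indicator vector of which tokens of x survive in x_tilde."""
    present = set(x_tilde)
    return [1 if tok in present else 0 for tok in x]

def surrogate(x, x_tilde, s0, s):
    """Eq. (10): additive feature attribution model."""
    return s0 + sum(si * zi for si, zi in zip(s, z(x, x_tilde)))

x = ["avoid", "this", "flat", "movie"]
s0, s = 0.1, [0.4, 0.0, 0.3, 0.05]                 # assumed sensitivities
print(surrogate(x, x, s0, s))                      # full example -> 0.85
print(surrogate(x, ["this", "movie"], s0, s))      # drop "avoid","flat" -> 0.15
```

Dropping "avoid" and "flat" lowers the surrogate output by exactly 0.4 + 0.3, illustrating the additivity property that lets a set of tokens be scored jointly without overlap.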
With $s$, we sort the tokens based on their sensitivity scores, given by Equation (12):

$$r = \mathcal{S}(s) \quad (12)$$

where $\mathcal{S}$ is the function that sorts the tokens in descending order of sensitivity and $r$ is the resulting index vector.

Phase 2: Perturbation Execution
As previously mentioned, the only information Phase 2 requires from Phase 1 is the token sensitivity scores. Hence, most perturbation strategies [5], [6], [13], [14], [34], [42] and common perturbations (e.g., insertion, removal, or replacement) from prior studies apply to our method. Furthermore, they can operate on either the word or character level, as the local post-hoc explainable methods can explain models on either level, depending on the perturbation granularity of $\tilde{x}$. We select the visually similar character replacement strategy proposed by [5], where characters in word tokens are replaced with other visually similar characters. For instance, "o" can be changed to "0," and "l" can be changed to "1" when crafting the adversarial examples. Such a perturbation strategy is human-imperceptible, thus retaining the same semantics as the legitimate examples. It is also widely adopted for crafting adversarial text examples in prior studies [7], [43]. The perturbation is conducted according to the sensitivity scores. We first generate the replacement for the most sensitive word $x_{r_1}$, given by Equation (13):

$$x'_{r_1} = \mathcal{P}(x_{r_1}) \quad (13)$$

where $\mathcal{P}$ is the perturbation function. Then, we replace $x_{r_1}$ with $x'_{r_1}$ to obtain the crafted example $x'$, as shown in Equation (14):

$$x' = [x_1, \ldots, x'_{r_1}, \ldots, x_n] \quad (14)$$

If the crafted example cannot mislead the target model, we repeat the same perturbation for the second most sensitive word $x_{r_2}$, the third most sensitive word $x_{r_3}$, and so on, until the crafted example successfully misleads the target model into making a prediction $\hat{y}$ ($\hat{y} \neq y$). In this way, we obtain the adversarial text example $x^A$.
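Phase 2 can be sketched as a loop over the sorted index vector; the homoglyph map and toy sentiment model below are illustrative assumptions:

```python
# Phase 2 sketch (Eqs. 12-14): perturb words in descending-sensitivity order
# with visually similar character replacements until the prediction flips.
# The homoglyph map and toy model are illustrative assumptions.

HOMOGLYPHS = {"o": "0", "l": "1", "a": "\u0251"}  # 'a' -> Latin small alpha

def perturb(word):
    """Replace the first mappable character with its visual look-alike."""
    for ch, sub in HOMOGLYPHS.items():
        if ch in word:
            return word.replace(ch, sub, 1)
    return word

def craft(model, tokens, scores):
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])  # Eq. (12)
    y = model(tokens)
    adv = list(tokens)
    for i in order:
        adv[i] = perturb(adv[i])        # Eqs. (13)-(14)
        if model(adv) != y:
            return adv                   # successful adversarial example
    return None

def toy_model(tokens):
    """Toy sentiment label: 'neg' iff a known negative word is present."""
    return "neg" if {"avoid", "flat"} & set(tokens) else "pos"

print(craft(toy_model, ["avoid", "this", "flat", "movie"], [0.4, 0.0, 0.3, 0.05]))
# -> ['av0id', 'this', 'f1at', 'movie']
```

Here the two highest-scoring words must both be perturbed before the toy label flips, mirroring the iterative "perturb until misled" loop described above.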
Our proposed explainable adversarial attack framework can leverage most local post-hoc explainable methods. As LIME and SHAP are two representative local post-hoc explainable methods, we elaborate on using LIME and SHAP for adversarial text attacks in detail.

LIME-based Attack: XATA-LIME
We perform the following five steps to estimate token sensitivities for attacks: 1) generate perturbed samples $\tilde{x}$ of the input example $x$; 2) query the target model to obtain $\mathcal{F}_y(\tilde{x})$ for each sample; 3) represent each sample by its token-presence vector $z(x, \tilde{x})$ from Equation (11); 4) weight each sample by its proximity $\pi_x(\tilde{x})$ to the original example; and 5) fit the explanation model $g$ on the weighted samples by solving Equation (15):

$$\xi = \arg\min_{g \in \mathcal{G}} \; \mathcal{L}(\mathcal{F}, g, \pi_x) + \Omega(g) \quad (15)$$

where $\mathcal{L}$ is the proximity-weighted loss between the target model and the explanation model, and $\Omega(g)$ penalizes the number of non-zero weights to ensure a sparse explanation. Optimization techniques like gradient-descent algorithms can be adopted to determine the model parameters, which include the sensitivity score for each token. Then, Phase 2 is executed as described in Section 3.2.2.
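The procedure can be sketched as a minimal LIME-style fit. The stand-in target model, the exponential proximity kernel, and the omission of the sparsity term $\Omega$ are simplifying assumptions:

```python
import random, math

# Minimal LIME-style fit (Eq. 15 without the sparsity term): sample binary
# masks, query the target on masked inputs, weight each sample by proximity
# to the full example, and fit the additive model by gradient descent.
# The stand-in target and the kernel are illustrative assumptions.

def target(mask, scores=(0.4, 0.0, 0.3, 0.05), bias=0.1):
    """Stand-in black box: secretly additive, so the fit is checkable."""
    return bias + sum(s * m for s, m in zip(scores, mask))

def lime_fit(n=4, samples=200, steps=1500, lr=0.05, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(samples):                       # sample masks, query target
        mask = [rng.randint(0, 1) for _ in range(n)]
        pi = math.exp(-(n - sum(mask)) / n)        # proximity weight (assumed kernel)
        data.append((mask, target(mask), pi))
    s0, s = 0.0, [0.0] * n
    for _ in range(steps):                         # weighted least squares by GD
        g0, g = 0.0, [0.0] * n
        for mask, y, pi in data:
            err = s0 + sum(si * m for si, m in zip(s, mask)) - y
            g0 += pi * err
            for i in range(n):
                g[i] += pi * err * mask[i]
        s0 -= lr * g0 / len(data)
        for i in range(n):
            s[i] -= lr * g[i] / len(data)
    return s0, s

s0, s = lime_fit()
print(round(s0, 2), [round(si, 2) for si in s])   # recovers ~0.1 and the scores
```

Because the stand-in target is exactly additive and noise-free, the fitted coefficients converge to the true per-token contributions, which is what the attack then sorts and perturbs.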

SHAP-based Attack: XATA-SHAP
The token sensitivity scores $s_i, \forall i$, computed by the SHAP explainable method are called Shapley values. The process of XATA-SHAP is similar to that of XATA-LIME, except that the Shapley kernel is used in Step 4 to calculate the proximity. Specifically,

$$\pi_x(\tilde{x}) = \frac{n - 1}{\binom{n}{|\tilde{x}|} \, |\tilde{x}| \, (n - |\tilde{x}|)} \quad (16)$$

where $|\tilde{x}|$ is the number of tokens present in $\tilde{x}$. Accordingly, the optimal parameters of the surrogate model are given by:

$$\xi = \arg\min_{g \in \mathcal{G}} \sum_{\tilde{x}} \big[\mathcal{F}_y(\tilde{x}) - g(z(x, \tilde{x}))\big]^2 \, \pi_x(\tilde{x}) \quad (17)$$

The Shapley kernel guarantees desirable properties for sensitivity estimation (e.g., local accuracy, missingness, and consistency [18]), thus estimating more accurate token sensitivities than LIME.
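The Shapley kernel of Equation (16) is straightforward to compute per coalition size; a small sketch follows (treating the empty and full coalitions as infinite-weight cases follows common Kernel SHAP practice, an assumption on our part):

```python
from math import comb

# Shapley kernel weight of Eq. (16) for a coalition with k of n tokens present.
# Weights diverge at k = 0 and k = n; Kernel SHAP typically enforces those
# coalitions as hard constraints instead (an assumption noted in the lead-in).

def shapley_kernel(n, k):
    if k == 0 or k == n:
        return float("inf")              # fully absent / fully present coalitions
    return (n - 1) / (comb(n, k) * k * (n - k))

n = 4
print([round(shapley_kernel(n, k), 4) for k in range(1, n)])  # -> [0.25, 0.125, 0.25]
```

The U-shaped weights (largest for nearly empty and nearly full coalitions) are what distinguish the Shapley kernel from LIME's distance-based kernel and yield the consistency guarantees cited above.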
Moreover, premised on cooperative game theory, the Shapley value explains the contribution of each token towards the model outcome. The Shapley value is highly informative because it predicts how the model will behave without each token. For example, if the Shapley value for a specific word is $s_i$, the model outcome will theoretically be reduced by $s_i$ when this word is masked. Hence, hackers can craft adversarial examples guided by Shapley values. The pseudocode of the explanation-based (LIME/SHAP-based) adversarial text attack method is shown in Algorithm 1.

Baseline Attack Methods
We compared the proposed XATA with existing text attack baseline methods. To ensure a fair comparison, we used the visually similar character replacement strategy introduced in Section 3 for all baselines. Hence, the performance differences are completely attributed to the sensitivity estimation. The baseline methods included TextBugger [7], Gradient*Input [37], DeepFool [6], DeepWordBug [5], PWWS [14], and TextFooler [13], each of which has been described in Section 2. The token sensitivity was computed according to Equations (4), (5), (6), (7), (8), and (9), respectively. Note that the gradient-based methods require training a surrogate model. However, the training process may introduce extra biases, making a fair comparison difficult. Hence, we used the target model itself as the surrogate model; in other words, the gradient-based methods were conducted in the white-box setting. Consistent with prior studies [5], [7], [13], [14], [45], we also included a random-based method: we randomly assigned a sensitivity score to each word in the example and repeatedly perturbed the word with the highest score until the target model was fooled.

Attacked Deep Learning Models
Consistent with prior adversarial attack studies [5], [14], [28], [44], we verified the effectiveness of the proposed XATA on three types of deep learning-based classifiers: RNN, CNN, and BERT. The details are as follows: • RNN (LSTM, BiLSTM, GRU, BiGRU). The first layer is an embedding layer with an embedding matrix. We used the pre-trained 100-dimension GloVe word embeddings by Pennington et al. [19] to transform the discrete inputs into dense vectors. The second layer is the RNN layer (e.g., LSTM) with 128 hidden nodes. The final layer is a fully connected layer for classification. • CNN (CNN-2, CNN-3, CNN-4). The first layer is the same as in the RNN-based classifiers, mapping discrete textual examples to vectors. In the second layer, filters of different sizes are used for the convolution operation.
For each filter, the width can be adjusted while the length is fixed to the embedding dimension. We use two filter widths (3 and 4) in CNN-2, three (3, 4, and 5) in CNN-3, and four (2, 3, 4, and 5) in CNN-4; the channel number for each filter is 100. We then apply a max-pooling operation to the feature vectors obtained by the convolution operation and use a fully connected layer for classification. • BERT. We used the pre-trained BERT-base-uncased model, a 12-layer BERT with 768 hidden units and 12 attention heads. We add a fully connected layer for classification and then fine-tune BERT on our datasets. All models are trained at the word level. We adopted the Adam optimizer [46] to optimize the parameters. The learning rates are 5e-3 for RNN and CNN and 5e-5 for BERT. All models are trained with a hold-out strategy on the original training data (i.e., 80% for training and 20% for validation) and tested on the test data of each dataset. All models are implemented in PyTorch. The experiments were conducted on a GPU server with an Intel® Xeon® Gold 6226R CPU @ 2.90GHz and two NVIDIA GeForce RTX™ 3090 GPUs with 24GB GDDR6X.

Evaluation Metrics
To verify the effectiveness of the proposed text attack method, we crafted adversarial examples for the examples in the test set. We adopted four evaluation metrics commonly used in adversarial attack studies, each averaged over the examples [7], [13], [42], [47]: attacked accuracy, success rate@N, perturbation rate, and perturbation impact@N. For metrics without an @N bound, the similarity constraint in Equation (1) was relaxed; such metrics represent the attack performance without a perturbation upper bound.

Attacked Accuracy
We executed XATA-SHAP and XATA-LIME on three datasets (IMDB, Yelp, and Amazon). The target models, including RNN, CNN, and BERT, achieve high accuracy on the test set (called original accuracy). For instance, the original accuracies of all models were more than 92% on Yelp. Table 2 compares the original accuracy with the attacked accuracy. The lowest accuracies are bold-faced. We observed that the accuracy of all classifiers decreased dramatically after being attacked with XATA. Even for BERT, whose original accuracy was over 90% on all three datasets (91.79% on IMDB, 94.68% on Yelp, and 95.19% on Amazon), the attacked accuracy was nearly 0%. Meanwhile, we found XATA-SHAP to be more threatening than XATA-LIME, as XATA-SHAP usually yields a lower attacked accuracy. For example, BiLSTM has an attacked accuracy of 7.55% on IMDB when attacked by XATA-LIME, but its accuracy drops to 0 when attacked by XATA-SHAP. The results indicate that adversarial examples generated by XATA successfully fooled DL classifiers. The proposed XATA (whether XATA-LIME or XATA-SHAP) is effective for text attacks, and XATA-SHAP is more effective than XATA-LIME. For an example in the IMDB dataset, the original review was classified as negative with a probability of 99.9%. XATA-SHAP offered the top 5 sensitive words, which included "avoid", "fails", "flat", "tries", and "fails". Perturbation was executed for each word sequentially, and the decision was altered after perturbing the first three words. In particular, the adversarial example was successfully crafted by changing "AVOID" to "AV0ID", "fails" to "fɑils", and "FLAT" to "F1AT". Such perturbations are visually similar to the original characters, making them human-imperceptible. For the example in the Yelp dataset, the adversarial example was crafted by perturbing only the most sensitive word, "quality", to "quɑlity".
Similarly, for the example in the Amazon dataset, after perturbing the two most sensitive words "Painful" and "how" to "Painfu1" and "h0w", the adversarial example misled the BERT classifier into a wrong prediction. This demonstrates the effectiveness of the adversarial examples crafted by the proposed XATA.

Original example. Text: "Painful", "This book has to be one of the most tedious works of literature ever written. Hawthorne is a great writer, but I don't know how this book made it into that sacred list we call "classics". Perhaps on the merit of his name alone?" Top sensitive words: Painful, how, has, name, list. Prediction: Negative (99.9%).

Adversarial example. Text: "Painfu1", "This book has to be one of the most tedious works of literature ever written. Hawthorne is a great writer, but I don't know h0w this book made it into that sacred list we call "classics". Perhaps on the merit of his name alone?" Prediction: Positive (97.5%).

Success Rate@N Comparison
We evaluated the performance of XATA under a perturbation upper bound by comparing XATA with other text attack baselines. The higher the success rate, the stronger the attack method. The results are summarized in Table 4. The best performance is bolded, while the second-best performance is underlined. We set the perturbation upper bound to 2%-10%. Generally, both XATA-SHAP and XATA-LIME outperformed the baselines. As the same perturbation strategy was adopted, this indicates that the sensitivities obtained by SHAP or LIME enabled more effective attacks. Note that even though the gradient-based methods like TextBugger and Gradient*Input were conducted in the white-box setting, our methods were more effective than them. This demonstrates the effectiveness of our methods. Besides, as the perturbation upper bound increases, the success rate of XATA increases faster than that of the baselines. For example, with an upper bound of 2%, XATA-SHAP achieves a success rate of 52% against an LSTM classifier on IMDB; the best-performing baseline was PWWS, with a success rate of 47.8%. With the upper bound expanded to 10%, both XATA-SHAP (89.0%) and XATA-LIME (78.8%) were much higher than PWWS (70%). Hence, the advantage of our methods becomes more obvious as more words are perturbed. This is because our methods consider the overlapping effects, so the set of words to be perturbed is more impactful than that of the baselines. Overall, the proposed XATA achieved effective attacks by identifying the most sensitive words more accurately than the baselines.

Perturbation Rate Comparison
To further compare XATA with the baselines, we compared the number of words that had to be perturbed for the attack to successfully fool the classifier. For each example, if the attack succeeds, the number of perturbed words is recorded; if the attack fails, the number of words in the whole example is recorded. The perturbation rates of the different methods on the three datasets are summarized in Table 5. The best performance is bolded, while the second-best performance is underlined.
Since a lower perturbation rate indicates a higher-quality adversarial example, we conclude that XATA is superior to the baselines. For example, to fool an LSTM classifier on the IMDB dataset, XATA-SHAP only needs to perturb 4.871% (11.342 words) and XATA-LIME 7.628% (17.759 words). Their required perturbation rates are significantly smaller than that of the best-performing baseline, DeepWordBug, whose perturbation rate was 12.906% (30.048 words). The superiority of XATA is also evident for BERT. For example, the perturbation rate of XATA-SHAP is 15.375% (12.436 words) when fooling BERT on Amazon, and that of XATA-LIME is 19.752% (15.976 words), only about half of the perturbation required by DeepWordBug. This indicates that XATA's suggested perturbations were more effective in successfully attacking the target model.
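Under the protocol just described (failed attacks count the whole example), the mean perturbation rate and mean perturbed-word count can be sketched as follows; the log format is a hypothetical assumption:

```python
def perturbation_rate(results):
    """results: list of (total_words, perturbed_or_None).
    A failed attack (None) counts the whole example, per the protocol above.
    Returns (mean fraction of words perturbed, mean perturbed-word count)."""
    counts = [perturbed if perturbed is not None else total
              for total, perturbed in results]
    rates = [c / total for c, (total, _) in zip(counts, results)]
    return sum(rates) / len(rates), sum(counts) / len(counts)

# toy log: one success needing 5 of 100 words, one failure on a 200-word text
rate, count = perturbation_rate([(100, 5), (200, None)])
print(round(rate, 3), round(count, 1))  # -> 0.525 102.5
```

Counting failures at the full document length penalizes weak attacks, which is why the reported percentages also reflect success rates rather than successful attacks alone.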

Perturbation Impact@N Comparison
For a more fine-grained comparison, we calculated the perturbation impact of different methods under different perturbation upper bounds. A higher perturbation impact means a more significant effect on the prediction results of the target model. As shown in Table 6, XATA-SHAP generally achieves the maximum perturbation impact, with XATA-LIME second. Hence, our methods have a more significant impact on the target model than the baselines. For example, the perturbation impact of XATA-SHAP for BERT on the IMDB dataset is 0.627 with a perturbation upper bound of 10%, while XATA-LIME's impact is 0.503 and that of the best-performing baseline (PWWS) is 0.498 in the same case. Similar to the success rate@N comparison in Section 4.3.2, the perturbation impact of XATA (SHAP and LIME) increases faster than that of the baselines. For example, PWWS achieves the highest perturbation impact@2% (0.312) for BERT on the IMDB dataset, but it is overtaken by XATA-SHAP and XATA-LIME at perturbation impact@4%. When the upper bound is allowed to reach 10%, XATA-SHAP (perturbation impact of 0.627) and XATA-LIME (0.608) significantly outperform PWWS (0.498). These findings further indicate that the word sensitivity obtained through deletion-based methods, such as PWWS, is compromised by overlapping effects between tokens. Moreover, even though the white-box setting gave the gradient-based methods (TextBugger and Gradient*Input) an advantage, our methods still outperformed them. To sum up, by comparing perturbation impacts, we found that XATA had a more significant impact on the target model than the other methods and is more threatening to deep learning security, consistent with the conclusions of the previous comparisons.
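The formula is not restated in this section; assuming perturbation impact@N is the average drop in the target model's confidence in the originally predicted class after perturbing at most N% of the words, the metric can be sketched as:

```python
def perturbation_impact_at_n(pairs):
    """pairs: list of (prob_before, prob_after) for the originally predicted
    class, measured after perturbing at most N% of the words.
    Returns the mean confidence drop (assumed definition, for illustration)."""
    drops = [before - after for before, after in pairs]
    return sum(drops) / len(drops)

# toy pairs: one near-flip and one modest confidence reduction
impact = perturbation_impact_at_n([(0.99, 0.40), (0.95, 0.80)])
print(round(impact, 2))  # -> 0.37
```

Unlike success rate, this quantity rewards attacks that weaken the model's confidence even when the label does not flip, which is why it separates methods more finely at small budgets.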

Application in White-box Attack
Though we only demonstrated how the LIME- or SHAP-based adversarial text attacks can be conducted in black-box scenarios, they can also operate in the white-box setting, because access to model details hinders neither the process nor the results of the explainable methods. Hackers can craft adversarial examples without relying on any knowledge of model internals. Interestingly, we found that the proposed XATA, without using any model details, outperformed white-box attack baselines like TextBugger and Gradient*Input. This is understandable because explainable methods were originally proposed to give developers insights into trained complex DL models. Even when developers are fully aware of model details, they still need explainable methods to understand the model's rationale, such as token sensitivity. Hence, local post-hoc explainable methods can provide hackers with accurate sensitivity scores for each token without relying on model details.
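As an illustration of how such token sensitivity scores can be obtained from probability queries alone, the sketch below estimates per-token Shapley values by Monte-Carlo permutation sampling (the sampling idea underlying SHAP). The toy lexicon classifier is a hypothetical stand-in for the black-box target model; the paper attacks LSTM, GRU, CNN, and BERT models instead:

```python
import random

def shapley_sensitivity(tokens, predict_proba, target_class,
                        n_samples=100, seed=0):
    """Monte-Carlo estimate of each token's Shapley value with respect to
    the target-class probability, using only black-box queries."""
    rng = random.Random(seed)
    n = len(tokens)
    values = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)          # random coalition-building order
        present = []
        prev = predict_proba([])[target_class]
        for i in order:
            present.append(i)
            present.sort()          # keep original token order
            cur = predict_proba([tokens[j] for j in present])[target_class]
            values[i] += cur - prev # marginal contribution of token i
            prev = cur
    return [v / n_samples for v in values]

# toy lexicon "classifier" standing in for the target model (an assumption)
NEGATIVE_WORDS = {"painful", "tedious"}

def toy_predict(tokens):
    neg_hits = sum(t.lower() in NEGATIVE_WORDS for t in tokens)
    p_neg = neg_hits / (neg_hits + 1)
    return {"neg": p_neg, "pos": 1 - p_neg}

scores = shapley_sensitivity(["Painful", "but", "great", "writer"],
                             toy_predict, "neg")
top = max(range(len(scores)), key=lambda i: scores[i])
print(top)  # -> 0, i.e. "Painful" drives the negative prediction
```

Each query sends only a token subset to the model, so the estimate needs no gradients or architecture knowledge, which is exactly what makes the black-box attack feasible.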

Contradiction between Explanation and Adversarial Robustness
The experiments showed that XATA outperformed all existing baselines, and that an improved explainable method like SHAP leads to more effective adversarial attacks. On the one hand, more effective adversarial attacks can be used to evaluate model robustness and security more accurately. On the other hand, hackers can also leverage such methods to threaten target models. The lack of explainability has motivated researchers to propose a myriad of remedies, among which local post-hoc methods have attracted the most effort. However, improved explanations also give hackers weapons to estimate token sensitivity more accurately and efficiently, enabling them to craft more targeted and threatening adversarial examples. As a result, such efforts increase a model's risk of being attacked. Improvement in model explanation thus appears to lead to a decrease in adversarial robustness, suggesting a contradiction (trade-off) between explanation and adversarial robustness for DL. This phenomenon necessitates additional attention and consideration.

CONCLUSION
Deep learning models have achieved tremendous success in various natural language processing tasks. However, they are also strikingly vulnerable to adversarial attacks. In this paper, we proposed a new method for adversarial text attacks, motivated by the connections between local post-hoc explainable methods for deep learning and the sensitivity estimation phase of adversarial text attacks. We used two local post-hoc explainable methods (i.e., LIME and SHAP) to measure the sensitivity of each word and then perturbed the words according to the sensitivity scores provided by the explanation methods. Based on experiments on three commonly used datasets, we demonstrated the advantages of the proposed attack method (XATA) over state-of-the-art baselines across various target models, including LSTM, GRU, CNN, and BERT. We also discovered that an improved explainable method (i.e., SHAP) enables more threatening attacks. This may imply a contradiction between explanation and adversarial robustness and necessitates attention from our community.

There are several promising directions for future research. First, this study focused on local post-hoc methods; future studies can analyze other explainable methods, such as global methods or local intrinsic methods, for adversarial attacks. Second, we only adopted the visually similar character replacement perturbation strategy; different word- or character-level perturbation strategies (e.g., insertion, flipping, removal) could be introduced. Third, this study centered on adversarial text attacks; the methods could be applied to other domains like computer vision to test their generalizability. Fourth, this study concentrated on binary classification, where the targeted attack is the same as the untargeted attack; future studies can extend XATA to other classification tasks like multi-class and multi-label classification.