Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression
Deep learning (DL) models have significantly improved performance on text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have aimed to develop adversarial attacks against DL models in realistic black-box settings (i.e., settings that assume no model knowledge is accessible to attackers). These attacks typically operate with a two-phase framework: (1) sensitivity estimation, which uses gradient-based or deletion-based methods to evaluate how sensitive the target model's prediction is to each token, and (2) perturbation execution, which crafts adversarial examples based on the estimated token sensitivity. However, the gradient-based and deletion-based methods used to estimate sensitivity often struggle to capture the directionality of tokens and suffer from overlapping token sensitivities, respectively. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME or SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA’s attack performance on DL models performing text classification on three datasets (IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity) and on DL models performing text regression on three datasets (My Personality, Drug Review, and CommonLit Readability). XATA outperformed the existing gradient-based and deletion-based adversarial attack baselines in both tasks. These findings indicate that the ever-growing body of research on improving the explainability of DL models with additive feature attribution explainable methods can provide attackers with weapons to launch targeted adversarial attacks.
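To make the two-phase framework concrete, the following is a minimal sketch and not the XATA implementation. It assumes a toy black-box scorer (`predict` with a hypothetical `_WEIGHTS` table) and replaces the LIME and SHAP libraries with a plain Monte Carlo permutation estimate of Shapley values, which is one simple additive feature attribution. The `substitutes` dictionary is likewise a hypothetical stand-in for a real perturbation strategy. Phase 1 produces signed per-token attributions; phase 2 greedily replaces the tokens that most support the current label until the label flips.

```python
import math
import random

# Hypothetical stand-in for a black-box sentiment model: the attacker can only
# call predict(tokens) and read back the probability of the positive class.
_WEIGHTS = {"great": 2.0, "enjoyable": 1.5, "boring": -1.0}


def predict(tokens):
    z = sum(_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def shapley_attributions(tokens, model, n_samples=200, seed=0):
    """Phase 1: estimate a signed, additive attribution for each token.

    Averages each token's marginal contribution over random orderings in
    which tokens are added to an initially empty input (Monte Carlo Shapley).
    """
    rng = random.Random(seed)
    n = len(tokens)
    phi = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        present = set()
        prev = model([])
        for i in order:
            present.add(i)
            cur = model([tokens[j] for j in range(n) if j in present])
            phi[i] += cur - prev
            prev = cur
    return [p / n_samples for p in phi]


def attack(tokens, model, substitutes, threshold=0.5):
    """Phase 2: greedily substitute the tokens that most support the label."""
    original_label = model(tokens) >= threshold
    phi = shapley_attributions(tokens, model)
    sign = 1.0 if original_label else -1.0
    # Tokens pushing hardest toward the current label are perturbed first.
    order = sorted(range(len(tokens)), key=lambda i: sign * phi[i], reverse=True)
    adversarial = list(tokens)
    for i in order:
        if sign * phi[i] <= 0:
            break  # remaining tokens do not support the current label
        if tokens[i] not in substitutes:
            continue
        adversarial[i] = substitutes[tokens[i]]
        if (model(adversarial) >= threshold) != original_label:
            return adversarial
    return None  # no label flip found with the available substitutes


if __name__ == "__main__":
    sentence = "the film was great and enjoyable but boring".split()
    print(shapley_attributions(sentence, predict))
    print(attack(sentence, predict, {"great": "fine", "enjoyable": "watchable"}))
```

Because the attributions are signed, the attack can tell tokens that support the current prediction from tokens that oppose it, and the attributions sum to the difference between the model's output on the full input and on the empty input. A real attack would query an actual target model and would typically constrain substitutions to preserve meaning.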