Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression
  • Yidong Chai,
  • Ruicheng Liang,
  • Hongyi Zhu,
  • Sagar Samtani,
  • Meng Wang,
  • Yezheng Liu,
  • Yuanchun Jiang
Ruicheng Liang
Hefei University of Technology

Corresponding Author: [email protected]


Abstract

Deep learning (DL) models have significantly improved performance on text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have developed adversarial attacks against DL models in realistic black-box settings (i.e., settings in which attackers have no access to knowledge of the model). These attacks typically follow a two-phase framework: (1) sensitivity estimation, which uses gradient-based or deletion-based methods to evaluate how sensitive the target model's prediction is to each token, and (2) perturbation execution, which crafts adversarial examples based on the estimated token sensitivity. However, the gradient-based and deletion-based methods used to estimate sensitivity often struggle, respectively, to capture the directionality of tokens and to handle overlapping token sensitivities. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME or SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA’s attack performance on DL models performing text classification on three datasets (IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity) and on DL models performing text regression on three datasets (My Personality, Drug Review, and CommonLit Readability). XATA outperformed the existing gradient-based and deletion-based adversarial attack baselines in both tasks. These findings indicate that the ever-growing body of research on improving the explainability of DL models with additive feature attribution explainable methods can provide attackers with weapons to launch targeted adversarial attacks.
Published in 2023 in IEEE Transactions on Knowledge and Data Engineering, pages 1-14. DOI: 10.1109/TKDE.2023.3270581
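
The two-phase framework described in the abstract (sensitivity estimation, then perturbation execution) can be illustrated with a minimal Python sketch. This is not the authors' XATA implementation. It assumes a toy black-box scorer (query_model), hypothetical word weights and a hypothetical substitution table, and it computes exact Shapley values by brute force in place of calling the LIME or SHAP libraries. Shapley values are the quantity SHAP approximates, and they are signed, so they indicate the direction in which each token pushes the prediction.

```python
from itertools import permutations

# Hypothetical word weights standing in for an opaque target model.
TOY_WEIGHTS = {"great": 2.0, "slow": -0.5, "mediocre": -1.0}
# Hypothetical substitution table used in the perturbation phase.
TOY_SUBSTITUTES = {"great": "mediocre"}


def query_model(tokens):
    """Black-box query: returns a sentiment score; > 0 is read as 'positive'."""
    return sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)


def shapley_sensitivity(tokens, query):
    """Phase 1: exact Shapley value per token position.

    Brute force over all orderings, so only feasible for short inputs.
    The values are additive: they sum to query(tokens) - query([]).
    """
    n = len(tokens)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = set()
        previous = query([])
        for i in order:
            present.add(i)
            current = query([tokens[j] for j in sorted(present)])
            totals[i] += current - previous  # marginal contribution of token i
            previous = current
    return [total / len(orderings) for total in totals]


def craft_adversarial(tokens, query, substitutes):
    """Phase 2: perturb tokens that support the current label, most influential first."""
    originally_positive = query(tokens) > 0
    scores = shapley_sensitivity(tokens, query)
    sign = 1 if originally_positive else -1
    order = sorted(range(len(tokens)), key=lambda i: sign * scores[i], reverse=True)
    adversarial = list(tokens)
    for i in order:
        if sign * scores[i] <= 0:
            break  # remaining tokens do not support the current label
        if adversarial[i] in substitutes:
            adversarial[i] = substitutes[adversarial[i]]
        if (query(adversarial) > 0) != originally_positive:
            return adversarial  # label flipped
    return None  # no adversarial example found with this substitution table


tokens = ["great", "cast", "but", "slow", "plot"]
scores = shapley_sensitivity(tokens, query_model)
adversarial = craft_adversarial(tokens, query_model, TOY_SUBSTITUTES)
print(scores)
print(adversarial)
```

On this toy input, "great" receives the largest positive attribution, so it is the first token replaced. A deletion-based estimate would rank tokens similarly here because the toy scorer is linear. The two approaches would be expected to diverge on models where token contributions interact. For text regression, the label-flip check would be replaced by a threshold on the change in the predicted value.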