Local Post-hoc Explainable Methods for Adversarial Text Attacks
  • Yidong Chai,
  • Ruicheng Liang,
  • Hongyi Zhu,
  • Sagar Samtani,
  • Meng Wang,
  • Yezheng Liu,
  • Yuanchun Jiang
Corresponding Author: Ruicheng Liang, Hefei University of Technology ([email protected])

Abstract

Deep learning models have significantly advanced various natural language processing tasks. However, they are strikingly vulnerable to adversarial text attacks, even in the black-box setting where attackers have no access to model knowledge. Such attacks follow a two-phase framework: 1) a sensitivity estimation phase that evaluates each input element's sensitivity to the target model's prediction, and 2) a perturbation execution phase that crafts adversarial examples based on the estimated sensitivity. This study explores the connection between local post-hoc explainable methods for deep learning and black-box adversarial text attacks, and proposes a novel eXplanation-based method for crafting Adversarial Text Attacks (XATA). XATA leverages local post-hoc explainable methods (e.g., LIME or SHAP) to measure the sensitivity of input elements and adopts a word-replacement perturbation strategy to craft adversarial examples. We evaluated the attack performance of XATA on three commonly used text datasets: IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity. XATA outperformed existing baselines against various target models, including LSTM, GRU, CNN, and BERT. Moreover, we found that stronger local post-hoc explainable methods (e.g., SHAP) lead to more effective adversarial attacks. These findings show that as researchers advance the explainability of deep learning models with local post-hoc methods, they also hand attackers the means to craft more targeted and dangerous adversarial attacks.
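To make the two-phase framework concrete, here is a minimal Python sketch of an XATA-style attack loop, using LIME's LimeTextExplainer for the sensitivity-estimation phase and greedy synonym replacement for the perturbation-execution phase. The toy predict_proba model, the tiny SYNONYMS table, and the xata_like_attack routine are hypothetical placeholders, not taken from the paper; the authors' exact scoring and replacement rules may differ. SHAP could be swapped in as the explainer in the same slot.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

# --- Hypothetical stand-ins (not from the paper) ---------------------------
POSITIVE_WORDS = {"good", "great", "excellent", "wonderful"}
SYNONYMS = {"great": ["decent", "okay"], "excellent": ["fine", "passable"]}

def predict_proba(texts):
    """Toy black-box target model returning (n, 2) [p_neg, p_pos] scores.
    In the paper this would be the victim LSTM/GRU/CNN/BERT classifier."""
    probs = []
    for t in texts:
        hits = sum(w.strip(".,!?").lower() in POSITIVE_WORDS for w in t.split())
        p_pos = min(0.95, 0.2 + 0.25 * hits)
        probs.append([1.0 - p_pos, p_pos])
    return np.array(probs)

def get_synonyms(word):
    """Toy synonym source; a real attack would use a richer vocabulary."""
    return SYNONYMS.get(word.lower(), [])

# --- XATA-style two-phase attack loop ---------------------------------------
def xata_like_attack(text, num_features=10):
    explainer = LimeTextExplainer(class_names=["neg", "pos"])
    orig_label = int(np.argmax(predict_proba([text])[0]))

    # Phase 1: sensitivity estimation. Rank words by the magnitude of their
    # local explanation weight (LIME here; SHAP is a drop-in alternative).
    exp = explainer.explain_instance(text, predict_proba, num_features=num_features)
    ranked = [w for w, _ in sorted(exp.as_list(), key=lambda p: -abs(p[1]))]

    # Phase 2: perturbation execution. Greedily replace each sensitive word
    # with the synonym that most lowers the original-class probability.
    tokens = text.split()
    for word in ranked:
        for i, tok in enumerate(tokens):
            if tok.strip(".,!?").lower() != word.lower():
                continue
            suffix = tok[len(tok.rstrip(".,!?")):]  # keep trailing punctuation
            best_syn = None
            best_prob = predict_proba([" ".join(tokens)])[0][orig_label]
            for syn in get_synonyms(word):
                cand = tokens[:i] + [syn + suffix] + tokens[i + 1:]
                prob = predict_proba([" ".join(cand)])[0][orig_label]
                if prob < best_prob:
                    best_syn, best_prob = syn, prob
            if best_syn is not None:
                tokens[i] = best_syn + suffix
                adv = " ".join(tokens)
                if int(np.argmax(predict_proba([adv])[0])) != orig_label:
                    return adv  # prediction flipped: adversarial example found
    return None  # attack failed within the word budget

if __name__ == "__main__":
    print(xata_like_attack("The movie was great and the acting was excellent."))
```

On the toy model above, replacing the highly weighted word "great" with "decent" already flips the predicted label, which mirrors the paper's core observation: the better the explainer ranks sensitive words, the fewer replacements an attack needs.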
Published in IEEE Transactions on Knowledge and Data Engineering, 2023, pp. 1-14. DOI: 10.1109/TKDE.2023.3270581