TechRxiv

Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression

preprint
posted on 25.05.2022, 21:57 by Yidong Chai, Ruicheng Liang, Sagar Samtani, Hongyi Zhu, Meng Wang, Yezheng Liu, Yuanchun Jiang

Deep learning (DL) models have significantly improved the performance of text classification and text regression tasks. However, DL models are often strikingly vulnerable to adversarial attacks. Many researchers have aimed to develop adversarial attacks against DL models in realistic black-box settings (i.e., no model knowledge is accessible to the attacker), typically operating with a two-phase framework: (1) sensitivity estimation, which uses gradient-based or deletion-based methods to evaluate the sensitivity of each token to the target model's prediction, and (2) perturbation execution, which crafts adversarial examples based on the estimated token sensitivity. Gradient-based methods, however, often fail to capture the directionality of tokens, while deletion-based methods suffer from overlapping token sensitivities. In this study, we propose a novel eXplanation-based method for Adversarial Text Attacks (XATA) that leverages additive feature attribution explainable methods, namely LIME and SHAP, to measure the sensitivity of input tokens when crafting black-box adversarial attacks on DL models performing text classification or text regression. We evaluated XATA's attack performance on DL models executing text classification on three datasets (IMDB Movie Review, Yelp Reviews-Polarity, and Amazon Reviews-Polarity) and DL models conducting text regression on three datasets (My Personality, Drug Review, and CommonLit Readability). The proposed XATA outperformed existing gradient-based and deletion-based adversarial attack baselines on both tasks. These findings indicate that the ever-growing body of research on improving the explainability of DL models with additive feature attribution methods can also provide attackers with weapons to launch targeted adversarial attacks.
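To make the two-phase framework concrete, the sketch below uses LIME's LimeTextExplainer for the sensitivity-estimation phase and a simple token-deletion rule for the perturbation phase. It is a minimal illustration under assumed names: predict_fn, craft_adversarial_example, the perturbation budget, and the deletion rule are all ours, not the paper's XATA implementation.

```python
# Minimal sketch of the two-phase black-box attack framework described in the
# abstract, with LIME as the additive feature attribution method. predict_fn,
# the budget, and the deletion-based perturbation are illustrative assumptions.
import re
from lime.lime_text import LimeTextExplainer

def craft_adversarial_example(text, predict_fn, budget=3):
    # predict_fn: the black-box target model, mapping a list of strings to an
    # (n_samples, n_classes) array of class probabilities.

    # Phase 1: sensitivity estimation. LIME fits a local additive surrogate
    # around `text`; each token's weight approximates its contribution to the
    # target model's prediction, including its direction (sign).
    explainer = LimeTextExplainer(class_names=["negative", "positive"])
    explanation = explainer.explain_instance(text, predict_fn,
                                             num_features=budget)
    ranked = sorted(explanation.as_list(), key=lambda tw: -abs(tw[1]))

    # Phase 2: perturbation execution. Here we simply delete the most
    # sensitive tokens (a crude stand-in for more fluent substitutions).
    adversarial = text
    for token, _weight in ranked[:budget]:
        adversarial = re.sub(r"\b%s\b" % re.escape(token), "", adversarial)
    return adversarial
```

Since LIME and SHAP are both additive feature attribution methods, SHAP's KernelExplainer could in principle be swapped into phase 1 with the same overall structure.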

Funding

72101079, 71722010, 72171071, 91746302, 91846201, JZ2021HGPA0060

History

Email Address of Submitting Author

rcliang@mail.hfut.edu.cn

ORCID of Submitting Author

0000-0001-6266-2657

Submitting Author's Institution

Hefei University of Technology

Submitting Author's Country

China