Tamp-X: Attacking Explainable Natural Language Classifiers Through
Tampered Activations
Abstract
While Deep Neural Networks (DNNs) have been instrumental
in achieving state-of-the-art results for various Natural Language
Processing (NLP) tasks, recent works have shown that the decisions made
by DNNs cannot always be trusted. Recently, Explainable Artificial
Intelligence (XAI) methods have been proposed as a means of increasing
the reliability and trustworthiness of DNNs. These XAI methods are, however,
open to attack and can be manipulated in both white-box (gradient-based)
and black-box (perturbation-based) scenarios. Exploring novel techniques
to attack and robustify these XAI methods is crucial to fully understand
these vulnerabilities. In this work, we propose Tamp-X, a novel
attack that tampers with the activations of robust NLP classifiers, forcing
state-of-the-art white-box and black-box XAI methods to generate
misrepresented explanations. To the best of our knowledge, we are the
first in the current NLP literature to attack both white-box and
black-box XAI methods simultaneously. We quantify the reliability of
explanations based on three metrics: descriptive
accuracy, cosine similarity, and the Lp norms
of the explanation vectors. Through extensive experimentation, we show
that the explanations generated for the tampered classifiers are not
reliable, and significantly disagree with those generated for the
untampered classifiers, even though the output decisions of the tampered and
untampered classifiers are almost always the same. Additionally, we
study the adversarial robustness of the tampered NLP classifiers and
find that the tampered classifiers that are harder for XAI methods
to explain are also harder for adversaries to attack.
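
As an illustration of two of the metrics named above, the following minimal sketch compares a pair of explanation (attribution) vectors using cosine similarity and an Lp norm. It is not the paper's implementation: the function names, the choice to take the Lp norm of the difference between the two vectors, and the example attribution values are all assumptions made for illustration. Descriptive accuracy is omitted because its definition is not given in this abstract.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length explanation vectors."""
    if len(a) != len(b):
        raise ValueError("explanation vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def lp_distance(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Lp norm of the element-wise difference (an assumed reading of the metric)."""
    if len(a) != len(b):
        raise ValueError("explanation vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid norm")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        # The L-infinity norm is the largest absolute difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical per-token attributions for the same four-token input,
# one from an untampered classifier and one from a tampered classifier.
original = [0.6, 0.3, 0.1, 0.0]
tampered = [0.0, 0.1, 0.3, 0.6]

similarity = cosine_similarity(original, tampered)
l1_gap = lp_distance(original, tampered, p=1)
l2_gap = lp_distance(original, tampered, p=2)
print(f"cosine={similarity:.3f}  L1={l1_gap:.3f}  L2={l2_gap:.3f}")
```

Under this reading, a low cosine similarity together with a large Lp distance would indicate that the two explanations disagree, even when both classifiers predict the same label.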