Con-Detect: Detecting Adversarially Perturbed Natural Language Inputs to
Deep Classifiers Through Holistic Analysis
Abstract
Deep Learning (DL) algorithms have achieved remarkable performance in many
Natural Language Processing (NLP) tasks such as language translation, spam
filtering, fake-news detection, and reading comprehension. However, research
has shown that the adversarial vulnerabilities of deep learning networks
manifest themselves when DL is used for NLP tasks. Most mitigation techniques
proposed to date are
supervised, relying on adversarial retraining to improve robustness, which is
impractical. This work introduces a novel, unsupervised methodology for
detecting adversarial inputs to NLP classifiers. In brief, we observe that
minimally perturbing an input
to change a model’s output—a major strength of adversarial
attacks—is a weakness that leaves unique statistical marks reflected
in the cumulative contribution scores of the input. Specifically, we show
that the cumulative contribution score, termed the CF-score, of adversarial
inputs is generally greater than that of clean inputs.
We thus propose Con-Detect, a Contribution-based Detection method, for
detecting adversarial attacks against NLP classifiers. Con-Detect can be
deployed with any classifier without having to retrain it. We experiment
with multiple attackers (Text-bugger, Text-fooler, and PWWS) on several
architectures (MLP, CNN, LSTM, Hybrid CNN-RNN, and BERT) trained for
different classification tasks (IMDB sentiment classification, fake-news
classification, and AG news topic classification) under different threat
models (Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive attacks),
and show that Con-Detect can reduce the
attack success rate (ASR) of different attacks from 100% to as low as
0% for the best cases and to approximately 70% for the worst case. Even in the worst
case, we note a 100% increase in the required number of queries and a
50% increase in the number of words perturbed, suggesting that
Con-Detect is hard to evade.
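
The minimal Python sketch below illustrates the contribution-based detection
idea summarized above. It is only a sketch under stated assumptions: the
leave-one-out contribution measure, the calibrated threshold, and the names
cf_score, is_adversarial, and prob_fn are illustrative placeholders rather
than the paper's exact CF-score formulation.

```python
# Illustrative sketch (not the authors' implementation): flag an input as
# adversarial when its cumulative word-contribution score exceeds a threshold
# calibrated on clean data. prob_fn is any callable returning the classifier's
# predicted-class probability for a list of tokens.
from typing import Callable, List


def cf_score(words: List[str], prob_fn: Callable[[List[str]], float]) -> float:
    """Cumulative contribution score: sum of drops in the predicted-class
    probability when each word is removed in turn (leave-one-out)."""
    base = prob_fn(words)
    score = 0.0
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        score += max(0.0, base - prob_fn(reduced))
    return score


def is_adversarial(words: List[str],
                   prob_fn: Callable[[List[str]], float],
                   threshold: float) -> bool:
    # Adversarial inputs tend to concentrate unusually large contributions
    # on a few perturbed words, yielding a higher cumulative score.
    return cf_score(words, prob_fn) > threshold
```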