PURIFYING ADVERSARIAL IMAGES USING ADVERSARIAL AUTOENCODER WITH
CONDITIONAL NORMALIZING FLOWS
Abstract
We present a target-agnostic adversarial autoencoder with conditional
normalizing flows that, given any unlabeled image dataset, is designed
to purify adversarial samples into clean images, i.e., to remove
adversarial noise from the images while preserving their visual quality.
We interpret the model as performing manifold projection: the encoder
maps each sample back onto the posterior data distribution in latent
space, so that the sample is less likely to appear irregular with
respect to the representation learned by any target classifier.
Normalizing flows conditioned on top of our hybrid network structure,
together with walk-back training, are used to address two common
drawbacks of generative-model and autoencoder-based approaches: the
trade-off between compression loss and over-fitting to the training
data, and the structural dependency of the model on dataset classes and
labels. Experiments
demonstrate that our proposed model is preferable to existing
target-agnostic adversarial defense methods, particularly for large and
unlabeled image datasets.
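
As a rough illustration of what a conditional normalizing flow step can
look like, the sketch below implements a single conditional affine
coupling layer in plain Python. It is not the architecture proposed
here: the class name ConditionalAffineCoupling, the condition vector c,
and the random linear maps standing in for learned networks are all
assumptions made for illustration. The point it shows is that the
transform remains exactly invertible for any condition, with a
log-determinant that is cheap to compute.

```python
import math
import random


def _linear(h, w):
    """Multiply row vector h by weight matrix w (given as a list of rows)."""
    return [sum(h[i] * w[i][j] for i in range(len(h))) for j in range(len(w[0]))]


class ConditionalAffineCoupling:
    """One affine coupling step whose scale and shift depend on a condition c.

    Hypothetical sketch: random linear maps stand in for learned networks.
    """

    def __init__(self, dim, cond_dim, seed=0):
        if dim % 2 != 0:
            raise ValueError("dim must be even")
        rng = random.Random(seed)
        self.half = dim // 2
        n_in = self.half + cond_dim
        # Weights mapping [z_a, c] to the log-scale and the shift.
        self.w_s = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(self.half)]
                    for _ in range(n_in)]
        self.w_t = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(self.half)]
                    for _ in range(n_in)]

    def _params(self, z_a, c):
        h = list(z_a) + list(c)
        log_s = [math.tanh(v) for v in _linear(h, self.w_s)]  # bounded log-scale
        t = _linear(h, self.w_t)
        return log_s, t

    def forward(self, z, c):
        """Return (y, log|det J|); the first half of z passes through unchanged."""
        z_a, z_b = z[:self.half], z[self.half:]
        log_s, t = self._params(z_a, c)
        y_b = [b * math.exp(s) + ti for b, s, ti in zip(z_b, log_s, t)]
        return list(z_a) + y_b, sum(log_s)

    def inverse(self, y, c):
        """Recover z from y using the same condition c."""
        y_a, y_b = y[:self.half], y[self.half:]
        log_s, t = self._params(y_a, c)
        z_b = [(b - ti) * math.exp(-s) for b, s, ti in zip(y_b, log_s, t)]
        return list(y_a) + z_b


# Usage: transform a latent vector under a condition, then invert it.
layer = ConditionalAffineCoupling(dim=4, cond_dim=2)
z = [0.5, -1.0, 2.0, 0.25]
c = [1.0, -0.5]
y, log_det = layer.forward(z, c)
z_rec = layer.inverse(y, c)
```

Because the scale and shift are computed only from the untouched half of
the input and from c, the inverse can recompute them exactly, which is
what makes the layer invertible regardless of how the condition is
produced.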