Abstract
Neural networks are vulnerable to adversarial attacks: small, visually imperceptible, crafted perturbations which, when added to the input, drastically change the output. The most effective defense against adversarial attacks is adversarial training. We analyze adversarially trained robust models to study their vulnerability to adversarial attacks at the level of the latent layers. Our analysis reveals that, contrary to the input layer, which is robust to adversarial attack, the latent layers of these robust models are highly susceptible to adversarial perturbations of small magnitude. Leveraging this information, we introduce a new technique, Latent Adversarial Training (LAT), which consists of fine-tuning adversarially trained models to ensure robustness at the feature layers. We also propose Latent Attack (LA), a novel algorithm for constructing adversarial examples. LAT results in a minor improvement in test accuracy and leads to state-of-the-art adversarial accuracy against the universal first-order adversarial (PGD) attack, which we demonstrate on the MNIST, CIFAR-10, CIFAR-100, SVHN and Restricted ImageNet datasets.
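To make the idea concrete, the sketch below illustrates what fine-tuning with adversarial perturbations at a latent layer could look like; it is a minimal, assumed PyTorch rendering, not the authors' implementation. The split of the network into `features` and `classifier`, the latent budget `eps_latent`, the PGD step size and iteration count, and the equal weighting of clean and latent-adversarial losses are all illustrative assumptions.

```python
# Illustrative sketch of LAT-style fine-tuning (assumptions, not the paper's code):
# the network is split into `features` and `classifier`, a single latent layer is
# perturbed, and PGD-style steps under an l_inf budget `eps_latent` are applied
# to the latent activations.
import torch
import torch.nn.functional as F

def latent_pgd(latent, classifier, y, eps_latent=0.1, step=0.02, iters=5):
    """Find a small adversarial perturbation of the (detached) latent activations."""
    delta = torch.zeros_like(latent, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(classifier(latent + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Signed-gradient ascent step, projected back into the l_inf ball.
        delta = (delta + step * grad.sign()).clamp(-eps_latent, eps_latent)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

def lat_finetune_step(features, classifier, optimizer, x, y):
    """One fine-tuning step mixing clean and latent-adversarial losses."""
    latent = features(x)
    delta = latent_pgd(latent.detach(), classifier, y)
    loss = 0.5 * F.cross_entropy(classifier(latent), y) \
         + 0.5 * F.cross_entropy(classifier(latent + delta), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```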