Abstract
This work examines the robustness of self-attentive neural networks against adversarial input perturbations. Specifically, we investigate the attention and feature extraction mechanisms of state-of-the-art recurrent neural networks and self-attentive architectures for sentiment analysis, entailment, and machine translation under adversarial attacks. We also propose a novel attack algorithm for generating more natural adversarial examples that could mislead neural models but not humans. Experimental results show that, compared to recurrent neural models, self-attentive models are more robust against adversarial perturbation. In addition, we provide theoretical explanations for their superior robustness to support our claims.