Abstract
Efficiently building an adversarial attacker for natural language processing (NLP) tasks is a real challenge. First, because the sentence space is discrete, it is difficult to make small perturbations along the direction of gradients. Second, the fluency of the generated examples cannot be guaranteed. In this paper, we propose MHA, which addresses both problems by performing Metropolis-Hastings sampling with a proposal distribution designed under the guidance of gradients. Experiments on IMDB and SNLI show that MHA outperforms the baseline model in attacking capability. Adversarial training with MHA also leads to better robustness and performance.
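
For readers unfamiliar with the sampling step, the sketch below shows a generic Metropolis-Hastings accept/reject step over sentences. It is a minimal illustration, not the paper's implementation: the target density pi, the gradient-guided proposal propose, and the proposal probability proposal_prob are hypothetical placeholders that a user would supply (e.g., pi combining language-model fluency with the victim model's error probability).

```python
import random

def mh_attack_step(x, pi, propose, proposal_prob):
    """One Metropolis-Hastings step over a sentence x (a list of tokens).

    pi(x)                       -- unnormalized target density, e.g. language-model
                                   fluency times the victim model's error probability
    propose(x)                  -- draws a candidate sentence, e.g. a gradient-guided
                                   word replacement, insertion, or deletion
    proposal_prob(x_to, x_from) -- g(x_to | x_from), probability of proposing x_to from x_from
    """
    x_new = propose(x)
    # Acceptance ratio: alpha = min(1, pi(x') g(x | x') / (pi(x) g(x' | x)))
    alpha = min(1.0,
                (pi(x_new) * proposal_prob(x, x_new)) /
                (pi(x) * proposal_prob(x_new, x)))
    # Accept the candidate with probability alpha; otherwise keep the current sentence
    return x_new if random.random() < alpha else x
```

Because the chain's stationary distribution is pi regardless of the proposal, gradient guidance only affects how quickly fluent, misclassified sentences are reached, not which distribution the samples ultimately follow.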