Generating Natural Language Adversarial Examples
through Probability Weighted Word Saliency
Abstract
We address the problem of adversarial attacks
on text classification, which is rarely studied
compared to attacks on image classification.
The challenge of this task is to generate adversarial examples that maintain lexical correctness, grammatical correctness, and semantic similarity. Based on the synonym substitution strategy, we introduce a new word replacement order determined by both the word
saliency and the classification probability, and
propose a greedy algorithm called probability
weighted word saliency (PWWS) for text adversarial attack. Experiments on three popular
datasets using convolutional as well as LSTM
models show that PWWS reduces the classification accuracy the most while maintaining
a very low word substitution rate. A human
evaluation study shows that our generated adversarial examples maintain the semantic similarity well and are hard for humans to perceive.
Performing adversarial training using our perturbed datasets improves the robustness of the
models. Finally, the adversarial examples generated by our method also exhibit good transferability across models.
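The probability-weighted scoring sketched in the abstract can be illustrated as follows. This is a rough, simplified sketch, not the paper's full algorithm: `prob` (the victim classifier's probability for the true label) and `synonyms` (e.g. a WordNet lookup) are hypothetical stand-ins supplied by the caller, and the `"<unk>"` mask used to estimate word saliency is an assumption.

```python
import math

def pwws_order(words, synonyms, prob, true_label):
    """Rank candidate substitutions in the spirit of PWWS: each word's
    score weights the classification-probability drop of its best
    synonym by a softmax over word saliency. `prob(words, label)` and
    `synonyms(word)` are assumed stand-ins for a trained classifier
    and a synonym source."""
    base = prob(words, true_label)

    # Word saliency: probability drop when the word is masked out.
    saliency = []
    for i in range(len(words)):
        masked = words[:i] + ["<unk>"] + words[i + 1:]
        saliency.append(base - prob(masked, true_label))

    # Softmax-normalize saliency so it acts as a weight.
    exps = [math.exp(s) for s in saliency]
    total = sum(exps)
    weights = [e / total for e in exps]

    # For each position, pick the synonym maximizing the probability drop,
    # then score it by (saliency weight) * (probability drop).
    scored = []
    for i, w in enumerate(words):
        candidates = synonyms(w)
        if not candidates:
            continue
        def drop(c, i=i):
            return base - prob(words[:i] + [c] + words[i + 1:], true_label)
        best = max(candidates, key=drop)
        scored.append((weights[i] * drop(best), i, best))

    # Greedy replacement order: highest score first.
    scored.sort(reverse=True)
    return scored
```

Words would then be replaced greedily in this order until the classifier's prediction flips, which is what keeps the substitution rate low.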