Abstract
Adversarial examples deliberately induce classification errors, raising concerns about the security of machine learning systems. Many existing countermeasures are defeated by adaptive adversaries and transferred examples. We propose a model-agnostic approach that detects adversarial examples by analysing a model's responses to an input under random perturbations, and we study the robustness of detecting norm-bounded adversarial distortions within a theoretical framework. Extensive evaluations are performed on the MNIST, CIFAR-10 and ImageNet datasets. The results demonstrate that our detection method is effective and resilient against various attacks, including black-box attacks and the powerful CW attack with four adversarial adaptations.
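A minimal sketch of the core idea described above, assuming a generic classifier exposed through a `predict` function that maps a batch of inputs to class labels; the test statistic, noise scale, and decision threshold here are illustrative placeholders, not the paper's exact procedure.

```python
import numpy as np

def detect_adversarial(predict, x, n_samples=100, sigma=0.05, threshold=0.3):
    """Flag x as adversarial if the model's prediction is unstable
    under random Gaussian perturbations. `predict`, sigma, and
    threshold are hypothetical stand-ins for this sketch."""
    base_label = predict(x[None])[0]
    # Sample noisy copies of x; clipping assumes inputs in [0, 1].
    noise = np.random.normal(0.0, sigma, size=(n_samples,) + x.shape)
    noisy_labels = predict(np.clip(x[None] + noise, 0.0, 1.0))
    # Adversarial inputs tend to lie near decision boundaries, so
    # their labels flip more often under small random perturbations.
    flip_rate = np.mean(noisy_labels != base_label)
    return flip_rate > threshold
```

The design choice this illustrates: rather than inspecting the model or the input directly, the detector queries the model's behaviour in a neighbourhood of the input, which is why the approach is model-agnostic.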