Abstract
Adversarial deception has recently become one of the
most serious threats to deep neural networks.
However, compared with the extensive research on new
designs of adversarial attacks and defenses,
the intrinsic robustness of neural networks
still lacks thorough investigation. This work
aims to qualitatively interpret the adversarial attack and defense mechanisms through loss visualization, and to establish a quantitative metric for evaluating a neural network model's intrinsic robustness.
The proposed robustness metric identifies the upper bound of a model's prediction divergence in
a given domain and thus indicates whether the
model can maintain a stable prediction. Extensive experiments show that our metric offers several advantages over conventional robustness estimation based on adversarial testing accuracy: (1) it
provides a uniform evaluation for models with different structures and parameter scales; (2) it outperforms accuracy-based robustness
estimation by providing a more reliable evaluation
that is invariant across different test settings; (3) it can
be computed quickly without considerable testing cost.
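
To make the notion of prediction divergence concrete, the following is a minimal sketch, not the paper's actual metric: it estimates the worst-case divergence of a model's softmax output over an ℓ∞-ball around an input by random sampling. The choice of KL divergence, the radius eps, and the sampling scheme are all illustrative assumptions rather than details taken from this work.

```python
import torch
import torch.nn.functional as F

def prediction_divergence_estimate(model, x, eps=0.03, n_samples=256):
    """Monte Carlo estimate of the worst-case prediction divergence
    (KL between clean and perturbed softmax outputs) over an
    l-infinity ball of radius eps around the input batch x.

    NOTE: illustrative sketch only; the divergence measure and the
    sampling scheme are assumptions, not the paper's definition.
    """
    model.eval()
    with torch.no_grad():
        p_clean = F.softmax(model(x), dim=-1)
        worst = torch.zeros(x.size(0), device=x.device)
        for _ in range(n_samples):
            # Sample a random perturbation inside the l-infinity ball.
            delta = torch.empty_like(x).uniform_(-eps, eps)
            p_pert = F.softmax(model(x + delta), dim=-1)
            # Per-example KL(p_clean || p_pert), clamped for stability.
            kl = (p_clean * (p_clean.clamp_min(1e-12).log()
                             - p_pert.clamp_min(1e-12).log())).sum(dim=-1)
            worst = torch.maximum(worst, kl)
    return worst  # larger value => less stable prediction in the ball
```

A sampling-based estimate like this can only lower-bound the true worst case; the metric proposed in the paper instead targets the upper bound of the model's prediction divergence over the given domain.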