Abstract
Recent advances in Deep Learning show the existence
of image-agnostic, quasi-imperceptible perturbations that,
when applied to ‘any’ image, can fool a state-of-the-art network classifier into changing its prediction of the image
label. These ‘Universal Adversarial Perturbations’ pose
a serious threat to the success of Deep Learning in practice. We present the first dedicated framework to effectively
defend networks against such perturbations. Our approach learns a Perturbation Rectifying Network (PRN) as
‘pre-input’ layers to a targeted model, such that the targeted
model needs no modification. The PRN is learned from real
and synthetic image-agnostic perturbations, and we also propose an
efficient method to compute the latter. A
perturbation detector is separately trained on the Discrete
Cosine Transform of the input-output difference of the PRN.
A query image is first passed through the PRN and then checked
by the detector. If a perturbation is detected, the output of
the PRN is used for label prediction instead of the actual image. A rigorous evaluation shows that our framework can
defend network classifiers against unseen adversarial
perturbations in real-world scenarios with up to 97.5%
success rate. The PRN also generalizes well, in the sense
that a PRN trained for one targeted network defends another network with a comparable success rate.
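
The test-time flow described above (rectify, detect on the Discrete Cosine Transform of the input-output difference, then choose which image to classify) can be illustrated with a minimal sketch. It is not the implementation from this work: the function names `defend`, `toy_prn`, `toy_detector` and `toy_classifier`, the fixed checkerboard `PATTERN`, the energy threshold, and the single-channel 8x8 images are all hypothetical stand-ins chosen so the example runs without trained models. In the actual framework the PRN and the detector are learned.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # rescale the DC row for orthonormality
    return c

def dct2(x):
    """2D DCT-II of a single-channel image (assumption: grayscale input)."""
    h, w = x.shape
    return dct_matrix(h) @ x @ dct_matrix(w).T

def defend(image, prn, detector, classifier):
    """Rectify, run the detector on the DCT of the PRN input-output
    difference, and classify the rectified image only if flagged."""
    rectified = prn(image)
    features = dct2(rectified - image)
    if detector(features):
        return classifier(rectified), True
    return classifier(image), False

# --- Hypothetical stand-ins for the learned components ---
rows, cols = np.indices((8, 8))
PATTERN = 0.1 * np.where((rows + cols) % 2 == 0, 1.0, -1.0)  # toy "universal" perturbation

def toy_prn(image):
    # Removes the image's component along PATTERN (a learned PRN would be a network).
    coeff = np.sum(image * PATTERN) / np.sum(PATTERN ** 2)
    return image - coeff * PATTERN

def toy_detector(features):
    # Flags the image when the DCT energy of the difference exceeds an arbitrary threshold.
    return float(np.sum(features ** 2)) > 0.1

def toy_classifier(image):
    # Toy two-class model that the checkerboard perturbation can fool.
    return "textured" if image.std() > 0.05 else "flat"

clean = np.full((8, 8), 0.5)
perturbed = clean + PATTERN

clean_result = defend(clean, toy_prn, toy_detector, toy_classifier)
perturbed_result = defend(perturbed, toy_prn, toy_detector, toy_classifier)
undefended_label = toy_classifier(perturbed)
```

In this toy setting the undefended classifier changes its label on the perturbed image, while `defend` flags the perturbation and classifies the rectified image instead. The targeted classifier is only called, never modified, which mirrors the ‘pre-input’ role of the PRN.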