DeepInspect: A Black-box Trojan Detection and Mitigation Framework for
Deep Neural Networks
Abstract
Deep Neural Networks (DNNs) are vulnerable to
Neural Trojan (NT) attacks where the adversary injects malicious behaviors during DNN training. This
type of ‘backdoor’ attack is activated when the input
is stamped with the trigger pattern specified by the
attacker, resulting in an incorrect prediction of the
model. Due to the wide application of DNNs in
various critical fields, it is indispensable to inspect
whether a pre-trained DNN has been trojaned before deploying it. Our goal in this paper is to address the security concern of unknown DNNs being vulnerable to NT attacks and to ensure safe model deployment.
We propose DeepInspect, the first black-box Trojan
detection solution with minimal prior knowledge
of the model. DeepInspect learns the probability
distribution of potential triggers from the queried
model using a conditional generative model, thereby recovering the footprint of backdoor insertion. In addition to NT detection, we show that DeepInspect’s
trigger generator enables effective Trojan mitigation by model patching. We corroborate the effectiveness, efficiency, and scalability of DeepInspect
against the state-of-the-art NT attacks across various benchmarks. Extensive experiments show that
DeepInspect offers superior detection performance
and lower runtime overhead than prior work.
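To make the core idea concrete: DeepInspect conditions a generative model on a target class and trains it to produce candidate trigger patterns by querying the inspected model. The snippet below is a minimal shape-level sketch of that interface only, not the paper's actual architecture or training procedure; all dimensions, the linear generator, and the `stamp` blending rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
NUM_CLASSES, NOISE_DIM, IMG_SIZE = 10, 16, 8

# A toy linear "conditional generator": maps (noise, one-hot target class)
# to a trigger pattern in [0, 1]. DeepInspect trains a neural generator
# against the queried black-box model; this stands in for its interface.
W = rng.normal(scale=0.1, size=(NOISE_DIM + NUM_CLASSES, IMG_SIZE * IMG_SIZE))

def generate_trigger(target_class: int) -> np.ndarray:
    """Sample a candidate trigger conditioned on the target class."""
    z = rng.normal(size=NOISE_DIM)                    # random noise input
    onehot = np.eye(NUM_CLASSES)[target_class]        # class conditioning
    logits = np.concatenate([z, onehot]) @ W
    # Sigmoid keeps trigger pixel values in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits.reshape(IMG_SIZE, IMG_SIZE)))

def stamp(image: np.ndarray, trigger: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Overlay a generated trigger onto a clean input (simple blending)."""
    return np.clip((1 - alpha) * image + alpha * trigger, 0.0, 1.0)

clean = rng.uniform(size=(IMG_SIZE, IMG_SIZE))
trig = generate_trigger(target_class=3)
patched = stamp(clean, trig)
```

In the actual framework, the recovered triggers serve double duty: their statistics feed the detection test, and stamped inputs like `patched` can be relabeled correctly and used to fine-tune (patch) a trojaned model.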