Abstract
Interpretability of a deep neural network aims to explain
the rationale behind its decisions and help users understand
the intelligent agent, which has become an
important issue in practical applications. To address this issue, we develop a Distillation
Guided Routing method, which is a flexible framework to
interpret a deep neural network by identifying critical data
routing paths and analyzing the functional processing behavior of the corresponding layers. Specifically, we propose
to discover the critical nodes on the data routing paths during the network's inference on individual input samples by learning an associated control gate for each layer's
output channel. The routing paths can therefore be represented by the responses of the concatenated control gates
from all the layers, which reflect the network's semantic selectivity with respect to the input patterns and the detailed
functional processes across different layer levels. Based on
the discoveries, we propose an adversarial sample detection
algorithm by learning a classifier to discriminate whether
the critical data routing paths are from real or adversarial samples. Experiments demonstrate that our algorithm
achieves a high defense rate with only minor training overhead.
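To make the gating idea concrete, the following is a minimal pure-Python sketch of channel-wise control gates and the resulting routing-path encoding. All names (`apply_control_gates`, `routing_path`) and the 0.1 threshold are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch only: function names and the threshold value
# are assumptions, not the authors' actual code.

def apply_control_gates(feature_map, gates):
    """Scale each output channel of a layer by its control gate.

    feature_map: nested lists of shape (channels, H, W)
    gates: one scalar per channel; near-zero gates suppress channels
           that are not critical for this particular input sample.
    """
    return [[[g * v for v in row] for row in channel]
            for channel, g in zip(feature_map, gates)]

def routing_path(gates_per_layer, threshold=0.1):
    """Concatenate gate responses from all layers into one vector and
    mark channels above the threshold as critical nodes on the path."""
    concat = [g for layer_gates in gates_per_layer for g in layer_gates]
    return concat, [g > threshold for g in concat]

# Toy example: two layers with 3 and 4 output channels.
gates1 = [0.9, 0.05, 0.7]        # channel 1 is nearly gated off
gates2 = [0.0, 0.8, 0.6, 0.02]
path, critical = routing_path([gates1, gates2])
# `critical` is the binary routing-path encoding that a downstream
# real-vs-adversarial classifier could consume as a feature vector.
```

In a real network the gates would be learned per input sample (e.g. by minimizing a distillation loss while encouraging gate sparsity), and the concatenated gate vector serves as the input feature for the adversarial-sample detector.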