DeepVoting: A Robust and Explainable Deep Network
for Semantic Part Detection under Partial Occlusion
Abstract
In this paper, we study the task of detecting semantic parts of an object, e.g., a wheel of a car, under partial occlusion. We propose that all models should be trained without seeing occlusions while still being able to transfer the learned knowledge to deal with occlusions. This setting alleviates the difficulty of collecting an exponentially large dataset to cover all occlusion patterns, and is a more fundamental test of robustness. In this scenario, proposal-based deep networks, such as the R-CNN series, often produce unsatisfactory results, because both the proposal extraction and the classification stages can be confused by irrelevant occluders. To address this, [25] proposed a voting mechanism that combines multiple local visual cues to detect semantic parts, so that a part can still be detected even when some of its visual cues are missing due to occlusion. However, that method is manually designed and therefore hard to optimize in an end-to-end manner.
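To make the voting idea concrete, the sketch below is a minimal, hypothetical illustration in Python/NumPy (the function name, the firing threshold, and the data layout are our assumptions, not the exact formulation of [25]): each firing visual cue casts a weighted vote for the part center at a learned spatial offset, so occluding some cues merely lowers the peak of the vote map instead of removing it.

    import numpy as np

    def vote_for_part(cue_maps, offsets, weights, out_shape, fire_thresh=0.5):
        """Accumulate spatial votes for one semantic part.

        cue_maps : (K, H, W) firing scores of K local visual cues.
        offsets  : (K, 2) learned integer (dy, dx) offsets from each cue
                   to the part center.
        weights  : (K,) how strongly each cue supports this part.
        """
        votes = np.zeros(out_shape)
        for k in range(cue_maps.shape[0]):
            dy, dx = offsets[k]
            ys, xs = np.nonzero(cue_maps[k] > fire_thresh)  # where cue k fires
            for y, x in zip(ys, xs):
                py, px = y + dy, x + dx                     # voted part center
                if 0 <= py < out_shape[0] and 0 <= px < out_shape[1]:
                    votes[py, px] += weights[k] * cue_maps[k, y, x]
        return votes  # peaks indicate likely part centers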
In this paper, we present DeepVoting, which incorporates the robustness shown by [25] into a deep network, so that the whole pipeline can be jointly optimized. Specifically, it adds two layers after the intermediate features of a deep network, e.g., the pool-4 layer of VGGNet. The first layer extracts the evidence of local visual cues, and the second layer performs voting by exploiting the spatial relationship between visual cues and semantic parts.
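The following PyTorch sketch shows one way such a two-layer head could look; the class name, the number of cues and parts, and the kernel sizes are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class DeepVotingSketch(nn.Module):
        """Illustrative only: two conv layers on top of VGG-16 pool-4 features.

        num_cues / num_parts and the kernel sizes are assumptions, not the
        paper's exact configuration.
        """
        def __init__(self, num_cues=256, num_parts=39):
            super().__init__()
            # Backbone up to and including pool-4 (VGG-16 feature layers 0..23).
            self.backbone = nn.Sequential(*list(vgg16(weights="DEFAULT").features[:24]))
            # Layer 1: score each local visual cue at every spatial position.
            self.cue_layer = nn.Conv2d(512, num_cues, kernel_size=3, padding=1)
            # Layer 2: vote for part centers; a large kernel lets a cue
            # support a part at a spatial offset from itself.
            self.vote_layer = nn.Conv2d(num_cues, num_parts, kernel_size=15, padding=7)

        def forward(self, images):
            feats = self.backbone(images)             # (B, 512, H/16, W/16)
            cues = torch.relu(self.cue_layer(feats))  # cue evidence maps
            votes = self.vote_layer(cues)             # per-part voting heatmaps
            return votes

Because both added layers are ordinary convolutions, the voting stage is differentiable and, unlike the manually designed pipeline of [25], the whole model can be trained end-to-end.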
We also propose an improved version, DeepVoting+, which learns visual cues from the context outside objects. In experiments, DeepVoting achieves significantly better performance than several baseline methods, including Faster-RCNN, for semantic part detection under occlusion. In addition, DeepVoting is explainable: the detection results can be diagnosed by inspecting the voting cues.