Abstract. We propose a Spatiotemporal Sampling Network (STSN)
that uses deformable convolutions across time for object detection in
videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally
renders the approach robust to occlusion or motion blur in individual
frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state of the art on the ImageNet
VID dataset. Compared to prior video object detection methods, it
uses a simpler design and does not require optical flow data for training.