Abstract. Video object detection is challenging in the presence of
appearance deterioration in certain video frames. A typical solution
is to enhance per-frame features by aggregating neighboring frames.
However, the features of objects are usually not spatially calibrated across
frames due to object and camera motion. In this paper, we propose an end-to-end model called the fully motion-aware network (MANet),
which jointly calibrates the features of objects on both pixel-level and
instance-level in a unified framework. The pixel-level calibration is flexible in modeling detailed motion, while the instance-level calibration captures more global motion cues in order to be robust to occlusion. To the
best of our knowledge, MANet is the first work that jointly trains the two
modules and dynamically combines them according to the motion patterns. It achieves leading performance on the large-scale ImageNet VID
dataset.