Abstract
Both convolutional and recurrent operations are building
blocks that process one local neighborhood at a time. In
this paper, we present non-local operations as a generic
family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method
[4] in computer vision, our non-local operation computes
the response at a position as a weighted sum of the features
at all positions. This building block can be plugged into
many computer vision architectures. On the task of video
classification, even without any bells and whistles, our nonlocal models can compete or outperform current competition
winners on both Kinetics and Charades datasets. In static
image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite
of tasks. Code will be made available