Abstract
In this work we introduce a time- and memory-efficient
method for structured prediction that couples neuron decisions across both space and time. We show that we are able to perform exact and efficient inference on a densely connected spatio-temporal graph by capitalizing on recent advances in deep Gaussian Conditional Random Fields
(GCRFs). Our method, called VideoGCRF, is (a) efficient, (b) has a unique global minimum, and (c) can
be trained end-to-end alongside contemporary deep networks for video understanding. We experiment with multiple connectivity patterns in the temporal domain, and
present empirical improvements over strong baselines on
the tasks of both semantic and instance segmentation of
videos. Our implementation is based on the Caffe2 framework and will be available at https://github.com/siddharthachandra/gcrf-v3.0