Abstract
In this work, we study a poorly understood trade-off between accuracy and runtime cost in deep semantic video segmentation. While recent work has demonstrated the advantages of learning to speed up deep activity detection, it is not clear whether similar advantages hold for our very different segmentation loss, which is defined over individual pixels across frames. In deep video segmentation, the most time-consuming step is applying a CNN to every frame to assign a class label to every pixel, which typically takes 6-9 times the duration of the video footage. This motivates our new budget-aware framework, which learns to optimally select a small subset of frames for pixelwise labeling by a CNN, and then efficiently interpolates the resulting segmentations to the remaining, unprocessed frames. This interpolation may use either a simple optical-flow-guided mapping of pixel labels or a second, significantly less complex and thus faster CNN. We formalize frame selection as a Markov Decision Process and specify a Long Short-Term Memory (LSTM) network to model the policy for selecting frames. To train the LSTM, we develop a policy-gradient reinforcement-learning approach that approximates the gradient of our non-decomposable and non-differentiable objective. Evaluation on two benchmark video datasets shows that our framework significantly reduces computation time while maintaining competitive video segmentation accuracy under varying budgets.