Abstract
In this paper, we present a detailed design of the Dynamic Video Segmentation Network (DVSNet) for fast and efficient semantic video segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates highly accurate semantic segmentations, but is deeper and slower. The latter is much faster than the former, but its output requires further processing and yields less accurate segmentations. We explore the use of a decision network to adaptively assign different frame regions to different networks
based on a metric called expected confidence score. Frame
regions with a higher expected confidence score traverse the
flow network. Frame regions with a lower expected confidence score have to pass through the segmentation network. We have performed extensive experiments on various configurations of DVSNet, and investigated a number
of variants for the proposed decision network. The experimental results show that our DVSNet is able to achieve up
to 70.4% mIoU at 19.8 fps on the Cityscapes dataset. A high-speed version of DVSNet is able to deliver a frame rate of 30.4 fps
with 63.2% mIoU on the same dataset. DVSNet is also able
to reduce up to 95% of the computational workload.
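The routing rule described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the functions `decision_net`, `flow_net`, `seg_net`, and `warp` are hypothetical stand-ins (here simple list operations), and the threshold value is assumed for illustration.

```python
def decision_net(region):
    # Stand-in for the decision network: mean intensity as a
    # proxy for the expected confidence score of this region.
    return sum(region) / len(region)

def flow_net(region):
    # Stand-in for the flow network: zero displacement everywhere.
    return [0.0] * len(region)

def warp(cached_output, flow):
    # Warp the cached key-frame segmentation by the estimated flow.
    return [o + f for o, f in zip(cached_output, flow)]

def seg_net(region):
    # Stand-in for the deep segmentation network: per-pixel threshold.
    return [1 if p > 0.5 else 0 for p in region]

def segment_region(region, cached_output, threshold=0.9):
    """Route a frame region to the fast or accurate path by its
    expected confidence score (threshold is an assumed value)."""
    score = decision_net(region)
    if score >= threshold:
        # Fast path: warp the cached segmentation with optical flow.
        return warp(cached_output, flow_net(region))
    # Slow path: run the full segmentation network on this region.
    return seg_net(region)
```

The design intent is that most regions change little between frames and can take the cheap flow path, so the expensive segmentation network only runs on the minority of regions whose expected confidence is low.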