Abstract
Surveillance video parsing, which segments video
frames into labeled regions, e.g., face, pants, left-leg, has
wide applications [41, 8]. However, annotating every frame pixel by pixel is tedious and inefficient. In this paper, we
develop a Single frame Video Parsing (SVP) method which
requires only one labeled frame per video in the training stage.
To parse one particular frame, the video segment preceding
the frame is jointly considered. SVP (i) roughly parses the
frames within the video segment, (ii) estimates the optical
flow between frames, and (iii) fuses the rough parsing results, warped by the optical flow, to produce the refined parsing
result. The three components of SVP, namely frame parsing, optical flow estimation, and temporal fusion, are integrated in an end-to-end manner. Experimental results on
two surveillance video datasets show the superiority of SVP
over state-of-the-art methods. The collected video parsing datasets
can be downloaded from http://liusi-group.com/projects/SVP
for further study.
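
The warp-and-fuse idea behind steps (ii) and (iii) can be illustrated with a minimal NumPy sketch. It is not the SVP implementation: in SVP the three components are trained end to end, whereas here the flow is taken as given, the warp uses nearest-neighbour sampling, and the fusion is a fixed weighted average. The function names `warp_by_flow` and `fuse`, the array layouts, and the flow convention (displacement from each target pixel to its source location in the earlier frame) are all assumptions made for illustration.

```python
import numpy as np


def warp_by_flow(prob, flow):
    """Backward-warp a (C, H, W) label-probability map with a (2, H, W) flow.

    Assumed convention: flow[0] and flow[1] are the x and y displacements
    from each target-frame pixel to its source location in the earlier frame.
    Sampling is nearest-neighbour, with coordinates clamped to the border.
    """
    _, h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return prob[:, src_y, src_x]


def fuse(target_prob, warped_probs, weights=None):
    """Fuse the target frame's rough parsing with warped earlier results.

    A fixed weighted average stands in for the learned temporal fusion
    (an assumption of this sketch). Returns an (H, W) map of label indices.
    """
    maps = np.stack([target_prob] + list(warped_probs))  # (K, C, H, W)
    if weights is None:
        weights = np.full(len(maps), 1.0 / len(maps))
    weights = np.asarray(weights, dtype=float)
    fused = np.tensordot(weights, maps, axes=1)  # (C, H, W)
    return fused.argmax(axis=0)
```

Given rough probability maps for the target frame and for one or more preceding frames, each preceding map would first be passed through `warp_by_flow` with its flow to the target frame, and the results handed to `fuse` to obtain the refined label map.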