Abstract
This paper addresses temporal segmentation of human
actions in videos. We introduce a new model – temporal
deformable residual network (TDRN) – aimed at analyzing
video intervals at multiple temporal scales for labeling video
frames. Our TDRN computes two parallel temporal streams:
i) a residual stream that analyzes video information at its full
temporal resolution, and ii) a pooling/unpooling stream that
captures long-range video information at different scales.
The former facilitates local, fine-scale action segmentation,
and the latter uses multiscale context for improving accuracy
of frame classification. These two streams are computed by
a set of temporal residual modules with deformable
convolutions, and fused by temporal residuals at the full video
resolution. Our evaluation on the University of Dundee 50
Salads, Georgia Tech Egocentric Activities, and JHU-ISI
Gesture and Skill Assessment Working Set demonstrates that
TDRN outperforms the state of the art in frame-wise
segmentation accuracy, segmental edit score, and segmental overlap
F1 score.
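
The abstract names three evaluation metrics without defining them. The sketch below illustrates how such metrics are commonly computed in the action segmentation literature; it is not taken from this paper. It assumes that the edit score is a normalized Levenshtein distance over segment label sequences, and that the overlap F1 score counts a predicted segment as correct when its intersection-over-union (IoU) with an unmatched ground-truth segment of the same label reaches a threshold. The function names and the default threshold of 0.5 are hypothetical choices made for illustration.

```python
from itertools import groupby


def segments(labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    out, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        out.append((label, start, start + length))
        start += length
    return out


def frame_accuracy(pred, true):
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred, true))
    return 100.0 * correct / len(true)


def edit_score(pred, true):
    """Assumed definition: 100 * (1 - Levenshtein / max length) over segment labels."""
    p = [s[0] for s in segments(pred)]
    t = [s[0] for s in segments(true)]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute
        prev = cur
    return 100.0 * (1.0 - prev[-1] / max(len(p), len(t)))


def overlap_f1(pred, true, threshold=0.5):
    """Assumed definition: segment-level F1 with IoU-based matching."""
    p_segs, t_segs = segments(pred), segments(true)
    used = [False] * len(t_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_idx = 0.0, -1
        for idx, (t_label, ts, te) in enumerate(t_segs):
            if t_label != label or used[idx]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            hull = max(pe, te) - min(ps, ts)  # equals the union when segments overlap
            iou = inter / hull
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= threshold:
            tp += 1
            used[best_idx] = True  # each ground-truth segment matches at most once
    if tp == 0:
        return 0.0
    precision = tp / len(p_segs)
    recall = tp / len(t_segs)
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example: one boundary is predicted a frame too early.
    true = [0, 0, 0, 1, 1, 1, 2, 2]
    pred = [0, 0, 1, 1, 1, 1, 2, 2]
    print(frame_accuracy(pred, true), edit_score(pred, true), overlap_f1(pred, true))
```

Under these assumed definitions, frame-wise accuracy penalizes every misplaced frame, whereas the two segmental metrics tolerate small boundary shifts and instead penalize over-segmentation, which is why the three are typically reported together.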