Abstract
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.
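As a hypothetical illustration of the late (score-level) fusion idea referenced in point (3), and not the paper's actual implementation, the short sketch below averages per-class scores produced independently by an appearance (RGB) stream and a motion (optical flow) stream; the score shapes and the fusion weight are assumptions for the example.

    import numpy as np

    def late_fuse_scores(rgb_scores, flow_scores, flow_weight=0.5):
        # rgb_scores, flow_scores: (num_proposals, num_classes) arrays of
        # per-class scores computed separately by the RGB and flow streams.
        # flow_weight is a free hyperparameter, not a value from the paper.
        assert rgb_scores.shape == flow_scores.shape
        return (1.0 - flow_weight) * rgb_scores + flow_weight * flow_scores

    # Toy usage: 3 proposals, 4 action classes.
    rgb = np.random.rand(3, 4)
    flow = np.random.rand(3, 4)
    fused = late_fuse_scores(rgb, flow, flow_weight=0.5)
    predicted_classes = fused.argmax(axis=1)

Fusing at the score level keeps each stream's classifier independent, in contrast to early fusion, where features would be concatenated before classification.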