Abstract
In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
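To make the REINFORCE learning rule mentioned above concrete, the following is a minimal, self-contained sketch of the policy-gradient update on a toy two-armed bandit. This is an illustrative assumption of ours, not the paper's model: the actual agent is a recurrent network observing video frames, whereas here the "policy" is just a softmax over two logits and the "decision" is which arm to pull.

```python
import numpy as np

# Toy REINFORCE sketch (our own assumption for illustration; the paper's
# agent instead decides where to look next and when to emit a prediction).
rng = np.random.default_rng(0)
theta = np.zeros(2)          # logits of a softmax policy over two actions
true_rewards = [0.2, 0.8]    # action 1 is better in expectation (toy setup)
alpha = 0.1                  # learning rate

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)                   # sample an action from the policy
    r = float(rng.random() < true_rewards[a])  # Bernoulli reward
    # REINFORCE: grad of log pi(a) w.r.t. theta is one_hot(a) - p
    # for a softmax policy; scale by the observed reward.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi         # gradient ascent on E[r]

print(softmax(theta))  # the policy shifts probability mass toward action 1
```

The key point REINFORCE addresses, both here and in the paper, is that sampling discrete actions is non-differentiable, so the gradient of the expected reward is estimated from sampled actions rather than computed by ordinary backpropagation through the decision.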