Abstract
Analyzing human actions in videos has gained increased
attention recently. While most works focus on classifying
and labeling observed video frames or anticipating the very
recent future, making long-term predictions over more than
just a few seconds is a task with many practical applications
that has not yet been addressed. In this paper, we propose
two methods to predict a considerably large amount of future actions and their durations. Both, a CNN and an RNN
are trained to learn future video labels based on previously
seen content. We show that our methods generate accurate
predictions of the future even for long videos with a huge
amount of different actions and can even deal with noisy or
erroneous input information.