Abstract
Integrating complementary features from multiple channels is expected to solve the description ambiguity problem in video captioning, but inappropriate fusion strategies often harm rather than help performance. Existing static fusion methods in video captioning, such as concatenation and summation, cannot attend to the appropriate feature channels and thus fail to adaptively support the recognition of various visual entities such as actions and objects. This paper makes three contributions: 1) the first in-depth study of the weaknesses inherent in data-driven static fusion methods for video captioning; 2) a task-driven dynamic fusion (TDDF) method that adaptively chooses different fusion patterns according to the model's status; and 3) improved video captioning performance. Extensive experiments on two well-known benchmarks demonstrate that our dynamic fusion method outperforms state-of-the-art results on MSVD with a METEOR score of 0.333 and achieves a superior METEOR score of 0.278 on MSR-VTT-10K. Compared to single features, our fusion method yields relative improvements of 10.0% and 5.7% on the two datasets, respectively.