Abstract. In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components with sentences in the
video. However, existing studies follow a common procedure that performs
frame-level appearance and motion modeling on frames sampled at equal intervals,
which may introduce redundant visual information, sensitivity to
content noise, and unnecessary computational cost. We propose a plug-and-play
PickNet to perform informative frame picking in video captioning. Based on
a standard encoder-decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each
frame-picking action is designed to maximize visual diversity and minimize the discrepancy between the generated caption and the ground truth. The rewarded
candidate is selected, and the corresponding latent representation of the encoder-decoder is updated for future trials. This procedure continues until the end of
the video sequence. Consequently, a compact frame subset can be selected to represent the visual information and perform video captioning without performance
degradation. Experimental results show that our model achieves competitive
performance across popular benchmarks while using only 6–8 frames.
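As an illustrative sketch only (not the paper's exact formulation), a per-step reward of the kind described above might combine a visual-diversity term with a language-fidelity term, for example
\[
r_t = \lambda_v \Big(1 - \max_{j < t} \cos(\mathbf{v}_t, \mathbf{v}_j)\Big) + \lambda_l \, s(\hat{c}, c^{*}),
\]
where $\mathbf{v}_t$ denotes the feature of the frame picked at step $t$, $s(\hat{c}, c^{*})$ is a standard caption-quality score (e.g., CIDEr) between the generated caption $\hat{c}$ and the ground truth $c^{*}$, and $\lambda_v$, $\lambda_l$ are hypothetical trade-off weights; the actual reward design is detailed in the method section.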