Abstract
On many video websites, recommendation is formulated as a prediction problem over user-video
pairs, where videos are represented by text features extracted from their metadata. However, such
metadata is manually annotated by users and is usually missing for online videos. To train an effective
recommender system with lower annotation cost,
we propose an active learning approach to fully exploit the visual view of videos, while querying as
few annotations as possible from the text view. On
the one hand, a joint model is proposed to learn the
mapping from the visual view to the text view by simultaneously aligning the two views and minimizing the
classification loss. On the other hand, a novel strategy based on prediction inconsistency and watching frequency is proposed to actively select the
most important videos for metadata querying. Experiments on both classification datasets and real
video recommendation tasks validate that the proposed approach can significantly reduce the annotation cost.