Abstract
When recommending or advertising items to users,
an emerging trend is to present each multimedia
item with a key frame image (e.g., the poster of
a movie). As each multimedia item can be represented as multiple fine-grained visual images (e.g.,
related images of the movie), personalized key
frame recommendation is necessary in these applications to cater to users’ unique visual preferences.
However, previous personalized key frame recommendation models rely on users’ fine-grained image behavior on multimedia items (e.g., user-image
interaction behavior), which is often not available
in real scenarios. In this paper, we study the general
problem of joint multimedia item and key frame
recommendation in the absence of the fine-grained
user-image behavior. We argue that the key challenge of this problem lies in discovering users’
visual profiles for key frame recommendation, as
most recommendation models would fail without
any fine-grained user image behavior. To tackle
this challenge, we leverage users’ item behavior by
projecting users and items into two latent spaces: a collaborative latent space and a visual latent space. We
further design a model to discern both the collaborative and visual dimensions of users, and model
how users form their final item preferences from
these two spaces. As a result, the learned user visual profiles can be directly applied to key frame
recommendation. Finally, experimental results on
a real-world dataset clearly show the effectiveness
of our proposed model on the two recommendation
tasks.
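
To make the two-space idea concrete, the following is a minimal sketch and not the model proposed in this paper. It assumes that each user has a collaborative vector and a visual vector, that each item has matching vectors, and that the item preference is the plain sum of the two inner products. The additive combination, the function names (`item_score`, `rank_key_frames`), and all numeric values are hypothetical and chosen only for illustration. Once a user's visual vector has been learned from item behavior, candidate key frames embedded in the same visual space can be ranked with that vector alone.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def item_score(
    user_collab: Sequence[float],
    user_visual: Sequence[float],
    item_collab: Sequence[float],
    item_visual: Sequence[float],
) -> float:
    """Predicted item preference.

    Assumption: the collaborative and visual contributions are simply added.
    The abstract does not specify how the two spaces are combined.
    """
    return dot(user_collab, item_collab) + dot(user_visual, item_visual)


def rank_key_frames(
    user_visual: Sequence[float],
    frames: Dict[str, Sequence[float]],
) -> List[str]:
    """Rank candidate key frames using only the user's visual vector.

    No user-image interaction data is needed at this stage; the visual
    vector is assumed to have been learned from user-item behavior.
    """
    return sorted(
        frames,
        key=lambda name: dot(user_visual, frames[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings, for illustration only.
    user_collab = [0.5, -0.2]
    user_visual = [1.0, 0.0]
    item_collab = [0.4, 0.1]
    item_visual = [0.6, 0.8]
    frames = {
        "frame_a": [0.1, 0.9],
        "frame_b": [0.8, 0.2],
        "frame_c": [0.5, 0.5],
    }
    print(item_score(user_collab, user_visual, item_collab, item_visual))
    print(rank_key_frames(user_visual, frames))
```

In this sketch the same visual vector serves both tasks: it contributes to the item score and, on its own, orders the candidate frames of an item.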