Abstract
In recent years, both online retail and video hosting services have grown exponentially. In this paper, a novel
deep neural network, called AsymNet, is proposed to explore a new cross-domain task, Video2Shop, which aims to
match clothes appearing in videos to exactly the same
items in online shops. For the image side, well-established
methods are used to detect and extract features for clothing patches with arbitrary sizes. For the video side, deep
visual features are extracted from detected object regions in each frame and then fed into a Long Short-Term
Memory (LSTM) framework for sequence modeling, which
captures the temporal dynamics in videos. To conduct exact matching between videos and online shopping images,
LSTM hidden states for videos and image features extracted
from static images are jointly modeled by a similarity
network with a reconfigurable deep tree structure. Moreover,
an approximate training method is proposed to improve
training efficiency. Extensive experiments conducted
on a large cross-domain dataset have demonstrated the effectiveness and efficiency of the proposed AsymNet, which
outperforms state-of-the-art methods.