Abstract
Cross-modal tasks arise naturally for multimedia content that can be described along two or more modalities, such as visual content and text. Such tasks require “translating” information from one modality to another. Methods like kernelized canonical correlation analysis (KCCA) attempt to solve such tasks by finding aligned subspaces in the description spaces of the different modalities. Because they favor correlations over modality-specific information, these methods have shown some success in both cross-modal and bi-modal tasks. However, we show that directly using the subspace alignment obtained by KCCA yields only coarse translation abilities. To address this problem, we first put forward a new representation method that aggregates the information provided by the projections of both modalities onto their aligned subspaces. We further propose a method relying on neighborhoods in these subspaces to complete uni-modal information. Our approach achieves state-of-the-art results for bi-modal classification on Pascal VOC07 and improves the state of the art by over 60% for cross-modal retrieval on Flickr8K/30K.