Abstract
This paper presents a simple yet effective character-level architecture for learning bidirectional retrieval models. Aligning multimodal content is particularly challenging given the difficulty of finding semantic correspondences between images and their descriptions. We introduce an efficient character-level inception module, designed to learn textual semantic embeddings by convolving raw characters at distinct granularity levels. Our approach is capable of
explicitly encoding hierarchical information from distinct
base-level representations (e.g., characters, words, and sentences) into a shared multimodal space, where it learns the semantic correspondence between images and descriptions via a contrastive pairwise loss function that minimizes order violations. Models generated by our approach are far
more robust to input noise than state-of-the-art strategies
based on word embeddings. Despite being conceptually much simpler and requiring fewer parameters, our models outperform state-of-the-art approaches by 4.8% in the task of description retrieval and by 2.7% in the task of image retrieval (absolute R@1 values) on the popular MS COCO retrieval dataset. We also show that our models achieve solid performance on text classification, especially in multilingual and noisy domains.
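For concreteness, a minimal PyTorch sketch of how a character-level inception module could convolve raw characters at distinct granularity levels; the vocabulary size, kernel widths, channel counts, and max-over-time pooling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharInceptionModule(nn.Module):
    """Illustrative character-level inception module: parallel 1-D
    convolutions with different kernel widths act as character n-gram
    detectors at distinct granularity levels (assumed hyperparameters)."""

    def __init__(self, vocab_size=128, embed_dim=32,
                 channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One convolutional branch per granularity level.
        self.branches = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, char_ids):                  # (batch, seq_len) int ids
        x = self.embed(char_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Convolve raw characters at each granularity, then concatenate.
        feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        # Max-pool over time to obtain a fixed-size textual embedding.
        return feats.max(dim=2).values  # (batch, channels * len(kernel_sizes))

# Usage example: embed a batch of two 100-character sequences.
emb = CharInceptionModule()(torch.randint(0, 128, (2, 100)))
```

Concatenating the pooled activations of branches with different kernel widths yields a single textual embedding that mixes fine-grained (short n-gram) and coarser (long n-gram) character evidence.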
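The contrastive pairwise loss that minimizes order violations can be read in the spirit of the order-embedding objective of Vendrov et al. (2016); the following is a sketch under that assumption, not the paper's exact formulation, where $i$ and $d$ denote image and description embeddings, primes mark non-matching (contrastive) samples, $\max$ is taken element-wise, and $\alpha$ is a margin:

```latex
% Order-violation penalty between an image embedding i and a description d
\[
  E(i, d) \;=\; \left\lVert \max\!\left(0,\; i - d\right) \right\rVert^{2}
\]
% Contrastive pairwise loss over non-matching pairs (i, d') and (i', d)
\[
  \mathcal{L} \;=\; \sum_{(i,d)} \Big(
      \sum_{d'} \max\!\left\{0,\; \alpha + E(i,d) - E(i,d')\right\}
    \;+\; \sum_{i'} \max\!\left\{0,\; \alpha + E(i,d) - E(i',d)\right\}
  \Big)
\]
```

Here $E(i, d) = 0$ exactly when the pair satisfies the partial order, so minimizing $\mathcal{L}$ pushes matching pairs toward zero violation while keeping non-matching pairs at least a margin $\alpha$ away.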