Abstract. The key to image-text matching is how to accurately measure the similarity between visual and textual inputs. Despite the great progress made by associating deep cross-modal embeddings with the bi-directional ranking loss, developing strategies for mining useful triplets and selecting appropriate margins remains a challenge in real
applications. In this paper, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC)
loss for learning discriminative image-text embeddings. The CMPM loss
minimizes the KL divergence between the projection compatibility distributions and the normalized matching distributions defined with all
the positive and negative samples in a mini-batch. The CMPC loss attempts to categorize the vector projection of representations from one modality onto the other with an improved norm-softmax loss, further enhancing the feature compactness within each class. Extensive analysis
and experiments on multiple datasets demonstrate the superiority of the
proposed approach.
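To make the two objectives concrete, the following is a minimal PyTorch sketch of both losses as we read them from the description above. It is a sketch under stated assumptions, not the authors' released implementation: the function names, the construction of the matching matrix from shared identity labels, the symmetric two-direction sum in CMPM, and the shared classifier weight matrix in CMPC are all our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_embeds, text_embeds, labels, eps=1e-8):
    """CMPM sketch: KL divergence between the projection compatibility
    distributions and the normalized matching distributions in a mini-batch."""
    # y[i, j] = 1 if image i and text j form a positive (matched) pair
    y = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    q = y / y.sum(dim=1, keepdim=True)  # normalized matching distribution
    # Image-to-text: project image features onto normalized text features
    p_i2t = F.softmax(image_embeds @ F.normalize(text_embeds, dim=1).t(), dim=1)
    # Text-to-image: the symmetric direction
    p_t2i = F.softmax(text_embeds @ F.normalize(image_embeds, dim=1).t(), dim=1)
    kl = lambda p: (p * (torch.log(p + eps) - torch.log(q + eps))).sum(1).mean()
    return kl(p_i2t) + kl(p_t2i)

def cmpc_loss(image_embeds, text_embeds, weight, labels):
    """CMPC sketch: classify the vector projection of one modality onto
    the other with a norm-softmax classifier (L2-normalized class weights)."""
    w = F.normalize(weight, dim=1)      # norm-softmax: each class weight has unit norm
    t_norm = F.normalize(text_embeds, dim=1)
    i_norm = F.normalize(image_embeds, dim=1)
    # Vector projection of each image feature onto its paired text feature (and vice versa)
    img_proj = (image_embeds * t_norm).sum(1, keepdim=True) * t_norm
    txt_proj = (text_embeds * i_norm).sum(1, keepdim=True) * i_norm
    return (F.cross_entropy(img_proj @ w.t(), labels)
            + F.cross_entropy(txt_proj @ w.t(), labels))
```

Here `labels` are assumed to be identity/class labels so that `y` marks every positive pair in the batch (at minimum the diagonal), and `weight` plays the role of the norm-softmax classifier matrix with one row per class.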