Abstract
Image-text matching is a vital cross-modal task
in artificial intelligence and has attracted increasing attention in recent years. Existing works have
shown that learning semantic concepts is useful to enhance image representation and can significantly improve the performance of both image-to-text and text-to-image retrieval. However, existing models simply detect semantic concepts from a given image, and are therefore unlikely to handle long-tail and occluded concepts. Concepts that frequently co-occur in the same scene, e.g., bedroom
and bed, can provide common-sense knowledge to
discover other semantic-related concepts. In this
paper, we develop a Scene Concept Graph (SCG)
by aggregating image scene graphs and extracting frequently co-occurring concept pairs as scene
common-sense knowledge. Moreover, we propose
a novel model that incorporates this knowledge to improve image-text matching. Specifically, semantic concepts are detected from images and then expanded via the SCG. After learning to select the relevant contextual concepts, we fuse their representations with the image embedding and feed the result into the matching module. Extensive experiments are
conducted on Flickr30K and MSCOCO datasets,
and show that our model achieves state-of-the-art results, demonstrating the effectiveness of incorporating the external SCG.
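As a rough illustration of the SCG construction idea only (not the paper's actual pipeline), the following Python sketch extracts frequently co-occurring concept pairs from a collection of scene graphs; the function name, the `min_count` threshold, and the set-of-concepts input format are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def build_scene_concept_graph(scene_graphs, min_count=2):
    """Count how often concept pairs co-occur in the same scene graph.

    `scene_graphs` is a list of per-image concept sets; the output maps
    each sufficiently frequent pair to its co-occurrence count.
    Names and threshold are illustrative, not from the paper.
    """
    pair_counts = Counter()
    for concepts in scene_graphs:
        # Sorting gives each unordered pair one canonical key.
        for a, b in combinations(sorted(set(concepts)), 2):
            pair_counts[(a, b)] += 1
    # Keep only pairs frequent enough to act as common-sense knowledge.
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Toy usage: "bed" and "bedroom" co-occur in both scenes.
graphs = [{"bedroom", "bed", "lamp"}, {"bedroom", "bed", "window"}]
scg = build_scene_concept_graph(graphs)
print(scg[("bed", "bedroom")])  # -> 2
```

In the model described above, such co-occurrence statistics would then be used to expand the concepts detected in an image with semantically related ones (e.g., suggesting "bed" when "bedroom" is detected).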