Abstract. The automatic evaluation of image descriptions is an intricate task, and one that is crucial to the development and fine-grained analysis of captioning systems. Existing metrics for the automatic evaluation of image captioning systems fail to achieve a satisfactory level of correlation with human judgements at the sentence level. Moreover, these
metrics, unlike humans, tend to focus on specific aspects of quality, such as n-gram overlap or semantic content. In this paper, we present
the first learning-based metric to evaluate image captions. Our proposed
framework enables us to incorporate both lexical and semantic information into a single learned metric. This results in an evaluator that takes
into account various linguistic features to assess the caption quality. The
experiments we performed to assess the proposed metric show improvements over the state of the art in terms of correlation with human judgements and demonstrate its superior robustness to distractions.