Swell-and-Shrink: Decomposing Image Captioning by Transformation and
Summarization
Abstract
Image captioning is commonly treated as a problem analogous to machine translation. However,
this formulation often suffers from poor interpretability and from coarse
or even incorrect descriptions of regional details. Moreover, information abstraction and compression, which are essential characteristics of captioning, are
often overlooked and seldom discussed. To overcome these shortcomings, a swell-shrink method is
proposed to redefine image captioning as a compositional task consisting of two separate
modules: modality transformation and text compression. The former is designed to accurately
transform adequate visual content into textual form, while the latter consists of a hierarchical LSTM that emphasizes removing the
redundancy among multiple phrases and organizing the final abstractive caption. Additionally, the
order and quality of regions of interest and of modality processing are studied to give insight into
the influence of regional visual cues
on language formation. Experiments demonstrate the
effectiveness of the proposed method.