Abstract. Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial,
visual, and semantic world knowledge. Towards this goal, we present
the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it
while generating videos from novel captions. Craft explicitly predicts a
temporal layout of mentioned entities (characters and objects), retrieves
spatio-temporal entity segments from a video database and fuses them
to generate scene videos. Our contributions include sequential training
of components of Craft while jointly modeling layout and appearances,
and losses that encourage learning compositional representations for retrieval. We evaluate Craft on semantic fidelity to caption, composition
consistency, and visual quality. Craft outperforms direct pixel generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft
on Flintstones, a new richly annotated video-caption dataset with
over 25,000 videos. For a glimpse of videos generated by Craft, see
https://youtu.be/688Vv86n0z8.