Abstract
We propose an end-to-end network for the visual illustration of a sequence of sentences forming a story. At the core of our model is the ability to capture the inter-related nature of the sentences within a story, as well as the ability to learn coherence to support reference resolution. The framework takes the form of an encoder-decoder architecture, where sentences are encoded using a hierarchical two-level sentence-story GRU, combined with an encoding of coherence, and sequentially decoded, via a predicted feature representation, into a consistent illustrative image sequence. We optimize all parameters of our network in an end-to-end fashion with respect to an order-embedding loss (sketched below), encoding entailment between images and sentences. Experiments on the VIST storytelling dataset [9] highlight the importance of our algorithmic choices and the efficacy of our overall model.
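For concreteness, the order-embedding loss referred to above can be sketched following the standard order-embedding formulation; the symbols here (a margin $\alpha$, a set $\mathcal{P}$ of matching image-sentence embedding pairs and a set $\mathcal{N}$ of non-matching pairs) are illustrative notation rather than taken from this paper:
$$E(u, v) = \lVert \max(0,\, v - u) \rVert^2, \qquad \mathcal{L} = \sum_{(u,v)\in\mathcal{P}} E(u, v) \;+\; \sum_{(u',v')\in\mathcal{N}} \max\bigl(0,\ \alpha - E(u', v')\bigr),$$
where the more specific element of each pair (here the image embedding $u$) is encouraged to dominate the more general one (the sentence embedding $v$) coordinate-wise, so that matching pairs incur zero penalty and non-matching pairs are pushed to violate the entailment order by at least the margin $\alpha$.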