Abstract
The information needs of humans are essentially multimodal in nature, enabling maximum exploitation of situated context. We introduce a dataset for sequential procedural (how-to) text generation from images in the cooking domain.
The dataset consists of 16,441 cooking recipes
with 160,479 photos associated with different
steps. We set up a baseline motivated by the best performing model, in terms of human evaluation, for the Visual Storytelling (ViST) task. In addition, we introduce two models that incorporate the high-level structure learnt by a Finite State Machine (FSM) into the neural sequential generation process: (1) Scaffolding Structure in Decoder (SSiD) and (2) Scaffolding
Structure in Loss (SSiL). Our best performing model (SSiL) achieves a METEOR score
of 0.31, which is an improvement of 0.6 over
the baseline model. We also conducted a human evaluation of the generated grounded recipes, which revealed that 61% of the evaluators found our proposed SSiL model better than the baseline model in terms of overall recipe quality. Finally, we present an analysis of the output, highlighting key NLP issues as prospective directions.