Abstract
Representing procedure text such as recipes for cross-modal retrieval is inherently a difficult problem, let alone generating images from recipes for visualization. This
paper studies a new GAN variant, named Recipe Retrieval Generative Adversarial Network (R2GAN), to explore the feasibility of generating images from procedure text
for the retrieval problem. The motivation for using a GAN is twofold: learning compatible cross-modal features in an adversarial way, and explaining search results by showing
the images generated from recipes. The novelty of R2GAN
comes from its architecture design: a GAN with one
generator and dual discriminators is used, which makes
generating images from recipes feasible. Furthermore, empowered by the generated images, a two-level
ranking loss in both the embedding and image spaces is considered. These add-ons not only result in excellent retrieval
performance, but also generate close-to-realistic food images useful for explaining the ranking of recipes. On the Recipe1M
dataset, R2GAN demonstrates high scalability to data size,
outperforms all existing approaches, and generates images that make the search results intuitive for humans to interpret.