Abstract
In this paper, we introduce Recipe1M, a new large-scale,
structured corpus of over 1m cooking recipes and 800k food
images. As the largest publicly available collection of recipe
data, Recipe1M affords the ability to train high-capacity
models on aligned, multi-modal data. Using these data, we
train a neural network to find a joint embedding of recipes
and images that yields impressive results on an image-recipe
retrieval task. Additionally, we demonstrate that regularization
via the addition of a high-level classification objective
both improves retrieval performance to rival that of humans
and enables semantic vector arithmetic. We postulate that
these embeddings will provide a basis for further exploration
of the Recipe1M dataset and of food and cooking in general.
Code, data and models are publicly available.