Abstract. First-person vision is gaining interest as it offers a unique
viewpoint on people’s interaction with objects, their attention, and even
intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we
introduce EPIC-KITCHENS, a large-scale egocentric video benchmark
recorded by 32 participants in their native kitchen environments. Our
videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recordings were captured in 4 cities (in North America and Europe) by participants of 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M
frames, which we densely labelled for a total of 39.6K action segments
and 454.3K object bounding boxes. Our annotation is unique in that
we had the participants narrate their own videos (after recording), thus
reflecting true intention, and we crowd-sourced ground-truth labels based on
these narrations. We describe our object, action and anticipation challenges, and
evaluate several baselines over two test splits, seen and unseen kitchens.
Keywords: Egocentric Vision, Dataset, Benchmarks, First-Person Vision, Egocentric Object Detection, Action Recognition and Anticipation