Abstract. Multi-People Tracking in an open-world setting requires a
special effort in precise detection. Moreover, temporal continuity in the
detection phase becomes more important when scene clutter introduces
the challenging problem of occluded targets. To this end, we propose a
deep network architecture that jointly extracts people's body parts
and associates them across short temporal spans. Our model explicitly
deals with occluded body parts by hallucinating plausible solutions for
joints that are not visible. We propose a new end-to-end architecture composed
of four branches (visible heatmaps, occluded heatmaps, part affinity fields
and temporal affinity fields) fed by a time linker feature extractor. To
overcome the lack of surveillance data with tracking, body part and
occlusion annotations, we created a Computer Graphics dataset
for people tracking in urban scenarios by exploiting a photorealistic
videogame. It is, to date, the largest dataset (about 500,000 frames,
almost 10 million body poses) of human body parts for people tracking
in urban scenarios. Our architecture, trained on virtual data, exhibits
good generalization capabilities on public real-world tracking benchmarks as well,
when image resolution and sharpness are high enough, producing reliable
tracklets useful for further batch data association or re-id modules.
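
As a rough illustration of the four-branch design described above, the following Python sketch lays out the output channels each branch might produce. The joint and limb counts, the convention of two channels (x, y) per vector field, the assumption that temporal affinity fields link each joint to itself across frames, and all names (`BranchSpec`, `branch_layout`) are hypothetical and are not taken from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BranchSpec:
    """One output branch fed by the shared feature extractor."""
    name: str
    channels: int  # number of output maps produced by this branch


def branch_layout(num_joints: int, num_limbs: int) -> List[BranchSpec]:
    """Return an assumed channel layout for the four branches.

    Assumptions (not stated in the abstract):
    - one heatmap per joint, for visible and occluded joints separately;
    - part affinity fields use 2 channels (x, y) per limb;
    - temporal affinity fields use 2 channels (x, y) per joint.
    """
    return [
        BranchSpec("visible_heatmaps", num_joints),
        BranchSpec("occluded_heatmaps", num_joints),
        BranchSpec("part_affinity_fields", 2 * num_limbs),
        BranchSpec("temporal_affinity_fields", 2 * num_joints),
    ]


# Hypothetical skeleton: 14 joints connected by 13 limbs.
layout = branch_layout(num_joints=14, num_limbs=13)
total_channels = sum(branch.channels for branch in layout)
```

Under these assumed counts, the visible and occluded branches produce the same number of maps, which reflects the idea that occluded joints are predicted separately rather than merged with visible ones.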