Abstract
We propose a tracking framework that mediates grouping cues from two levels of tracking granularity, detection tracklets and point trajectories, for segmenting objects in crowded scenes. Detection tracklets capture objects when they are mostly visible. They may be sparse in time, may miss partially occluded or deformed objects, or may contain false positives. Point trajectories are dense in space and time. Their affinities integrate long-range motion and 3D disparity information, which is useful for segmentation. However, because trajectory affinities lack model knowledge, they may leak across similarly moving objects. We establish one trajectory graph and one detection tracklet graph, encoding grouping affinities within each space and associations across the two. Two-granularity tracking is cast as simultaneous detection tracklet classification and clustering (cl2) in the joint space of tracklets and trajectories. We solve cl2 by explicitly mediating contradictory affinities in the two graphs: detection tracklet classification modifies trajectory affinities to reflect object-specific dis-associations. Non-accidental grouping alignment between detection tracklets and trajectory clusters boosts or rejects the corresponding detection tracklets, changing their classification accordingly. We show that our model can track objects through sparse, inaccurate detections and persistent partial occlusions. In contrast to detection-based bounding box trackers, it adapts to the changing visibility masks of the targets by effectively switching between the two granularities according to object occlusions, deformations and background clutter.
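As a rough illustration of the mediation step described above, the sketch below shows one way tracklet classification could cut trajectory affinities that leak across objects. It is a toy example under stated assumptions, not the formulation used in this work: the function name `apply_disassociations`, the dense list-of-lists affinity matrix, the hard zeroing of affinities, and the example values are all hypothetical.

```python
def apply_disassociations(affinity, assoc, accepted):
    """Return a copy of `affinity` with cross-object links cut.

    affinity: square list of lists, affinity[i][j] between trajectories i and j.
    assoc:    assoc[i] is the id of the detection tracklet that trajectory i
              is associated with, or None if it has no association.
    accepted: set of tracklet ids currently classified as true positives.

    Assumption (hypothetical rule): two trajectories associated with
    different accepted tracklets should not be grouped, so their
    affinity is set to zero. All other affinities are left unchanged.
    """
    n = len(affinity)
    out = [row[:] for row in affinity]  # copy so the input is not modified
    for i in range(n):
        for j in range(n):
            ti, tj = assoc[i], assoc[j]
            if ti is None or tj is None or ti == tj:
                continue  # no contradiction between the two graphs
            if ti in accepted and tj in accepted:
                out[i][j] = 0.0  # object-specific dis-association
    return out


# Toy data: trajectories 0 and 1 lie on tracklet 0, trajectory 2 on
# tracklet 1, and trajectory 3 has no detection. Trajectories 0-2 move
# similarly, so their raw affinities are high and would leak across objects.
affinity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
assoc = [0, 0, 1, None]

mediated = apply_disassociations(affinity, assoc, accepted={0, 1})
unchanged = apply_disassociations(affinity, assoc, accepted={0})
```

In this toy setting the links between trajectory 2 and trajectories 0 and 1 are cut only when both tracklets are accepted; if tracklet 1 is rejected as a false positive, the trajectory affinities are left as they were.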