Abstract
We propose a novel crowd counting model that maps a
given crowd scene to its density. Crowd analysis is complicated by a myriad of factors, such as inter-occlusion between
people due to extreme crowding, high similarity of appearance between people and background elements, and large
variability of camera viewpoints. Current state-of-the-art
approaches tackle these factors by using multi-scale CNN
architectures, recurrent networks and late fusion of features
from multi-column CNNs with different receptive fields. We
propose a switching convolutional neural network that leverages the variation of crowd density within an image to improve
the accuracy and localization of the predicted crowd count.
Patches from a grid within a crowd scene are relayed to
independent CNN regressors based on the crowd count prediction quality of each CNN, as established during training. The
independent CNN regressors are designed to have different
receptive fields, and a switch classifier is trained to relay the
crowd scene patch to the best CNN regressor. We perform
extensive experiments on all major crowd counting datasets
and demonstrate better performance than current state-of-the-art methods. We provide interpretable representations of the multichotomy of the space of crowd scene patches
inferred from the switch. We observe that the switch relays an image patch to a particular CNN column based on
the density of the crowd.
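
The routing idea described above can be illustrated with a minimal sketch. It is not the actual implementation: the 3x3 grid size, the function names (split_into_grid, best_regressor_label, switched_count) and the toy switch and regressors are all assumptions introduced here for illustration, standing in for trained CNNs. The sketch assumes that each regressor returns a density map whose sum is the predicted count, and that the switch label for a training patch is the regressor with the smallest count error.

```python
import numpy as np

def split_into_grid(image, rows=3, cols=3):
    """Split a 2-D image into rows x cols non-overlapping patches.

    The 3x3 default is an assumption made for this sketch.
    """
    h, w = image.shape
    ys = np.linspace(0, h, rows + 1, dtype=int)
    xs = np.linspace(0, w, cols + 1, dtype=int)
    return [image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            for i in range(rows) for j in range(cols)]

def best_regressor_label(patch, true_count, regressors):
    """Training-time switch label: index of the regressor with the lowest count error."""
    errors = [abs(float(r(patch).sum()) - true_count) for r in regressors]
    return int(np.argmin(errors))

def switched_count(image, switch, regressors, rows=3, cols=3):
    """Relay each patch to the regressor chosen by the switch and sum the counts."""
    total = 0.0
    choices = []
    for patch in split_into_grid(image, rows, cols):
        k = switch(patch)                # index of the regressor the switch selects
        density = regressors[k](patch)   # predicted density map for this patch
        total += float(density.sum())    # count is the sum over the density map
        choices.append(k)
    return total, choices

# Hypothetical stand-ins so that the sketch runs; real ones would be trained CNNs.
def toy_switch(patch):
    return 0 if patch.mean() < 0.5 else 1

toy_regressors = [
    lambda p: p * 0.1,   # pretend column suited to sparse patches
    lambda p: p * 0.2,   # pretend column suited to dense patches
]
```

Calling switched_count(image, toy_switch, toy_regressors) returns the total count together with the per-patch column choices, which is the kind of information from which the switch behaviour can be inspected.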