Abstract. In this paper, we propose a novel encoder-decoder network,
called Scale Aggregation Network (SANet), for accurate and efficient
crowd counting. The encoder extracts multi-scale features with scale
aggregation modules and the decoder generates high-resolution density
maps by using a set of transposed convolutions. Moreover, we find that
most existing works use only Euclidean loss, which assumes independence
among pixels but ignores the local correlation in density maps.
Therefore, we propose a novel training loss that combines Euclidean loss
and local pattern consistency loss, which improves the performance of
the model in our experiments. In addition, we use normalization layers to
ease the training process and apply a patch-based test scheme to reduce
the impact of the statistic shift problem. To demonstrate the effectiveness of
the proposed method, we conduct extensive experiments on four major
crowd counting datasets, and our method achieves superior performance
to state-of-the-art methods with far fewer parameters.
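
To make the combined training loss concrete, the following is a minimal sketch rather than the implementation used in our experiments. It assumes that the local pattern consistency term is an SSIM-style similarity computed over small uniform windows; the abstract does not fix this form. The function names, the window size k, the constants c1 and c2, and the weight alpha are all hypothetical choices made for illustration.

```python
import numpy as np


def local_means(x, k):
    """Mean of every k x k window (valid positions only), via a summed-area table."""
    s = np.cumsum(np.cumsum(np.pad(x, ((1, 0), (1, 0))), axis=0), axis=1)
    return (s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]) / (k * k)


def local_consistency_loss(pred, gt, k=3, c1=1e-4, c2=9e-4):
    """1 - mean local SSIM; an assumed stand-in for the local pattern consistency loss."""
    mu_p, mu_g = local_means(pred, k), local_means(gt, k)
    var_p = local_means(pred * pred, k) - mu_p ** 2
    var_g = local_means(gt * gt, k) - mu_g ** 2
    cov = local_means(pred * gt, k) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return 1.0 - ssim.mean()


def combined_loss(pred, gt, alpha=1e-3):
    """Per-pixel Euclidean loss plus a weighted local term (alpha is hypothetical)."""
    euclidean = np.mean((pred - gt) ** 2)
    return euclidean + alpha * local_consistency_loss(pred, gt)
```

The Euclidean term compares each pixel in isolation, while the local term compares window statistics (mean, variance, covariance), so it penalizes predictions whose neighbourhood structure differs from the ground truth even when the per-pixel error is small.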