Abstract
Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such
videos have low spatial and temporal resolution, heavy occlusion, and large perspective. To understand traffic
density in depth, we explore both optimization-based and deep-learning-based methods. To avoid individual vehicle detection or
tracking, both methods map dense image features to
vehicle density, one based on rank-constrained regression
and the other based on fully convolutional networks (FCN).
The regression-based method learns different weights for
different blocks of the image to embed road geometry and
significantly reduce the error induced by camera perspective.
The FCN-based method jointly estimates vehicle density and
vehicle count with a residual learning framework
to perform end-to-end dense prediction, allowing arbitrary
image resolution, and adapting to different vehicle scales
and perspectives. We analyze and compare both methods,
and use insights from the optimization-based method to improve
the deep model. Since existing datasets do not cover all the
challenges in our work, we collected and labelled a large-scale
traffic video dataset, containing 60 million frames
from 212 webcams. Both methods are extensively evaluated
and compared on different counting tasks and datasets.
The FCN-based method significantly reduces the mean absolute
error (MAE) from 10.99 to 5.31 on the public dataset TRANCOS
compared with the state-of-the-art baseline.
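
As a reading aid, the relation between a density map, a vehicle count, and the MAE metric can be sketched as follows. This is a minimal illustration and not the implementation evaluated in this work: it assumes the common density-estimation convention that a frame's count is the sum of its per-pixel density values, and the function names and toy values are hypothetical.

```python
def count_from_density(density_map):
    """Return the vehicle count implied by a 2-D density map.

    Assumption: the count is the sum of all per-pixel density values,
    so no individual vehicle is detected or tracked.
    """
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    """Return the mean absolute error between two lists of per-frame counts."""
    if not predicted_counts or len(predicted_counts) != len(true_counts):
        raise ValueError("count lists must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(predicted_counts)


# Toy data (hypothetical): two tiny 2x2 density maps and their true counts.
frames = [
    [[0.0, 0.5], [0.5, 1.0]],
    [[0.25, 0.25], [0.25, 0.25]],
]
true_counts = [3.0, 1.0]

predicted_counts = [count_from_density(f) for f in frames]
mae = mean_absolute_error(predicted_counts, true_counts)
```

Here the MAE is averaged over frames, which is how a figure such as the TRANCOS result above is usually read.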