Abstract
We are dealing with the problem of fine-grained vehicle make & model recognition and verification. Our contribution is showing that extracting additional data from the video stream – besides the vehicle image itself – and feeding it into the deep convolutional neural network boosts the recognition performance considerably. This additional information includes: the 3D vehicle bounding box used for “unpacking” the vehicle image, its rasterized low-resolution shape, and information about the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26 % (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208 % (0.378 to 0.785) compared to the baseline pure CNN without any input modifications. Also, the pure baseline CNN outperforms the recent state-of-the-art solution by 0.081. We provide an annotated set “BoxCars” of surveillance vehicle images augmented by various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.