Abstract
Current object detection approaches predict bounding
boxes that provide little instance-specific information beyond location, scale and aspect ratio. In this work, we
propose to regress directly to objects’ shapes in addition to
their bounding boxes and categories. It is crucial to find an
appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and
occlusion. To achieve this, we use a denoising convolutional
auto-encoder to learn a low-dimensional shape embedding
space. We place the decoder network after a fast end-to-end deep convolutional network that is trained to regress
directly to the shape vectors provided by the auto-encoder.
This yields what to the best of our knowledge is the first
real-time shape prediction network, running at 35 FPS on a
high-end desktop. With higher-order shape reasoning well-integrated into the network pipeline, the network shows the
useful practical quality of generalising to unseen categories
that are similar to the ones in the training set, something
that most existing approaches fail to handle.