Abstract. We study the first-order scattering transform as a candidate for reducing the signal processed by a convolutional neural network
(CNN). We present theoretical and empirical evidence that, for
natural images and sufficiently small translation invariance, this transform preserves most of the signal information needed for classification
while substantially reducing the spatial resolution and total signal size.
We demonstrate that a CNN cascaded with this representation performs on par with ImageNet classification models commonly used in
downstream tasks, such as ResNet-50. We subsequently apply our
trained hybrid ImageNet model as the base model of a detection system,
which typically has larger image inputs. On the Pascal VOC and COCO detection tasks, we demonstrate improvements in inference speed and
training memory consumption compared to models trained directly on
the input image.