Abstract. We introduce a method for improving convolutional neural
networks (CNNs) for scene classification. We present a hierarchy of specialist networks, which disentangles the intra-class variation and inter-class similarity in a coarse-to-fine manner. Our key insight is that each
subset within a class is often associated with different types of inter-class
similarity. This suggests that existing network-of-experts approaches that
organize classes into coarse categories are suboptimal. In contrast, we
group images based on high-level appearance features rather than their
class membership and dedicate a specialist model to each group. In addition, we propose an alternating architecture with a global ordered representation and
a global orderless representation to account for both the coarse layout of
the scene and the transient objects. We demonstrate that this architecture outperforms both a single type of representation and the
fused features. We also introduce a mini-batch soft k-means algorithm that allows
end-to-end fine-tuning, as well as a novel routing function for assigning
images to specialists. Experimental results show that the proposed approach achieves a significant improvement over baselines, including
existing tree-structured CNNs with class-based grouping.
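
To make the grouping step concrete, the following is a minimal sketch of a generic mini-batch soft k-means update over image features, with a simple nearest-center rule standing in for the routing function. It is not the formulation used in this work: the function names, the `temperature` parameter, the running-count learning rate, and the argmax routing are all assumptions chosen for illustration.

```python
import numpy as np

def soft_assign(features, centers, temperature=1.0):
    """Soft responsibilities: softmax over negative squared distances (assumed form)."""
    # Squared Euclidean distance from every feature to every center, shape (n, k).
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def minibatch_soft_kmeans_step(features, centers, counts, temperature=1.0):
    """One mini-batch update: move each center toward its weighted batch mean."""
    resp = soft_assign(features, centers, temperature)      # (n, k)
    mass = resp.sum(axis=0)                                 # soft count per center
    counts = counts + mass                                  # running soft counts
    batch_mean = (resp.T @ features) / np.maximum(mass, 1e-12)[:, None]
    # Per-center step size shrinks as a center accumulates more mass.
    lr = (mass / np.maximum(counts, 1e-12))[:, None]
    new_centers = (1.0 - lr) * centers + lr * batch_mean
    return new_centers, counts

def route(features, centers, temperature=1.0):
    """Hypothetical routing rule: send each image to its most responsible group."""
    return soft_assign(features, centers, temperature).argmax(axis=1)
```

In this sketch, `features` would be the high-level appearance features of a mini-batch of images, and each center would correspond to one specialist. Because the assignments are a smooth function of the features, the same computation can in principle be expressed in an automatic-differentiation framework to allow end-to-end fine-tuning.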