Abstract
We address the problem of describing people based on fine-grained clothing attributes. This is an important problem for many practical applications, such as identifying target suspects or finding missing people based on detailed clothing descriptions in surveillance videos or consumer photos. We approach this problem by first mining clothing images with fine-grained attribute labels from online shopping stores. A large-scale dataset is built with about one million images and fine-detailed attribute sub-categories, such as various shades of color (e.g., watermelon red, rosy red, purplish red), clothing types (e.g., down jacket, denim jacket), and patterns (e.g., thin horizontal stripes, houndstooth). As these images are taken under ideal pose/lighting/background conditions, it is unreliable to use them directly as training data for attribute prediction in the domain of unconstrained images captured, for example, by mobile phones or surveillance cameras. To bridge this gap, we propose a novel double-path deep domain adaptation network that models data from the two domains jointly. Several alignment cost layers placed in between the two columns ensure the consistency of the two domains' features and the feasibility of predicting unseen attribute categories in one of the domains. Finally, to achieve a working system with automatic human body alignment, we train an enhanced R-CNN-based detector to localize human bodies in images. Our extensive experimental evaluation demonstrates the effectiveness of the proposed approach for describing people based on fine-grained clothing attributes.