Abstract. While machine learning approaches to visual emotion recognition offer great promise, current methods train and test models on small-scale datasets covering a limited set of visual emotion concepts. Our analysis identifies an important but long-overlooked issue with existing visual emotion benchmarks: dataset bias. We design a series of tests to show and measure how such dataset biases obstruct learning a generalizable emotion recognition model. Based on our
analysis, we propose a webly supervised approach by leveraging a large
quantity of stock image data. Our approach uses a simple yet effective
curriculum guided training strategy for learning discriminative emotion
features. We discover that models trained on our large-scale stock image dataset generalize significantly better than models trained on existing datasets, without the manual collection of even a single label.
Moreover, the visual representations learned using our approach hold substantial promise across a variety of tasks on different image and video datasets.