Abstract
Despite recent progress in computer vision, fine-grained interpretation of satellite images remains
challenging because of a lack of labeled training
data. To overcome this limitation, we construct
a novel dataset called WikiSatNet by pairing georeferenced Wikipedia articles with satellite imagery
of their corresponding locations. We then propose
two strategies to learn representations of satellite
images by predicting properties of the corresponding articles from the images. Leveraging this new
multi-modal dataset, we can drastically reduce the
quantity of human-annotated labels and time required for downstream tasks.
On the recently released fMoW dataset, our pre-training strategies
can boost the performance of a model pre-trained
on ImageNet by up to 4.5% in F1 score.