Abstract
Each corner of the inhabited world is imaged from multiple viewpoints with increasing frequency. Online map services like Google Maps or Here Maps provide direct access to huge amounts of densely sampled, georeferenced images from street view and aerial perspective. There is an opportunity to design computer vision systems that will help us search, catalog and monitor public infrastructure, buildings and artifacts. We explore the architecture and feasibility of such a system. The main technical challenge is combining test time information from multiple views of each geographic location (e.g., aerial and street views). We implement two modules: det2geo, which detects the set of locations of objects belonging to a given category, and geo2cat, which computes the fine-grained category of the object at a given location. We introduce a solution that adapts state-of-the-art CNN-based object detectors and classifiers. We test our method on “Pasadena Urban Trees”, a new dataset of 80,000 trees with geographic and species annotations, and show that combining multiple views significantly improves both tree detection and tree species classification, rivaling human performance.