Pixels to Places: Improving Zero-Shot Image Geolocalization using Prior Knowledge
Author(s)
Cha, Miriam; Borg, Trent
Abstract
The ability to predict the geographic origin of a photo is critical for open-source investigation applications. However, image geolocalization is highly challenging due to the vast diversity of images captured worldwide. While vision transformer-based approaches have demonstrated success, even outperforming grandmasters in geolocation games such as GeoGuessr, their performance does not generalize well to unseen locations. Prior methods rely solely on visual cues, neglecting the broader contextual knowledge that image analysts typically employ. To bridge this gap, our research integrates the contextual understanding of geographic regions that imagery analysts possess into the geolocalization model. Specifically, we develop a variant of StreetCLIP, which embeds CLIP within geolocalization tasks and facilitates the incorporation of user-supplied prior knowledge such as continental or national boundaries. Our results on the IM2GPS3K benchmark dataset demonstrate a 10.66% improvement in regional prediction (within 200 km) and a 15.27% improvement in country-level prediction (within 750 km) over baseline models. These results suggest that human-provided supervision can enhance image geolocalization accuracy, highlighting the potential of interactive systems in which human expertise and AI work collaboratively to refine predictions.
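As a concrete illustration of the mechanism the abstract describes, the sketch below performs zero-shot geolocalization with a CLIP-style model and applies a user-supplied prior by pruning the candidate label space before scoring. This is an assumption-laden sketch, not the authors' implementation: the candidate gazetteer, the prompt template, and the geolocate helper are all hypothetical, and the geolocal/StreetCLIP checkpoint name refers to the publicly released base StreetCLIP model on the Hugging Face Hub rather than the variant developed in this report.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical candidate gazetteer; a real system would draw a much larger
# candidate set from a geographic database.
CANDIDATES = [
    ("Paris", "France"),
    ("Marseille", "France"),
    ("Tokyo", "Japan"),
    ("Nairobi", "Kenya"),
    ("Lima", "Peru"),
]

MODEL_NAME = "geolocal/StreetCLIP"  # public base checkpoint, not the paper's variant


def geolocate(image_path: str, prior_countries: set[str] | None = None):
    """Rank candidate locations for an image, optionally under a prior.

    `prior_countries` stands in for analyst-supplied knowledge (e.g. national
    boundaries): candidates outside the prior are pruned before scoring.
    """
    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)

    # Apply the prior by restricting the zero-shot label space.
    candidates = [
        (city, country)
        for city, country in CANDIDATES
        if prior_countries is None or country in prior_countries
    ]
    prompts = [f"A street-level photo taken in {city}, {country}."
               for city, country in candidates]

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image holds the similarity of the image to each prompt.
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    best = int(probs.argmax())
    return candidates[best], float(probs[best])


# Example: an analyst who knows the photo was taken in France restricts
# the candidate set accordingly.
# location, confidence = geolocate("query.jpg", prior_countries={"France"})

Pruning the candidate set is the simplest way to encode a hard prior; a softer alternative, closer in spirit to interactive refinement, would keep all candidates and reweight their scores by the analyst's confidence in each region.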
Index Terms—image geolocalization, CLIP, human-machine teaming, vision transformers
Date issued
2025-09-10
Department
Lincoln Laboratory
Keywords
MIT Lincoln Laboratory, LLSC, Convolutional Neural Networks