Pixels to Places: Improving Zero-Shot Image Geolocalization using Prior Knowledge
Author(s)
Cha, Miriam; Borg, Trent
Abstract
The ability to predict the geographic origin of a photo is critical for open-source investigation applications. However, image geolocalization is highly challenging due to the vast diversity of images captured worldwide. While vision transformer-based approaches have demonstrated success, even outperforming grandmasters in geolocation games such as GeoGuessr, their performance does not generalize well to unseen locations. Prior methods rely solely on visual cues, neglecting the broader contextual knowledge that image analysts typically employ. To bridge this gap, our research integrates the contextual understanding of geographic regions that imagery analysts possess into the geolocalization model. Specifically, we develop a variant of StreetCLIP, which embeds CLIP within geolocalization tasks and facilitates the incorporation of user-supplied prior knowledge such as continental or national boundaries. Our results on the IM2GPS3K benchmark dataset demonstrate a 10.66% improvement in regional prediction (within 200 km) and a 15.27% improvement in country-level prediction (within 750 km) over baseline models. These results suggest that human-provided supervision can enhance image geolocalization accuracy, highlighting the potential of interactive systems in which human expertise and AI work collaboratively to refine predictions.
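As a concrete illustration of the mechanism the abstract describes, the sketch below performs zero-shot geolocalization with a CLIP-style model and applies a user-supplied prior by pruning the candidate label space before scoring. This is an assumption-laden sketch, not the authors' implementation: the candidate gazetteer, the prompt template, and the geolocate helper are all hypothetical, and the geolocal/StreetCLIP checkpoint name refers to the publicly released base StreetCLIP model on the Hugging Face Hub rather than the variant developed in this report.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical candidate gazetteer; a real system would draw a much larger
# candidate set from a geographic database.
CANDIDATES = [
    ("Paris", "France"),
    ("Marseille", "France"),
    ("Tokyo", "Japan"),
    ("Nairobi", "Kenya"),
    ("Lima", "Peru"),
]

MODEL_NAME = "geolocal/StreetCLIP"  # public base checkpoint, not the paper's variant


def geolocate(image_path: str, prior_countries: set[str] | None = None):
    """Rank candidate locations for an image, optionally under a prior.

    `prior_countries` stands in for analyst-supplied knowledge (e.g. national
    boundaries): candidates outside the prior are pruned before scoring.
    """
    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)

    # Apply the prior by restricting the zero-shot label space.
    candidates = [
        (city, country)
        for city, country in CANDIDATES
        if prior_countries is None or country in prior_countries
    ]
    prompts = [f"A street-level photo taken in {city}, {country}."
               for city, country in candidates]

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image holds the similarity of the image to each prompt.
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    best = int(probs.argmax())
    return candidates[best], float(probs[best])


# Example: an analyst who knows the photo was taken in France restricts
# the candidate set accordingly.
# location, confidence = geolocate("query.jpg", prior_countries={"France"})

Pruning the candidate set is the simplest way to encode a hard prior; a softer alternative, closer in spirit to interactive refinement, would keep all candidates and reweight their scores by the analyst's confidence in each region.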
Index Terms—image geolocalization, CLIP, human-machine teaming, vision transformers
Date issued
2025-09-10
Department
Lincoln Laboratory
Keywords
MIT Lincoln Laboratory, LLSC, Convolutional Neural Networks