Research Project  ·  Visual Place Recognition

CLINO-VPR: CLIP-Guided Semantic Gating
for Visual Place Recognition

A dual-encoder architecture that uses CLIP's semantic understanding to gate DINOv2 spatial features — filtering out visual clutter (sky, vegetation) while preserving discriminative structure (buildings, textures) before SALAD aggregation.

Overview

Visual Place Recognition (VPR) in challenging urban environments suffers from visual noise (dynamic objects, seasonal change, sky regions) that degrades descriptor quality. We propose CLINO-VPR, a method that combines DINOv2's rich spatial features with CLIP's high-level semantic understanding via a learned Semantic Gate. The gate produces a soft spatial attention mask conditioned on the global CLIP embedding, which modulates DINOv2 features to suppress uninformative regions before aggregation with SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors). Trained on GSV-Cities with the Multi-Similarity loss, CLINO-VPR achieves R@1 = 37.37% at a 20 m threshold on the NYC benchmark at 336×336 resolution, demonstrating the effectiveness of semantics-driven spatial gating for place recognition.

Architecture

[Figure 1, text form]
DINO branch:  input image (336×336×3) → DINOv2-L backbone (patch size 14) → (B, 1024, 24, 24) → channel adapter: 1×1 conv 1024→512 · BN · ReLU → (B, 512, 24, 24)
CLIP branch:  input image → CLIP ViT-L/14-336 (frozen) → global semantic token (B, 768) → Semantic Gate: Linear(768→512) · Sigmoid · reshape → gate mask (B, 1, 24, 24)
Fusion:       gated features = DINO features ⊗ gate mask → (B, 512, 24, 24) → SALAD pooling (64 clusters) → L2-normalised 512-d descriptor · cosine retrieval

Figure 1. CLINO-VPR dual-encoder pipeline. DINOv2-L extracts high-resolution spatial features; CLIP ViT-L/14-336 produces a global semantic embedding that drives a learned gate. The gate suppresses non-discriminative spatial regions before SALAD aggregation yields the final 512-d place descriptor.

Key Components

DINOv2-L Backbone

Produces dense spatial feature maps (B, 1024, 24, 24) that capture fine-grained textures, edge patterns, and building facades at 336×336 input resolution.

CLIP ViT-L/14-336

Frozen image-text encoder supplying a global 768-d semantic embedding that encodes what the scene contains — distinguishing buildings from sky, foliage, or dynamic objects.

Semantic Gate

A lightweight MLP projects the CLIP vector to a sigmoid spatial mask (B, 1, 24, 24). Element-wise multiplication with DINO features suppresses clutter, emphasising stable structures.
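The gating step can be sketched in PyTorch. Note one illustrative assumption: the figure lists Linear(768→512), but since the mask is (B, 1, 24, 24) this sketch projects the CLIP vector directly to one logit per grid cell (24×24 = 576) so the reshape is well-defined; module and argument names are not from the released code.

```python
import torch
import torch.nn as nn

class SemanticGate(nn.Module):
    """Sketch of the semantic gate: maps a global CLIP embedding to a
    per-location sigmoid mask over the 24x24 feature grid (illustrative,
    not the released implementation)."""
    def __init__(self, clip_dim=768, grid=24):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(clip_dim, grid * grid)  # CLIP vector -> spatial logits

    def forward(self, feats, clip_emb):
        # feats: (B, C, 24, 24) adapted DINOv2 features; clip_emb: (B, 768)
        mask = torch.sigmoid(self.proj(clip_emb))      # (B, 576), values in (0, 1)
        mask = mask.view(-1, 1, self.grid, self.grid)  # (B, 1, 24, 24)
        return feats * mask                            # broadcast over channels
```

Because the mask is multiplicative and bounded in (0, 1), regions the gate deems uninformative are attenuated rather than hard-masked, keeping the operation fully differentiable.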

Channel Adapter

1×1 convolution reduces DINOv2's 1024-channel output to 512 channels before gating, balancing descriptor dimensionality with computational cost.
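As described, the adapter is a small per-location projection; a minimal sketch:

```python
import torch
import torch.nn as nn

# Minimal sketch of the channel adapter: 1x1 convolution + BatchNorm + ReLU,
# reducing DINOv2-L's 1024 channels to 512 before gating.
adapter = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=1),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)

feats = torch.randn(2, 1024, 24, 24)   # DINOv2-L patch features
out = adapter(feats)                   # (2, 512, 24, 24), ready for gating
```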

SALAD Pooling

SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) with 64 clusters. Aggregates gated spatial features into a compact, retrieval-ready 512-d global descriptor via differentiable soft assignment.
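The cluster-based aggregation idea can be illustrated with a simplified stand-in. SALAD proper computes the assignment with Sinkhorn (optimal-transport) iterations; this sketch uses a plain softmax assignment instead, and the class name and final projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftClusterPool(nn.Module):
    """Simplified stand-in for SALAD pooling: soft-assign each spatial
    location to one of 64 clusters, aggregate features per cluster, then
    project to a 512-d L2-normalised descriptor. (SALAD itself uses a
    Sinkhorn-based assignment; softmax is used here for brevity.)"""
    def __init__(self, dim=512, clusters=64):
        super().__init__()
        self.assign = nn.Conv2d(dim, clusters, 1)   # per-location cluster logits
        self.proj = nn.Linear(dim * clusters, dim)  # compress to 512-d

    def forward(self, x):                           # x: (B, 512, H, W)
        a = self.assign(x).flatten(2).softmax(dim=1)  # (B, K, HW) soft assignment
        f = x.flatten(2)                              # (B, C, HW)
        v = torch.einsum('bkn,bcn->bkc', a, f)        # per-cluster aggregates
        d = self.proj(v.flatten(1))                   # (B, 512)
        return F.normalize(d, dim=-1)                 # unit-norm descriptor
```

The unit-norm output makes retrieval a plain cosine-similarity (dot-product) search over the database descriptors.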

Results

Evaluated on the NYC (New York City) VPR benchmark with a 20 m geo-distance threshold. Recall@K reports the fraction of queries for which the top-K retrieved candidates include at least one true positive within the threshold.
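This metric can be computed directly from a query-database similarity matrix plus a geo-distance positive mask; a minimal sketch (function name illustrative):

```python
import numpy as np

def recall_at_k(sim, is_positive, ks=(1, 5, 10)):
    """Recall@K for retrieval: fraction of queries whose top-K database
    candidates (ranked by descending similarity) contain >= 1 true positive.
    sim: (Q, D) query-database cosine similarities.
    is_positive: (Q, D) boolean, True where the db item lies within the
    geo-distance threshold (here, 20 m) of the query."""
    order = np.argsort(-sim, axis=1)                     # best match first
    hits = np.take_along_axis(is_positive, order, axis=1)
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```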

Method              Backbone          Aggregation   Input size   R@1     R@5    R@10
NetVLAD             VGG-16            VLAD          224          22.4    38.1   45.3
CosPlace            ResNet-50         GeM           512          28.6    47.2   55.8
MixVPR              ResNet-50         MLP           320          30.1    50.4   59.0
SALAD (DINOv2-L)    DINOv2-L          SALAD         322          35.2    55.8   63.4
CLINO-VPR (Ours)    DINOv2-L + CLIP   SALAD         336          37.37   —      —

* Baseline numbers are approximate reference values included for context. Full evaluation of CLINO-VPR (R@5, R@10) is ongoing; — = not yet measured.

R@1 (NYC · 20 m): 37.37% (+2.17 pp vs the SALAD baseline)
Descriptor dimension: 512-d · L2-normalised
Input resolution: 336×336 · CLIP-optimal size
SALAD clusters: 64 · soft assignment

Training Details

Dataset: GSV-Cities (street-view pairs)
Loss function: Multi-Similarity loss on L2-normalised descriptors
Epochs: 5
CLIP weights: frozen (ViT-L/14-336)
DINOv2 backbone: fine-tuned (Large variant)
Framework: PyTorch (HuggingFace / OpenCLIP)
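The Multi-Similarity loss (Wang et al., 2019) is available in libraries such as pytorch-metric-learning; for clarity, here is a dependency-free sketch that omits the paper's online pair mining (function name and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Minimal Multi-Similarity loss sketch (pair mining omitted).
    emb: (N, D) descriptors; labels: (N,) place IDs; same label = positive pair."""
    emb = F.normalize(emb, dim=-1)               # cosine similarities below
    sim = emb @ emb.t()                          # (N, N)
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos, neg = same & ~eye, ~same
    # soft-max-style weighting: hard positives (low sim) and hard
    # negatives (high sim) dominate their respective sums
    pos_term = torch.log1p((torch.exp(-alpha * (sim - lam)) * pos).sum(1)) / alpha
    neg_term = torch.log1p((torch.exp(beta * (sim - lam)) * neg).sum(1)) / beta
    return (pos_term + neg_term).mean()
```

Pairing this loss with L2-normalised descriptors keeps training and retrieval consistent: both operate on cosine similarity.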