Research Project  ·  Visual Place Recognition

CLINO-VPR: CLIP-Guided Semantic Gating
for Visual Place Recognition

A dual-encoder architecture that uses CLIP's semantic understanding to gate DINOv2 spatial features — filtering out visual clutter (sky, vegetation) while preserving discriminative structure (buildings, textures) before SALAD aggregation.

Overview

Visual Place Recognition (VPR) in challenging urban environments suffers from visual noise (dynamic objects, seasonal change, sky regions) that degrades descriptor quality. We propose CLINO-VPR, a method that combines DINOv2's rich spatial features with CLIP's high-level semantic understanding via a learned Semantic Gate. The gate produces a soft spatial attention mask conditioned on the global CLIP embedding, which modulates DINOv2 features to suppress uninformative regions before aggregation with SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors). Trained on GSV-Cities with the Multi-Similarity loss, CLINO-VPR achieves R@1 = 37.37% at a 20 m threshold on the NYC benchmark at 336×336 resolution, demonstrating the effectiveness of semantics-driven spatial gating for place recognition.

Architecture

[Figure 1, text form]
DINO branch:  input image (336×336×3) → DINOv2-L backbone (patch size 14) → (B, 1024, 24, 24) → channel adapter: 1×1 conv 1024→512 · BN · ReLU → (B, 512, 24, 24)
CLIP branch:  input image → CLIP ViT-L/14-336 (frozen) → global semantic token (B, 768) → Semantic Gate: Linear(768→512) · Sigmoid · reshape → gate mask (B, 1, 24, 24)
Fusion:       gated features = DINO features ⊗ gate mask → (B, 512, 24, 24) → SALAD pooling (64 clusters) → L2-normalised 512-d descriptor · cosine retrieval

Figure 1. CLINO-VPR dual-encoder pipeline. DINOv2-L extracts high-resolution spatial features; CLIP ViT-L/14-336 produces a global semantic embedding that drives a learned gate. The gate suppresses non-discriminative spatial regions before SALAD aggregation yields the final 512-d place descriptor.

Key Components

DINOv2-L Backbone

Produces dense spatial feature maps (B, 1024, 24, 24) that capture fine-grained textures, edge patterns, and building facades at 336×336 input resolution.

CLIP ViT-L/14-336

Frozen image-text encoder supplying a global 768-d semantic embedding that encodes what the scene contains — distinguishing buildings from sky, foliage, or dynamic objects.

Semantic Gate

A lightweight MLP projects the CLIP vector to a sigmoid spatial mask (B, 1, 24, 24). Element-wise multiplication with DINO features suppresses clutter, emphasising stable structures.
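The gating step can be sketched in PyTorch. Note one illustrative assumption: the figure lists Linear(768→512), but since the mask is (B, 1, 24, 24) this sketch projects the CLIP vector directly to one logit per grid cell (24×24 = 576) so the reshape is well-defined; module and argument names are not from the released code.

```python
import torch
import torch.nn as nn

class SemanticGate(nn.Module):
    """Sketch of the semantic gate: maps a global CLIP embedding to a
    per-location sigmoid mask over the 24x24 feature grid (illustrative,
    not the released implementation)."""
    def __init__(self, clip_dim=768, grid=24):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(clip_dim, grid * grid)  # CLIP vector -> spatial logits

    def forward(self, feats, clip_emb):
        # feats: (B, C, 24, 24) adapted DINOv2 features; clip_emb: (B, 768)
        mask = torch.sigmoid(self.proj(clip_emb))      # (B, 576), values in (0, 1)
        mask = mask.view(-1, 1, self.grid, self.grid)  # (B, 1, 24, 24)
        return feats * mask                            # broadcast over channels
```

Because the mask is multiplicative and bounded in (0, 1), regions the gate deems uninformative are attenuated rather than hard-masked, keeping the operation fully differentiable.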

Channel Adapter

1×1 convolution reduces DINOv2's 1024-channel output to 512 channels before gating, balancing descriptor dimensionality with computational cost.
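As described, the adapter is a small per-location projection; a minimal sketch:

```python
import torch
import torch.nn as nn

# Minimal sketch of the channel adapter: 1x1 convolution + BatchNorm + ReLU,
# reducing DINOv2-L's 1024 channels to 512 before gating.
adapter = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=1),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)

feats = torch.randn(2, 1024, 24, 24)   # DINOv2-L patch features
out = adapter(feats)                   # (2, 512, 24, 24), ready for gating
```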

SALAD Pooling

SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) with 64 clusters. Aggregates gated spatial features into a compact, retrieval-ready 512-d global descriptor via differentiable soft assignment.
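The cluster-based aggregation idea can be illustrated with a simplified stand-in. SALAD proper computes the assignment with Sinkhorn (optimal-transport) iterations; this sketch uses a plain softmax assignment instead, and the class name and final projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftClusterPool(nn.Module):
    """Simplified stand-in for SALAD pooling: soft-assign each spatial
    location to one of 64 clusters, aggregate features per cluster, then
    project to a 512-d L2-normalised descriptor. (SALAD itself uses a
    Sinkhorn-based assignment; softmax is used here for brevity.)"""
    def __init__(self, dim=512, clusters=64):
        super().__init__()
        self.assign = nn.Conv2d(dim, clusters, 1)   # per-location cluster logits
        self.proj = nn.Linear(dim * clusters, dim)  # compress to 512-d

    def forward(self, x):                           # x: (B, 512, H, W)
        a = self.assign(x).flatten(2).softmax(dim=1)  # (B, K, HW) soft assignment
        f = x.flatten(2)                              # (B, C, HW)
        v = torch.einsum('bkn,bcn->bkc', a, f)        # per-cluster aggregates
        d = self.proj(v.flatten(1))                   # (B, 512)
        return F.normalize(d, dim=-1)                 # unit-norm descriptor
```

The unit-norm output makes retrieval a plain cosine-similarity (dot-product) search over the database descriptors.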

Results

Evaluated on the NYC (New York City) VPR benchmark with a 20 m geo-distance threshold. Recall@K reports the fraction of queries for which the top-K retrieved candidates include at least one true positive within the threshold.
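This metric can be computed directly from a query-database similarity matrix plus a geo-distance positive mask; a minimal sketch (function name illustrative):

```python
import numpy as np

def recall_at_k(sim, is_positive, ks=(1, 5, 10)):
    """Recall@K for retrieval: fraction of queries whose top-K database
    candidates (ranked by descending similarity) contain >= 1 true positive.
    sim: (Q, D) query-database cosine similarities.
    is_positive: (Q, D) boolean, True where the db item lies within the
    geo-distance threshold (here, 20 m) of the query."""
    order = np.argsort(-sim, axis=1)                     # best match first
    hits = np.take_along_axis(is_positive, order, axis=1)
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```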

Method              Backbone          Aggregation   Input size   R@1     R@5    R@10
NetVLAD             VGG-16            VLAD          224          22.4    38.1   45.3
CosPlace            ResNet-50         GeM           512          28.6    47.2   55.8
MixVPR              ResNet-50         MLP           320          30.1    50.4   59.0
SALAD (DINOv2-L)    DINOv2-L          SALAD         322          35.2    55.8   63.4
CLINO-VPR (Ours)    DINOv2-L + CLIP   SALAD         336          37.37   —      —

* Baseline numbers are approximate reference values included for context. Full evaluation of CLINO-VPR (R@5, R@10) is ongoing; — = not yet measured.

R@1 (NYC · 20 m): 37.37% (+2.17 pp vs the SALAD baseline)
Descriptor dimension: 512-d · L2-normalised
Input resolution: 336×336 · CLIP-optimal size
SALAD clusters: 64 · soft assignment

Training Details

Dataset: GSV-Cities (street-view pairs)
Loss function: Multi-Similarity loss on L2-normalised descriptors
Epochs: 5
CLIP weights: frozen (ViT-L/14-336)
DINOv2 backbone: fine-tuned (Large variant)
Framework: PyTorch (HuggingFace / OpenCLIP)
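The Multi-Similarity loss (Wang et al., 2019) is available in libraries such as pytorch-metric-learning; for clarity, here is a dependency-free sketch that omits the paper's online pair mining (function name and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Minimal Multi-Similarity loss sketch (pair mining omitted).
    emb: (N, D) descriptors; labels: (N,) place IDs; same label = positive pair."""
    emb = F.normalize(emb, dim=-1)               # cosine similarities below
    sim = emb @ emb.t()                          # (N, N)
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos, neg = same & ~eye, ~same
    # soft-max-style weighting: hard positives (low sim) and hard
    # negatives (high sim) dominate their respective sums
    pos_term = torch.log1p((torch.exp(-alpha * (sim - lam)) * pos).sum(1)) / alpha
    neg_term = torch.log1p((torch.exp(beta * (sim - lam)) * neg).sum(1)) / beta
    return (pos_term + neg_term).mean()
```

Pairing this loss with L2-normalised descriptors keeps training and retrieval consistent: both operate on cosine similarity.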