A dual-encoder architecture that uses CLIP's semantic understanding to gate DINOv2 spatial features — filtering out visual clutter (sky, vegetation) while preserving discriminative structure (buildings, textures) before SALAD aggregation.
Figure 1. VPR-Gated dual-encoder pipeline. DINOv2-L extracts high-resolution spatial features; CLIP ViT-L/14-336 produces a global semantic embedding that drives a learned gate. The gate suppresses non-discriminative spatial regions before SALAD aggregation yields the final 512-d place descriptor.
DINOv2-L produces dense spatial feature maps of shape (B, 1024, 24, 24) at 336×336 input resolution (patch size 14, so 336 / 14 = 24 tokens per side), capturing fine-grained textures, edge patterns, and building facades.
CLIP ViT-L/14-336 is a frozen image-text encoder that supplies a global 768-d semantic embedding of what the scene contains, which helps distinguish buildings from sky, foliage, or dynamic objects.
A lightweight MLP maps the CLIP vector to a sigmoid spatial mask of shape (B, 1, 24, 24). The mask is multiplied element-wise with the projected DINOv2 features, broadcasting across channels, which suppresses clutter and emphasises stable structures.
1×1 convolution reduces DINOv2's 1024-channel output to 512 channels before gating, balancing descriptor dimensionality with computational cost.
SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) with 64 clusters. Aggregates the gated spatial features into a compact, retrieval-ready 512-d global descriptor via a differentiable Sinkhorn assignment of local features to clusters.
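The projection and gating steps described above can be sketched in PyTorch as follows. This is a minimal sketch, not the project's actual implementation: the class name `SemanticGate`, the hidden width of 256, the ReLU activation, and the choice to emit one logit per grid cell are assumptions made for illustration. Only the tensor shapes (1024 to 512 channels, a 768-d CLIP vector, a 24×24 single-channel mask) come from the component descriptions.

```python
import torch
import torch.nn as nn


class SemanticGate(nn.Module):
    """Hypothetical gate: CLIP global vector -> per-location sigmoid mask."""

    def __init__(self, dino_dim=1024, proj_dim=512, clip_dim=768,
                 grid=24, hidden=256):
        super().__init__()
        self.grid = grid
        # 1x1 convolution: reduce 1024 -> 512 channels before gating.
        self.proj = nn.Conv2d(dino_dim, proj_dim, kernel_size=1)
        # Lightweight MLP (assumed layout): one logit per grid cell.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, grid * grid),
        )

    def forward(self, dino_feats, clip_vec):
        x = self.proj(dino_feats)                      # (B, 512, 24, 24)
        mask = torch.sigmoid(self.mlp(clip_vec))       # (B, 576), values in [0, 1]
        mask = mask.view(-1, 1, self.grid, self.grid)  # (B, 1, 24, 24)
        # The single-channel mask broadcasts across all 512 channels.
        return x * mask, mask


# Random tensors stand in for the real encoder outputs.
gate = SemanticGate()
dino_feats = torch.randn(2, 1024, 24, 24)
clip_vec = torch.randn(2, 768)
gated, mask = gate(dino_feats, clip_vec)
```

Because the mask has a single channel, every feature channel at a given grid location is scaled by the same factor, so the gate changes how much each location contributes to aggregation without mixing channels.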
Evaluated on the NYC (New York City) VPR benchmark with a 20 m geo-distance threshold. Recall@K (R@K) is the fraction of queries for which at least one of the top-K retrieved candidates is a true positive, i.e. lies within the threshold of the query's location.
| Method | Backbone | Aggregation | Input size (px) | R@1 (%) | R@5 (%) | R@10 (%) |
|---|---|---|---|---|---|---|
| NetVLAD | VGG-16 | VLAD | 224 | 22.4 | 38.1 | 45.3 |
| CosPlace | ResNet-50 | GeM | 512 | 28.6 | 47.2 | 55.8 |
| MixVPR | ResNet-50 | MLP | 320 | 30.1 | 50.4 | 59.0 |
| SALAD (DINOv2-L) | DINOv2-L | SALAD | 322 | 35.2 | 55.8 | 63.4 |
| VPR-Gated (Ours) | DINOv2-L + CLIP | SALAD | 336 | 37.37 | — | — |
* Baseline numbers are approximate references for context. Full evaluation including R@5, R@10 for VPR-Gated is ongoing. — = not yet measured.
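Recall@K as defined above can be computed along the following lines. The sketch assumes that positions are (latitude, longitude) pairs in degrees and uses the haversine formula for geo-distance. The function names and the toy coordinates are hypothetical, and the benchmark's own tooling may measure distance differently (for example, in projected UTM coordinates).

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def recall_at_k(ranked, query_pos, db_pos, ks=(1, 5, 10), threshold_m=20.0):
    """Percentage of queries with a true positive among the top-K candidates.

    ranked[q] lists database indices for query q, best match first.
    """
    hits = {k: 0 for k in ks}
    for q, ranking in enumerate(ranked):
        first = None  # 1-based rank of the first true positive, if any
        for rank, idx in enumerate(ranking[:max(ks)], start=1):
            if haversine_m(*query_pos[q], *db_pos[idx]) <= threshold_m:
                first = rank
                break
        for k in ks:
            if first is not None and first <= k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(ranked) for k in ks}


# Toy example: database images spaced roughly 111 m apart along a meridian.
db_pos = [(40.7580, -73.9855), (40.7590, -73.9855), (40.7600, -73.9855)]
query_pos = [(40.7581, -73.9855), (40.7600, -73.9855)]
ranked = [[0, 1, 2], [1, 2, 0]]  # query 1's correct match is only ranked second

result = recall_at_k(ranked, query_pos, db_pos, ks=(1, 2))
print(result)  # {1: 50.0, 2: 100.0}
```

In the toy example the first query is about 11 m from its top-ranked candidate, so it counts at K=1, while the second query only finds a match within 20 m at rank 2, which is why R@1 is 50% and R@2 is 100%.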