The Detector Teaches Itself:
Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

Queen Mary University of London, UK
Figure 1. We bridge the Global-Local feature gap in VLMs. Our method (DAT) adapts the model using self-generated pseudo-labels, significantly improving novel object detection.

Abstract

Open-vocabulary object detection (OVD) aims to recognize objects from an open set of categories by leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, which limits their effectiveness for region-level detection.

We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a pre-trained open-set detector, a closed-set detector, and a VLM, we first construct a region-aware pseudo-labeled dataset using the pre-trained closed-set detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled.

We then fine-tune the visual backbone of the VLM in a decoupled manner, enhancing local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that adds no inference overhead and fine-tunes fewer than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.

Methodology

The DAT framework consists of two stages: Region-Aware Data Construction & Decoupled Fine-Tuning.

Figure 2. Overview of the DAT methodology. We decouple the VLM visual backbone during the fine-tuning phase while keeping all other parameters frozen.

Region-Aware Data

We construct a domain-aligned dataset by cropping object proposals, bridging the gap between image-level pre-training and region-level understanding. The cropped regions, labeled with the closed-set detector's predictions, yield region-level training data generated entirely from the model's own outputs.
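The following is a minimal sketch of this region-aware data construction, assuming torchvision's Faster R-CNN as the closed-set detector; the score threshold and the sample format are illustrative choices, not the exact configuration used in DAT.

```python
# Minimal sketch of region-aware pseudo-label construction.
# Assumptions (illustrative, not from the paper): torchvision's Faster R-CNN
# as the closed-set detector and a 0.5 confidence threshold for proposals.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
to_tensor = transforms.ToTensor()

@torch.no_grad()
def build_region_crops(image_path, score_thresh=0.5):
    """Crop high-confidence proposals; each crop becomes one region-level
    training sample labeled with the detector's (pseudo) class id."""
    image = Image.open(image_path).convert("RGB")
    pred = detector([to_tensor(image)])[0]
    samples = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = box.round().int().tolist()
        crop = image.crop((x1, y1, x2, y2))
        # Crops containing novel objects may carry missing or incorrect
        # pseudo-labels, matching the setting described in the abstract.
        samples.append({"crop": crop, "pseudo_label": int(label), "score": float(score)})
    return samples
```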

Decoupled Fine-Tuning

We fine-tune only the visual encoder while keeping all other components frozen. To prevent catastrophic forgetting and preserve zero-shot capability, we ensemble the fine-tuned and original weights in weight space.
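Below is a minimal sketch of the decoupled setup with WiSE-FT-style weight interpolation, assuming a CLIP-like model whose visual encoder is exposed as `model.visual`; the unfrozen parameter subset and the mixing coefficient `alpha` are illustrative rather than DAT's exact settings.

```python
# Minimal sketch of decoupled fine-tuning with weight-space ensembling.
# Assumes a CLIP-style model with its visual tower under `model.visual`.
import copy
import torch

def freeze_all_but_visual(model):
    """Freeze the text tower and everything else; train only the visual encoder."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("visual.")

def interpolate_weights(zeroshot_model, finetuned_model, alpha=0.5):
    """Per-parameter interpolation:
    theta = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned,
    keeping the zero-shot (global) knowledge while adding local alignment."""
    merged = copy.deepcopy(zeroshot_model)
    zs_state = zeroshot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    merged_state = {}
    for k, v in zs_state.items():
        if torch.is_floating_point(v):
            merged_state[k] = (1 - alpha) * v + alpha * ft_state[k]
        else:
            merged_state[k] = v  # keep non-float buffers (e.g., counters) unchanged
    merged.load_state_dict(merged_state)
    return merged
```

Setting `alpha = 0` recovers the original zero-shot weights and `alpha = 1` uses the purely fine-tuned visual encoder; intermediate values trade off local alignment against the preserved global semantics.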

Experimental Results

Qualitative Comparison

Left: Baseline (CFM). Right: Ours (DAT).

[Twelve side-by-side example pairs: Baseline (CFM) vs. Ours (DAT)]

Figure 3. Qualitative comparison on COCO novel categories. Left, Baseline (CFM): misses novel objects or produces low-confidence detections. Right, Ours (DAT): successfully detects novel objects with higher confidence and better localization.

Table 1. Performance comparison on COCO with other open-vocabulary detectors (AP50 on novel, base, and all categories).

Method       | Pre-training          | Detection Training Data   | Novel | Base | All
End-to-end models
OV-DETR      | -                     | COCO, CLIP                | 29.4  | 61.0 | 52.7
DetCLIPv3    | FILIP, BLIP, GPT-4    | O365, V3Det, GranuCap50M  | 54.7  | 42.8 | 46.9
Cooperative models
ViLD         | -                     | COCO, CLIP                | 27.6  | 59.5 | 51.3
Detic        | ImageNet-21K          | COCO, IL, CC, CLIP        | 27.8  | 47.1 | 45.0
BARON        | SOCO dataset          | COCO, CLIP                | 34.0  | 60.4 | 53.5
CORA         | -                     | COCO, CLIP                | 35.1  | 35.5 | 35.4
BARON        | SOCO, MAVL            | COCO, CC, CLIP            | 42.7  | 54.9 | 51.7
CORA+        | -                     | COCO, CC, CLIP            | 43.1  | 60.9 | 56.2
CFM          | GDINO, SAM, CLIP      | COCO                      | 50.3  | 49.8 | 49.9
Ours (DAT)   | GDINO, SAM, CLIP (ft) | COCO                      | 70.1  | 55.5 | 59.3