FlowOVD: Learning Generative Latent Flows for
Zero-shot Open-vocabulary Detection

1 Queen Mary University of London, UK
2 École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

arXiv 2026

Traditionally, OVD is formulated as a discriminative prediction task through a Transformer encoder-decoder architecture. By introducing a generative perspective, our work attempts to move beyond conventional discrete query construction strategies.

Abstract

Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem in which decoder queries are either static or initialized from encoder features, which limits their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a detector built on a vision-language model (VLM), our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5% relative) and +4.1 AP (+15.0% relative), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
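The relative percentages quoted above follow from the absolute gains. A minimal arithmetic check is sketched below; note that the baseline scores are not quoted on this page and are only implied by subtracting each gain from the corresponding FlowOVD score, and `relative_gain` is a hypothetical helper name.

```python
def relative_gain(new_ap, abs_gain):
    """Return (implied baseline AP, relative gain in percent)."""
    baseline = new_ap - abs_gain          # implied baseline, not quoted in the text
    return baseline, 100.0 * abs_gain / baseline

coco_base, coco_rel = relative_gain(49.5, 1.2)   # COCO: 49.5 AP, +1.2 AP
lvis_base, lvis_rel = relative_gain(31.5, 4.1)   # LVIS: 31.5 AP, +4.1 AP

print(f"COCO: implied baseline {coco_base:.1f} AP, relative gain {coco_rel:.1f}%")
print(f"LVIS: implied baseline {lvis_base:.1f} AP, relative gain {lvis_rel:.1f}%")
```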

Overview

We present FlowOVD, a novel OVD approach that reformulates query initialization as a continuous generative process in latent space. We propose a text-conditioned query flow that transforms an initial set of text-agnostic queries into a text-guided distribution. Specifically, we adopt a rectified flow formulation to model a time-dependent velocity field that transports queries under language conditioning. This enables controllable and diverse query generation while remaining fully compatible with existing Transformer-based detectors.
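For orientation, the generic rectified-flow recipe that this description builds on can be written as follows. This is the standard formulation, not necessarily the exact objective used in FlowOVD, and the symbols are assumptions introduced here: q_0 is a text-agnostic source query, q_1 its matched text-guided target, c the text condition, and v_θ the learned velocity field.

```latex
\begin{aligned}
q_t &= (1 - t)\, q_0 + t\, q_1, \qquad t \in [0, 1], \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,(q_0, q_1)}
  \left\| v_\theta(q_t, t, c) - (q_1 - q_0) \right\|_2^2, \\
\frac{\mathrm{d} q_t}{\mathrm{d} t} &= v_\theta(q_t, t, c),
  \qquad \hat{q}_1 = q_0 + \int_0^1 v_\theta(q_t, t, c)\, \mathrm{d}t .
\end{aligned}
```

The first line defines the straight-line path between source and target, the second regresses the velocity field onto the constant displacement along that path, and the third generates a text-guided query by integrating the learned field from t = 0 to t = 1.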

(a) Overall framework. Given an image and a text prompt, a vision-language encoder extracts visual and textual features. In the latent space, a query flow transforms a set of text-agnostic queries into text-conditioned queries. The refined queries are then consumed by the Transformer decoder to produce the final predictions.

(b) Query flow. First, set-level matching is performed between source queries and target queries. Intermediate states are then obtained by linear interpolation between matched pairs. A conditional velocity field is trained to model the transport dynamics. The final queries are obtained by integrating the learned flow, which yields more diverse and text-aligned query representations.
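The four steps of the query flow (matching, interpolation, velocity regression, integration) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: every function name is hypothetical, the brute-force matcher stands in for whatever set-level matching the method uses, squared Euclidean distance is an assumed matching cost, and an oracle velocity replaces the trained, text-conditioned network.

```python
import itertools

def sq_dist(a, b):
    """Squared Euclidean distance between two query vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_sets(source, target):
    """Set-level one-to-one matching by brute force (only viable for tiny sets;
    an assumed stand-in for e.g. Hungarian matching)."""
    n = len(source)
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(sq_dist(source[i], target[perm[i]]) for i in range(n)),
    )
    return [target[j] for j in best]

def interpolate(q0, q1, t):
    """Intermediate state on the straight line from q0 (t=0) to q1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(q0, q1)]

def target_velocity(q0, q1):
    """Rectified-flow regression target: the constant displacement q1 - q0."""
    return [b - a for a, b in zip(q0, q1)]

def velocity_loss(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

def integrate(q0, velocity_fn, cond, steps=4):
    """Forward-Euler integration of dq/dt = velocity_fn(q, t, cond) over [0, 1]."""
    q, dt = list(q0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(q, k * dt, cond)
        q = [qi + dt * vi for qi, vi in zip(q, v)]
    return q

def make_oracle(q0, q1):
    """Oracle velocity field for the demo; a trained network would go here."""
    v = target_velocity(q0, q1)
    return lambda q, t, cond: v

# Toy 2-D "queries": two text-agnostic sources and two text-guided targets.
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[1.2, 0.9], [0.1, -0.1]]

matched = match_sets(source, target)             # step 1: set-level matching
midpoint = interpolate(source[0], matched[0], 0.5)  # step 2: intermediate state
generated = [                                    # step 4: integrate the flow
    integrate(q0, make_oracle(q0, q1), cond=None)
    for q0, q1 in zip(source, matched)
]
```

With the oracle velocity, Euler integration recovers each matched target up to floating-point error. In training, step 3 would minimize `velocity_loss` between a network's prediction at the interpolated state and `target_velocity`.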

Results

Extensive experiments on COCO and LVIS show that FlowOVD consistently improves open-vocabulary detection performance while being more efficient, using fewer decoder layers.

BibTeX

@article{wei2026flowovd,
  title={FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection},
  author={Wei, Yao and Cavallaro, Andrea and Oh, Changjae},
  journal={arXiv preprint arXiv:2606.00782},
  year={2026}
}