Abstract:
The growing demand for applications in open-world scenarios has driven progress in Open Vocabulary Object Detection (OVD), which aims to overcome the category constraints of traditional detection methods. Leveraging large models for region-text alignment has become a common OVD approach in recent years, but it faces challenges such as limited category space, high computational cost, and domain gaps. This paper proposes a cross-modal feature alignment method for OVD. Building on the CLIP and SAM models, the method integrates the semantic knowledge of CLIP with the spatial perception of SAM through a bidirectional knowledge transfer architecture, extending the “CLIP+SAM” combination from segmentation to detection. The cross-modal fusion module is plug-and-play and enables rich multi-modal interaction during encoding and decoding, improving performance with minimal computational overhead. On COCO novel categories, the method achieves an AP50 of 44.1, outperforming most existing methods and matching state-of-the-art results, while increasing the parameter count by only 10.3% and without significantly raising training or inference costs.