Open-Vocabulary Object Detection via Cross-Modal Feature Alignment with CLIP and SAM

  • Abstract: Open-world applications have driven the development of open-vocabulary object detection (OVD), which aims to overcome the fixed category set of traditional detectors. In recent years, using the prior knowledge of large pretrained models to achieve region-text alignment has become the mainstream approach to OVD, but it also brings problems such as a limited category space, high computational cost, and domain gaps. To address these problems, this paper proposes a cross-modal feature alignment method that fuses the semantic priors of CLIP with the spatial perception of SAM through a bidirectional knowledge transfer architecture, extending the “CLIP+SAM” paradigm from image segmentation to object detection. The cross-modal fusion is designed as a pluggable component that introduces multi-modal interaction in both the encoding and decoding stages, improving performance while keeping computational cost under control. Experiments show that on the novel categories of the COCO dataset the proposed model reaches an AP50 of 44.1, outperforming most existing methods and on par with the best reported result; the added modules increase the parameter count by only 10.3% and do not significantly raise training or inference cost.

     
