Open-Vocabulary Object Detection via Cross-Modal Feature Alignment with CLIP and SAM

王崇文; 徐浩; 郑治伟

doi:10.15918/j.tbit1001-0645.2025.159

Abstract: Open-world application scenarios have given rise to open-vocabulary object detection (OVD), which aims to overcome the category restrictions of traditional detection methods. In recent years, using the prior knowledge of large models to achieve region-text alignment has become the mainstream approach to OVD, but it brings problems such as a limited category space, high computational cost, and domain gaps. To address these problems, this paper proposes a cross-modal feature alignment method that fuses the semantic priors of CLIP with the spatial perception of SAM through a bidirectional knowledge transfer architecture, extending the “CLIP+SAM” paradigm from image segmentation to object detection. The cross-modal fusion is designed as a plug-and-play component that introduces multi-modal interaction in both the encoding and decoding stages, improving performance while keeping computational cost under control. Experiments show that, on the novel categories of the COCO dataset, the proposed model reaches an AP50 of 44.1, outperforming most existing methods and comparable to the best reported result; the added modules increase the parameter count by only 10.3% and do not significantly increase training or inference cost.
