Journal of East China Normal University (Natural Science), Issue (2): 65-75, 11. DOI: 10.3969/j.issn.1000-5641.2024.02.008
A one-stage visual grounding model based on dual-path multimodal interaction
Dual-path network with multilevel interaction for one-stage visual grounding
WANG Yue¹, YE Jiabo¹, LIN Xin¹
Author information
- 1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Abstract
This study explores multimodal understanding and reasoning for one-stage visual grounding. Existing one-stage methods extract visual feature maps and textual features separately, and then perform multimodal reasoning to predict the bounding box of the referred object. These methods suffer from two weaknesses. First, the pre-trained visual feature extractors introduce text-unrelated visual signals into the visual features, which hinder multimodal interaction. Second, the reasoning process in these methods lacks visual guidance for language modeling. These shortcomings limit the reasoning ability of existing one-stage methods. We propose a low-level interaction that extracts text-related visual feature maps, and a high-level interaction that incorporates visual features to guide language modeling and then performs multistep reasoning on the visual features. Based on the proposed interactions, we present a novel network architecture called the dual-path multilevel interaction network (DPMIN). Experiments on five commonly used visual grounding datasets demonstrate the superior performance of the proposed method and its real-time applicability.
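The two interactions described above can be illustrated with a minimal attention-based sketch. This is not the paper's actual DPMIN implementation; the function names, gating scheme, and dimensions are hypothetical, chosen only to show the idea of text-conditioned visual filtering (low-level) and visually guided multistep text updates (high-level).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_level_interaction(visual, text):
    """Sketch of the low-level interaction: each visual position
    attends over the text tokens, and the resulting text context
    gates the visual map so text-unrelated signals are suppressed."""
    attn = softmax(visual @ text.T)      # (HW, T) attention to tokens
    text_ctx = attn @ text               # (HW, D) per-position text context
    return visual * text_ctx             # text-related visual feature map

def high_level_interaction(visual, text, steps=2):
    """Sketch of the high-level interaction: the text representation
    queries the visual map and updates itself, repeated for several
    steps to mimic multistep reasoning."""
    for _ in range(steps):
        attn = softmax(text @ visual.T)  # (T, HW) attention to positions
        text = text + attn @ visual      # visually guided text update
    return text

# Toy sizes: a 7x7 feature map (49 positions), 5 tokens, feature dim 8.
HW, T, D = 49, 5, 8
rng = np.random.default_rng(0)
v = rng.standard_normal((HW, D))
t = rng.standard_normal((T, D))
v_rel = low_level_interaction(v, t)
t_out = high_level_interaction(v_rel, t)
print(v_rel.shape, t_out.shape)          # (49, 8) (5, 8)
```

In the real model both paths would use learned projections and feed a detection head that regresses the bounding box; the sketch only traces how information flows between the two modalities.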
Keywords
visual grounding / multimodal understanding / referring expressions

Classification
Information Technology and Security Science

Cite this article
WANG Yue, YE Jiabo, LIN Xin. Dual-path network with multilevel interaction for one-stage visual grounding [J]. Journal of East China Normal University (Natural Science), 2024, (2): 65-75, 11.