
Dual-path network with multilevel interaction for one-stage visual grounding
(基于双路径多模态交互的一阶段视觉定位模型)

王月 ¹, 叶加博 ¹, 林欣 ¹

华东师范大学学报(自然科学版), 2024(2): 65-75, 11. DOI: 10.3969/j.issn.1000-5641.2024.02.008

Author information

  • 1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China

Abstract

This study explores multimodal understanding and reasoning for one-stage visual grounding. Existing one-stage methods extract visual feature maps and textual features separately and then perform multimodal reasoning to predict the bounding box of the referred object. These methods suffer from two weaknesses. First, the pre-trained visual feature extractors introduce text-unrelated visual signals into the visual features, which hinder multimodal interaction. Second, the reasoning process in these methods lacks visual guidance for language modeling. These shortcomings limit the reasoning ability of existing one-stage methods. We propose a low-level interaction that extracts text-related visual feature maps, and a high-level interaction that incorporates visual features to guide language modeling and further performs multistep reasoning on the visual features. Based on these interactions, we present a novel network architecture called the dual-path multilevel interaction network (DPMIN). Experiments on five commonly used visual grounding datasets demonstrate the superior performance of the proposed method and its real-time applicability.
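The abstract describes two interaction levels: a low-level interaction that filters text-unrelated signals out of the visual feature map, and a high-level interaction in which visual context guides language modeling before multistep reasoning over the visual features. The sketch below is a minimal PyTorch-style illustration of that dual-path idea only; the module names (LowLevelInteraction, HighLevelInteraction, DPMINSketch), the FiLM-style modulation, the GRU-based reasoning loop, the feature dimensions, and the dense box head are all assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of a dual-path, multilevel interaction for one-stage
# visual grounding. All names and design details are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowLevelInteraction(nn.Module):
    """Modulate the visual feature map with the sentence embedding so that
    text-unrelated visual signals are suppressed (FiLM-style gating)."""
    def __init__(self, vis_dim, txt_dim):
        super().__init__()
        self.gamma = nn.Linear(txt_dim, vis_dim)
        self.beta = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_map, sent_emb):
        # vis_map: (B, C, H, W), sent_emb: (B, D)
        gamma = self.gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(sent_emb).unsqueeze(-1).unsqueeze(-1)
        return F.relu(vis_map * torch.sigmoid(gamma) + beta)


class HighLevelInteraction(nn.Module):
    """Let attended visual context update the language state over several
    reasoning steps, then project the final state back onto the feature map."""
    def __init__(self, vis_dim, txt_dim, steps=3):
        super().__init__()
        self.steps = steps
        self.lang_cell = nn.GRUCell(vis_dim, txt_dim)  # visual-guided language update
        self.query = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_map, word_feats):
        # vis_map: (B, C, H, W), word_feats: (B, L, D)
        B, C, H, W = vis_map.shape
        vis_tokens = vis_map.flatten(2).transpose(1, 2)          # (B, HW, C)
        lang_state = word_feats.mean(dim=1)                      # (B, D)
        for _ in range(self.steps):
            q = self.query(lang_state).unsqueeze(1)              # (B, 1, C)
            attn = torch.softmax((q * vis_tokens).sum(-1), -1)   # (B, HW)
            vis_ctx = (attn.unsqueeze(-1) * vis_tokens).sum(1)   # (B, C)
            lang_state = self.lang_cell(vis_ctx, lang_state)     # visual guidance
        fused = vis_tokens * self.query(lang_state).unsqueeze(1) # (B, HW, C)
        return fused.transpose(1, 2).reshape(B, C, H, W)


class DPMINSketch(nn.Module):
    """Toy end-to-end model: both interaction paths followed by a dense box
    head predicting (tx, ty, tw, th, confidence) at each spatial location."""
    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        self.low = LowLevelInteraction(vis_dim, txt_dim)
        self.high = HighLevelInteraction(vis_dim, txt_dim)
        self.box_head = nn.Conv2d(vis_dim, 5, kernel_size=1)

    def forward(self, vis_map, word_feats):
        sent_emb = word_feats.mean(dim=1)
        fused = self.high(self.low(vis_map, sent_emb), word_feats)
        return self.box_head(fused)                              # (B, 5, H, W)


if __name__ == "__main__":
    model = DPMINSketch()
    vis = torch.randn(2, 256, 20, 20)   # stand-in for backbone feature maps
    words = torch.randn(2, 12, 256)     # stand-in for text-encoder outputs
    print(model(vis, words).shape)      # torch.Size([2, 5, 20, 20])
```

The per-location box head here merely stands in for whatever one-stage detection head the paper actually uses; the point of the sketch is the ordering of the two interactions, not the prediction format.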

Key words

visual grounding / multimodal understanding / referring expressions

Classification

Information technology and security science

Cite this article

王月, 叶加博, 林欣. 基于双路径多模态交互的一阶段视觉定位模型 [J]. 华东师范大学学报(自然科学版), 2024(2): 65-75, 11.

华东师范大学学报(自然科学版) · ISSN 1000-5641 · OA · 北大核心 · CSTPCD
