Multimodal visual question answering based on fine-grained feature enhancement

Wang Zhiwei (王志伟), Lu Zhenyu (陆振宇)

Journal of Nanjing University of Information Science & Technology, 2026, Vol. 18, Issue 1: 35-47, 13.

DOI: 10.13878/j.cnki.jnuist.20250107001

Multimodal visual question answering based on fine-grained feature enhancement

Wang Zhiwei¹, Lu Zhenyu²

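Author information

1. School of Computer Science, Nanjing University of Information Science & Technology, Nanjing, 210044
2. School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing, 210044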
Abstract

Existing multimodal visual question answering (VQA) models often overlook the fine-grained interaction between salient local regions in images and local basic words in the text, leaving room to improve the semantic relevance between images and text. Here, we propose a novel multimodal VQA framework based on fine-grained feature enhancement. First, a fine-grained feature extraction approach is added to both the visual and textual modalities to capture more detailed semantic features from images and questions. Then, to exploit cross-modal alignment cues, an alignment-guided self-attention module is proposed that aligns fine-grained features with global semantic features within each single modality (visual or textual) and fuses unimodal information at different levels in a unified way. Finally, experiments are conducted on the VQA v2.0 and VQA-CP v2 datasets. The results show that the proposed method outperforms existing models across multiple evaluation metrics.

Key words

visual question answering (VQA)/multimodality/fine-grained/feature enhancement/entity alignment/feature fusion

Classification

Information technology and security science

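Cite this article

Wang Zhiwei, Lu Zhenyu. Multimodal visual question answering based on fine-grained feature enhancement [J]. Journal of Nanjing University of Information Science & Technology, 2026, 18(1): 35-47, 13.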

Funding

Zhejiang Provincial Natural Science Foundation Joint Fund Project (LZJMD25D050002)

National Natural Science Foundation of China Joint Key Project (U20B2061)

Journal of Nanjing University of Information Science & Technology

ISSN: 1674-7070
