Journal of Nanjing University of Information Science and Technology, 2026, Vol. 18, Issue (1): 35-47, 13. DOI: 10.13878/j.cnki.jnuist.20250107001
基于细粒度特征增强的多模态视觉问答研究
Multimodal visual question answering based on fine-grained feature enhancement
Abstract
Existing multimodal visual question answering (VQA) models often overlook the fine-grained interaction between locally salient image regions and basic local words in the text, which leaves room to improve the semantic relevance between images and text. Here, we propose a novel multimodal VQA framework based on fine-grained feature enhancement. First, a fine-grained feature extraction approach is added to both the visual and textual modalities to capture more detailed semantic features from images and questions. Then, to exploit cross-modal alignment cues, an alignment-guided self-attention module is proposed that aligns fine-grained features with global semantic features within each single modality (visual or textual) and fuses unimodal information at different levels in a unified way. Finally, experiments are conducted on the VQA v2.0 and VQA-CP v2 datasets. The results show that the proposed method outperforms existing models across multiple evaluation metrics.
Key words
visual question answering (VQA) / multimodality / fine-grained / feature enhancement / entity alignment / feature fusion
Classification
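Information technology and security science
Cite this article
WANG Zhiwei, LU Zhenyu. Multimodal visual question answering based on fine-grained feature enhancement[J]. Journal of Nanjing University of Information Science and Technology, 2026, 18(1): 35-47, 13.
Funding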
Joint Fund Project of the Zhejiang Provincial Natural Science Foundation (LZJMD25D050002)
Joint Key Project of the National Natural Science Foundation of China (U20B2061)