计算机应用研究 (Application Research of Computers), 2025, Vol. 42, Issue 7: 1986-1993, 8. DOI: 10.19734/j.issn.1001-3695.2024.11.0492
反向聚焦细粒度多模态语义对齐的视频字幕模型
Reverse-focus fine-grained multimodal semantic alignment for video captioning
Abstract
Existing video captioning methods often introduce multimodal information to help models extract critical, fine-grained details from complex and dynamic visual content. However, these methods tend to overlook the semantic gaps caused by representational differences among modalities. To bridge these gaps, facilitate effective cross-modal alignment and efficient fusion, and enhance the extraction of fine-grained semantic information, this paper proposes a reverse-focus fine-grained multimodal semantic alignment model for video captioning (RM4Cap). The model incorporates an image-text pair corpus and performs semantic alignment between video and image, thereby indirectly aligning video representations with the text in the image-text pairs. It also introduces a reverse attention focusing algorithm that suppresses redundant scene information while highlighting inconspicuous objects and their interactions. Experiments on the MSVD and MSRVTT datasets show that the model significantly outperforms existing methods on metrics such as CIDEr and BLEU-4. It effectively resolves the alignment challenges and redundancy issues in multimodal fusion, further demonstrating its ability to narrow the cross-modal semantic gap.
Keywords
video captioning/multimodal/reverse attention/semantic alignment/semantic gap
Classification
Information Technology and Security Science
Cite this article
蔡霞, 罗会兰, 万斯奇. Reverse-focus fine-grained multimodal semantic alignment for video captioning [J]. 计算机应用研究, 2025, 42(7): 1986-1993, 8.
Funding
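National Natural Science Foundation of China (62361032)
Jiangxi Provincial Program for Leading Talents among Academic and Technical Leaders of Major Disciplines (20213BCJ22004)
Key Project of the Jiangxi Provincial Natural Science Foundation (20232ACB202011)
Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control (2024SSY03161)
Jiangxi Provincial Postgraduate Innovation Special Fund (YC2023-S657)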