
Reverse-focus fine-grained multimodal semantic alignment for video captioning

Cai Xia (蔡霞)¹, Luo Huilan (罗会兰)¹, Wan Siqi (万斯奇)¹

Application Research of Computers (计算机应用研究), 2025, Vol. 42, Issue 7: 1986-1993, 8. DOI: 10.19734/j.issn.1001-3695.2024.11.0492

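Author information

  • 1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341100, Jiangxi, China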

Abstract

Existing video captioning methods often introduce multimodal information to help models extract critical, fine-grained details from complex and dynamic visual content. However, these methods tend to overlook the semantic gaps caused by representational differences among modalities. To bridge these gaps, enable effective cross-modal alignment and efficient fusion, and improve the extraction of fine-grained semantic information, this paper proposes a reverse-focus fine-grained multimodal semantic alignment model for video captioning (RM4Cap). The model incorporates an image-text pair corpus and performs semantic alignment between video and image, which indirectly aligns video representations with the text in the image-text pairs. It also introduces a reverse attention focusing algorithm that suppresses redundant scene information while highlighting inconspicuous objects and their interactions. Experiments on the MSVD and MSRVTT datasets show that the model significantly outperforms existing methods on metrics such as CIDEr and BLEU-4. The results indicate that it effectively addresses the alignment and redundancy problems in multimodal fusion and narrows the cross-modal semantic gap.

Keywords

video captioning/multimodal/reverse attention/semantic alignment/semantic gap

Classification

Information technology and security science

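Cite this article

Cai Xia, Luo Huilan, Wan Siqi. Reverse-focus fine-grained multimodal semantic alignment for video captioning[J]. Application Research of Computers, 2025, 42(7): 1986-1993, 8.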

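Funding

National Natural Science Foundation of China (62361032)

Jiangxi Provincial Program for Leading Talents among Major Academic and Technical Leaders (20213BCJ22004)

Key Project of the Natural Science Foundation of Jiangxi Province (20232ACB202011)

Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control (2024SSY03161)

Jiangxi Provincial Graduate Innovation Special Fund (YC2023-S657)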

Application Research of Computers

Open access; Peking University Core Journal

ISSN 1001-3695
