Journal of Data Acquisition and Processing, 2025, Vol. 40, Issue 5: 1165-1176, 12. DOI: 10.16337/j.1004-9037.2025.05.005

Emotional Video Captioning Based on Fine-Grained Visual and Audio-Visual Dual-Branch Fusion

Gong Yuxuan¹, Han Tingting¹

Author information

1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018

Abstract

Emotional video captioning is a cross-modal task that combines visual semantic parsing with emotion perception; its core challenge is to accurately capture the emotional cues embedded in visual content. Existing methods have two notable limitations. First, they insufficiently explore the fine-grained semantic correlations between video subjects (such as humans and objects) and their appearance and motion features, so visual content understanding lacks fine-grained support. Second, they neglect the auxiliary value of the audio modality for emotion discrimination and semantic alignment, which limits the use of cross-modal information. To address these issues, this paper proposes a framework based on fine-grained visual and audio-visual dual-branch fusion. Specifically, the fine-grained visual feature fusion module models the fine-grained semantic associations between video entities and their visual context through pairwise interaction and deep integration of visual, object, and motion features, yielding a fine-grained parse of the video content. The audio-visual dual-branch global fusion module builds a cross-modal interaction channel that deeply fuses the integrated visual features with audio features, exploiting the supplementary role of audio in conveying emotional cues and constraining semantics. Experiments on public benchmark datasets show that the proposed method outperforms comparison methods such as CANet and EPAN across evaluation metrics. It achieves an average improvement of 4% over the EPAN method in emotional metrics, an average increase of 0.5 in semantic metrics, and an average gain of 0.7 in comprehensive metrics. These results demonstrate that the proposed method effectively improves the quality of emotional video captioning.
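The two modules named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the code assumes that "pairwise interaction" can be approximated by parameter-free scaled dot-product cross-attention with residual connections, that all feature streams share one dimension, and that fusion is done by concatenating along the sequence axis. All function names (`cross_attend`, `fine_grained_visual_fusion`, `audio_visual_fusion`, `random_features`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    """Scaled dot-product cross-attention; keys double as values.

    Assumption: no learned projections, purely to show the data flow.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)])
    return out


def add(a, b):
    """Elementwise sum of two same-shaped sequences (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fine_grained_visual_fusion(visual, objects, motion):
    """Each stream attends to the other two; the enriched streams are concatenated."""
    streams = [visual, objects, motion]
    enriched_streams = []
    for i, query_stream in enumerate(streams):
        enriched = query_stream
        for j, other in enumerate(streams):
            if i != j:
                enriched = add(enriched, cross_attend(query_stream, other))
        enriched_streams.append(enriched)
    return [row for stream in enriched_streams for row in stream]


def audio_visual_fusion(visual_fused, audio):
    """Two branches: vision attends to audio, audio attends to vision."""
    visual_branch = add(visual_fused, cross_attend(visual_fused, audio))
    audio_branch = add(audio, cross_attend(audio, visual_fused))
    return visual_branch + audio_branch


def random_features(n, d, seed):
    """Stand-in features; a real system would use pretrained encoders."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
```

With, say, 4 frame-level visual vectors, 3 object vectors, 4 motion vectors, and 5 audio vectors of dimension 8, `fine_grained_visual_fusion` returns 11 vectors and `audio_visual_fusion` returns 16, which a caption decoder could then attend over.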

Key words

emotional video captioning/cross-modal emotional perception/fine-grained feature fusion/attention mechanism/video understanding

Classification

Information Technology and Security Science

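Cite this article

Gong Yuxuan, Han Tingting. Emotional Video Captioning Based on Fine-Grained Visual and Audio-Visual Dual-Branch Fusion[J]. Journal of Data Acquisition and Processing, 2025, 40(5): 1165-1176, 12.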

Journal of Data Acquisition and Processing

Open access (OA); Peking University Core Journal

ISSN: 1004-9037
