信号处理 (Journal of Signal Processing), 2025, Vol. 41, Issue 2: 279-289. DOI: 10.12466/xhcl.2025.02.007
Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer
Abstract
Video captioning aims to describe the content of videos in natural language, offering extensive applications in areas such as human-computer interaction, assistance for visually impaired individuals, and sports commentary. However, the complex spatiotemporal variations within videos make it challenging to generate accurate captions. Previous methods have attempted to enhance caption quality by extracting spatiotemporal features and leveraging prior information. Despite these efforts, they often struggle with joint spatiotemporal modeling, which can lead to inadequate visual information extraction and degrade the quality of generated captions. To address this challenge, we propose a novel model, ST2, which enhances joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) known for its global receptive field and linear computational complexity. By combining Mamba with the Transformer framework, we introduce a Spatially Enhanced SSM and Transformer (SH-ST) module that overcomes the receptive-field limitations of convolutional approaches while reducing computational complexity, thereby improving the model's ability to extract spatial information. To further strengthen temporal modeling, we exploit Mamba's temporal scanning characteristics in conjunction with the global modeling capabilities of the Transformer, yielding a Temporally Enhanced SSM and Transformer (TH-ST) module. Specifically, the features generated by SH-ST are reordered so that Mamba can enhance the temporal relationships of these rearranged features through cross-scanning, after which the Transformer further bolsters temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs within our ST2 model, which achieves competitive results on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results by 6.9% and 2.6% in absolute CIDEr score on the MSVD and MSR-VTT datasets, respectively, and exceeds the baseline by 4.9% in absolute CIDEr score on MSVD.
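To make the pattern described in the abstract concrete, the following minimal PyTorch sketch illustrates the general idea of pairing a linear-time SSM-style scan with a Transformer layer, applied first over the spatial axis and then, after reordering the features, over the temporal axis. All class names, shapes, and the toy recurrent scan are illustrative assumptions; this is not the authors' ST2 implementation, and a real system would use an actual Mamba block with selective cross-scanning rather than this stand-in.

    import torch
    import torch.nn as nn

    class SimpleSSMBlock(nn.Module):
        """Toy linear-time recurrent scan standing in for a Mamba block."""
        def __init__(self, dim: int):
            super().__init__()
            self.in_proj = nn.Linear(dim, dim)
            self.decay = nn.Parameter(torch.full((dim,), 0.9))  # per-channel state decay
            self.out_proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, length, dim). A sequential scan gives every position a
            # global receptive field at O(length) cost, unlike quadratic attention.
            u = self.in_proj(x)
            state = torch.zeros(x.size(0), x.size(2), device=x.device)
            outs = []
            for t in range(x.size(1)):
                state = self.decay * state + u[:, t]
                outs.append(state)
            return self.out_proj(torch.stack(outs, dim=1))

    class SSMTransformerBlock(nn.Module):
        """SSM scan for long-range structure, then a Transformer layer for global mixing."""
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.ssm = SimpleSSMBlock(dim)
            self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.attn(self.ssm(x))

    # Usage with video features of shape (batch, T frames, N patches, dim).
    B, T, N, D = 2, 8, 16, 64
    feats = torch.randn(B, T, N, D)
    spatial = SSMTransformerBlock(D)   # SH-ST analogue: scan patches within each frame
    temporal = SSMTransformerBlock(D)  # TH-ST analogue: scan each patch across frames
    x = spatial(feats.reshape(B * T, N, D)).reshape(B, T, N, D)
    x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # reorder so the scan runs over time
    x = temporal(x).reshape(B, N, T, D).permute(0, 2, 1, 3)

The key design point the sketch mirrors is the reordering step: the same sequence module can enhance either spatial or temporal relationships purely by changing which axis is flattened into the scan dimension.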
Keywords
video captioning / video understanding / state space model / Transformer
Classification
Information Technology and Security Science
Cite this article
Sun Haoying, Li Shuyi, Xi Zeyu, Wu Lifang. Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer[J]. Journal of Signal Processing, 2025, 41(2): 279-289.
Funding
The National Natural Science Foundation of China (62236010, 62306021)