
结合状态空间模型和Transformer的时空增强视频字幕生成

孙昊英 李树一 习泽宇 毋立芳

信号处理 2025, Vol. 41, Issue (2): 279-289, 11. DOI: 10.12466/xhcl.2025.02.007


Spatiotemporal Enhancement of Video Captioning Integrating a State Space Model and Transformer


Author Information

  • 1. School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China


Abstract

Video captioning aims to describe the content of videos in natural language, with extensive applications in areas such as human-computer interaction, assistance for visually impaired individuals, and sports commentary. However, the complex spatiotemporal variations within videos make it challenging to generate accurate captions. Previous methods have attempted to enhance caption quality by extracting spatiotemporal features and leveraging prior information. Despite these efforts, they often struggle with joint spatiotemporal modeling, which can lead to inadequate visual information extraction and degrade the quality of the generated captions. To address this challenge, we propose a novel model, ST2, which enhances joint spatiotemporal modeling by incorporating Mamba, a recently popular state space model (SSM) known for its global receptive field and linear computational complexity. By combining Mamba with the Transformer framework, we introduce a Spatially Enhanced SSM and Transformer (SH-ST) module that overcomes the receptive-field limitations of convolutional approaches while reducing computational complexity, thereby improving the model's ability to extract spatial information. To further strengthen temporal modeling, we exploit Mamba's temporal scanning characteristics in conjunction with the global modeling capability of the Transformer, yielding a Temporally Enhanced SSM and Transformer (TH-ST) module. Specifically, the features generated by SH-ST are reordered so that Mamba can enhance the temporal relationships of the rearranged features through cross-scanning, after which the Transformer further strengthens temporal modeling. Experimental results validate the effectiveness of the SH-ST and TH-ST designs within our ST2 model, which achieves competitive results on the widely used video captioning datasets MSVD and MSR-VTT. Notably, our method surpasses state-of-the-art results, achieving absolute CIDEr improvements of 6.9% and 2.6% on the MSVD and MSR-VTT datasets, respectively, and exceeding the baseline by 4.9% absolute CIDEr on MSVD.
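The abstract relies on two ideas: an SSM recurrence that processes a sequence in linear time, and a cross-scan that runs that recurrence in both temporal orders so each position receives context from both directions. The following is a minimal illustrative sketch, not the authors' implementation: it uses a scalar state and hand-picked coefficients (`a`, `b`, `c` are hypothetical), whereas Mamba uses learned, input-dependent, high-dimensional parameters.

```python
# Hedged sketch of the two ingredients named in the abstract:
# (1) a linear state-space scan, (2) a bidirectional "cross-scan".
# All coefficients and function names here are illustrative assumptions.

def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Linear SSM recurrence: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t.
    Runs in O(T) for a length-T sequence, unlike O(T^2) self-attention."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def cross_scan(xs):
    """Scan forward and backward, then average the two passes, so every
    position aggregates context from both temporal directions."""
    fwd = ssm_scan(xs)
    bwd = ssm_scan(xs[::-1])[::-1]
    return [(f + w) / 2.0 for f, w in zip(fwd, bwd)]

# Toy usage: four scalar "frame features" in temporal order.
feats = [1.0, 0.0, 0.0, 2.0]
print(cross_scan(feats))
```

In ST2 the analogous scan would operate on reordered SH-ST feature maps, with a Transformer applied afterwards for global temporal modeling; this toy version only shows why the scan itself stays linear in sequence length.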


Key words

video captioning; video understanding; state space model; Transformer

Category

Information Technology and Security Science

Cite This Article

孙昊英, 李树一, 习泽宇, 毋立芳. 结合状态空间模型和Transformer的时空增强视频字幕生成[J]. 信号处理, 2025, 41(2): 279-289, 11.

Funding

The National Natural Science Foundation of China (62236010, 62306021)

信号处理 (open access; PKU core journal), ISSN 1003-0530
