Computer Engineering and Applications, 2025, Vol. 61, Issue (21): 182-191, 10. DOI: 10.3778/j.issn.1002-8331.2407-0330
Research on Multi-Modal Video Paragraph Captioning Based on Dual-Transformer Structure
Abstract
To address the insufficient focus on key events in videos and the lack of coherence between multi-event descriptions in existing video paragraph captioning methods, this paper proposes a multi-modal video paragraph captioning model based on a dual-Transformer structure, built on the existing encoder-decoder framework. The model uses Faster-RCNN to extract fine-grained features from keyframes in the video, and a hybrid attention mechanism to combine them with global visual features, selecting the most representative fine-grained local visual features. This enhances the information of key events in the video and improves the accuracy of content descriptions. In addition, the paper introduces a memory module and a hybrid attention module into the Transformer structure and designs a dual-Transformer architecture: the internal Transformer models intra-event consistency, while the external Transformer models inter-event consistency by using hybrid attention to compute the relevance between the current event and other events. The outputs of the internal and external Transformers are combined to predict the content of events, improving the coherence of the generated descriptions. Experiments on the ActivityNet Captions and YouCookII datasets demonstrate that the proposed model significantly outperforms existing mainstream video paragraph captioning models on the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics, verifying its effectiveness.

Key words

video paragraph captioning / encoder-decoder framework / fine-grained local visual features / dual-Transformer structure

Classification
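The two core ideas in the abstract, attention-based selection of fine-grained local features under a global query, and the fusion of internal (intra-event) and external (inter-event) Transformer outputs before prediction, can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the function names, the use of the query vector as the global feature, and the fixed fusion weight `alpha` are all illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(query, keys, values):
    """Scaled dot-product attention: a global query vector weights
    fine-grained local feature vectors (here keys double as values)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of local features -> one pooled representative feature.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def fuse(internal_out, external_out, alpha=0.5):
    """Combine internal (intra-event) and external (inter-event)
    Transformer outputs; a convex combination is assumed here."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(internal_out, external_out)]
```

In practice the fusion weight would be learned and the features would be high-dimensional tensors; the sketch only shows the information flow described in the abstract.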
Computer and Automation

Cite this article
赵宏, 张立军. Research on Multi-Modal Video Paragraph Captioning Based on Dual-Transformer Structure [J]. Computer Engineering and Applications, 2025, 61(21): 182-191, 10.

Funding
National Natural Science Foundation of China (62166025)