Computer Engineering and Applications, 2024, Vol. 60, Issue 17: 89-97 (9 pages). DOI: 10.3778/j.issn.1002-8331.2306-0024
Improved Transformer Decoding Algorithm for Dense Video Description
Abstract
When a Transformer is applied to dense video description, historical text features can interfere with subsequent text generation, making it difficult to capture dynamic video information and reducing the coherence and accuracy of the descriptions. To maintain context consistency while mitigating historical text noise, this paper proposes an improved Transformer decoding algorithm for dense video description, called D-Uformer. The algorithm uses a feedforward neural network (FNN) to enhance the representation of historical text features. It constructs pruning branches to remove redundant information and compensatory branches to enhance contextual information through skip connections. It uses subtraction to reduce the inaccurate descriptions caused by over-focusing on historical text features, which increases the model's attention to the input video features. It uses addition to compensate for the loss of contextual information during feature transfer, so that the generated descriptions of the current video content are accurate and coherent. Experimental results on the ActivityNet and Charades datasets show that D-Uformer improves performance significantly. Compared with the temporally descriptive probabilistic captioning (TDPC) network, it achieves a maximum accuracy improvement of 4.816% and a maximum diversity improvement of 4.167%. The generated descriptions align better with the video content and conform more closely to human language conventions.
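The abstract describes the subtract-then-add idea only in words, so a small sketch may help make it concrete. The code below is not the authors' D-Uformer implementation. It is a minimal illustration under stated assumptions: the function names, the scaling weights `alpha` and `beta`, and the choice of which vectors are subtracted and added are all hypothetical, since the abstract does not specify them. It assumes a pruning step that subtracts FNN-enhanced historical text features from the decoder state, followed by a compensatory step that adds the raw historical features back through a skip connection.

```python
# Illustrative sketch only. Names, alpha/beta, and the operands of the
# subtraction and addition are assumptions, not taken from the paper.

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(v, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weight, bias)]

def ffn(v, w1, b1, w2, b2):
    """Two-layer feedforward network: linear -> ReLU -> linear."""
    return linear(relu(linear(v, w1, b1)), w2, b2)

def prune_and_compensate(history, decoder_state, ffn_params,
                         alpha=0.5, beta=0.5):
    """Return (pruned, compensated) feature vectors.

    history       -- historical text features (assumed input)
    decoder_state -- features mixing text and video context (assumed input)
    alpha, beta   -- hypothetical branch weights
    """
    # Enhance the historical text representation with the FNN.
    enhanced = ffn(history, *ffn_params)
    # Pruning branch (assumed form): subtract the enhanced history so the
    # state depends less on previously generated text.
    pruned = [s - alpha * e for s, e in zip(decoder_state, enhanced)]
    # Compensatory branch (assumed form): add the raw history back through
    # a skip connection to restore context lost in the subtraction.
    compensated = [p + beta * h for p, h in zip(pruned, history)]
    return pruned, compensated

# Toy usage with identity weights and zero biases.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
params = (identity, zeros, identity, zeros)
pruned, compensated = prune_and_compensate([1.0, -2.0], [3.0, 4.0], params)
print(pruned, compensated)
```

In a real decoder these operations would act on batched tensors inside each decoding layer, and the branch weights would more likely be learned than fixed. The plain-list version here only shows the order of operations.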
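Keywords
dense video description / Transformer network / decoding / feedforward neural network / skip connection
Classification
Information Technology and Security Science
Cite this article
YANG Dawei (杨大伟), PAN Xiaofang (盘晓芳), MAO Lin (毛琳), ZHANG Rubo (张汝波). Improved Transformer decoding algorithm for dense video description[J]. Computer Engineering and Applications, 2024, 60(17): 89-97.
Funding
National Natural Science Foundation of China (61673084)
Natural Science Foundation of Liaoning Province (20180550866, 2020-MZLH-24)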