Computer Engineering and Applications, 2025, Vol. 61, Issue (4): 176-191, 16. DOI: 10.3778/j.issn.1002-8331.2309-0403
Continual Image Captioning with Dynamic Token-Used Fusion Feature
Abstract
Architectures based on self-attention mechanisms, such as the Transformer, exhibit outstanding performance in image captioning tasks. However, in most of these approaches, models are trained only on static, identically distributed datasets. In reality, data mostly arrive as non-independent and non-identically distributed streams, which makes continual image captioning under such settings more challenging. Notably, there is limited research on continual learning for multi-modal tasks such as image captioning, and there is a lack of continual image captioning methods well suited to self-attention-based models. To address these challenges, a continual image captioning method with a dynamic Token-based fusion feature is proposed. The features of the different data modalities involved in image captioning are fused within the Transformer, and regularization is applied to the fusion features. A Token is designated for each subtask and changes as subtasks are switched, which is called the dynamic Token. Compared with a static Token that is defined only once for the whole training phase and shared by all subtasks, the dynamic Token better preserves the information and characteristics specific to each subtask. Using these dynamic task Tokens, a task-identity fusion feature attention module further obtains fusion features carrying task-identity information, and the corresponding Token is saved after the training of each subtask to maintain the model's memory of and expressive capability for previous tasks and to reduce catastrophic forgetting. Experimental results on the MS-COCO and Flickr30k datasets demonstrate that the proposed method outperforms all baseline methods within the Transformer architecture. For instance, the average CIDEr score after completing all training tasks is improved by 31.06% compared with fine-tuning and by 13.94% compared with the best-performing baseline method.
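As a rough illustration of the mechanism described in the abstract, the following is a minimal PyTorch-style sketch of a per-subtask dynamic Token combined with a fusion-feature attention step. It is not the paper's implementation: the class name DynamicTokenFusion, the method save_task_token, and all shapes and hyperparameters are hypothetical assumptions.

```python
import torch
import torch.nn as nn


class DynamicTokenFusion(nn.Module):
    """Sketch of a per-subtask (dynamic) task Token plus a fusion-feature
    attention step; names and shapes are illustrative, not from the paper."""

    def __init__(self, d_model: int, num_tasks: int, num_heads: int = 8):
        super().__init__()
        # One learnable Token per subtask; the active Token is selected by
        # task_id, so it changes whenever the subtask switches (dynamic),
        # unlike a single static Token shared by all subtasks.
        self.task_tokens = nn.Parameter(torch.randn(num_tasks, 1, d_model))
        self.fusion_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.saved_tokens = {}  # frozen copies kept after each subtask

    def forward(self, fused_feats: torch.Tensor, task_id: int) -> torch.Tensor:
        # fused_feats: (batch, seq_len, d_model) image-text fusion features
        # produced inside the Transformer.
        batch = fused_feats.size(0)
        token = self.task_tokens[task_id].expand(batch, 1, -1)
        # Attend from the current task Token to the fusion features so the
        # output carries task-identity information.
        task_feat, _ = self.fusion_attn(token, fused_feats, fused_feats)
        return torch.cat([task_feat, fused_feats], dim=1)

    def save_task_token(self, task_id: int) -> None:
        # Store a frozen copy of the Token once its subtask finishes training,
        # preserving that task's characteristics against forgetting.
        self.saved_tokens[task_id] = self.task_tokens[task_id].detach().clone()


# Example: feature fusion for subtask 0, then saving its Token.
module = DynamicTokenFusion(d_model=512, num_tasks=5)
feats = torch.randn(2, 50, 512)   # fused image-text features
out = module(feats, task_id=0)    # (2, 51, 512)
module.save_task_token(0)
```

Switching task_id when a new subtask begins plays the role of the dynamic Token, while save_task_token keeps a frozen copy so earlier subtasks remain represented after training moves on.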
Key words
image captioning / continual learning / Transformer / fusion feature / dynamic Token / regularization
Classification
Computer and Automation
Citation
晋嘉利, 余璐. Continual Image Captioning with Dynamic Token-Used Fusion Feature[J]. Computer Engineering and Applications, 2025, 61(4): 176-191, 16.
Funding
National Natural Science Foundation of China Youth Program (62202331)
Tianjin University of Technology Graduate Research and Innovation Practice Project (YJ2246)