

Continual Image Captioning with Dynamic Token-Used Fusion Feature

Abstract


Architectures based on self-attention mechanisms, such as the Transformer, exhibit outstanding performance in image captioning tasks. However, in most of these approaches, models are trained only on static, identically distributed datasets, whereas real-world data distributions are mostly non-independent, non-identically distributed data streams, which makes continual image captioning under such settings more challenging. Notably, there is limited research on continual learning for multi-modal tasks such as image captioning, and a lack of continual image captioning methods well suited to self-attention-based models. To address these challenges, a continual image captioning method with a dynamic-Token fusion feature is proposed. The data features of the different modalities involved in image captioning are fused within the Transformer, and regularization is applied to the fusion features. A Token is defined for each subtask and changes as subtasks are switched; this is called the dynamic Token. Compared with a static Token that is defined only once for the whole training phase and shared by all subtasks, the dynamic Token better preserves the information and characteristics specific to each subtask. Using these dynamic task Tokens, a task-identity fusion-feature attention module further obtains fusion features carrying task-identity information, and the Token corresponding to each subtask is saved after that subtask's training ends, maintaining the model's memory and expressive capability for previous tasks and reducing catastrophic forgetting. Experimental results on the MS-COCO and Flickr30k datasets demonstrate that the proposed method outperforms all baseline methods within the Transformer architecture. For instance, the average CIDEr score after completing all training tasks is increased by 31.06% compared with fine-tuning and by 13.94% compared with the best-performing baseline method.
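The dynamic-Token bookkeeping described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `DynamicTokenStore` and `fuse_with_token` are hypothetical, and in the actual method the tokens would be trained jointly with the Transformer rather than initialized and left fixed. The sketch only shows the lifecycle that distinguishes dynamic from static Tokens: one token per subtask, frozen and saved when that subtask's training ends, and prepended to the fused feature sequence to inject task-identity information.

```python
import numpy as np

class DynamicTokenStore:
    """One learnable task token per subtask (hypothetical sketch).

    Unlike a single static token shared by all subtasks, each subtask
    gets its own token; the token is frozen and saved when the
    subtask's training finishes, preserving old-task information."""

    def __init__(self, dim):
        self.dim = dim
        self.frozen = {}        # task_id -> saved token of a finished subtask
        self.current = None     # token of the subtask currently being trained
        self.current_id = None

    def start_task(self, task_id, rng):
        # A fresh token is created when the data stream switches subtasks.
        self.current_id = task_id
        self.current = rng.standard_normal(self.dim) * 0.02

    def finish_task(self):
        # Save the trained token so the model keeps task-specific information.
        self.frozen[self.current_id] = self.current.copy()
        self.current = None
        self.current_id = None

    def token_for(self, task_id):
        if task_id == self.current_id and self.current is not None:
            return self.current
        return self.frozen[task_id]

def fuse_with_token(fusion_feature, token):
    # Prepend the task token to the fused multi-modal feature sequence
    # (shape: [seq_len, dim]) before it enters the Transformer layers.
    return np.concatenate([token[None, :], fusion_feature], axis=0)
```

Here the saved token for an old subtask stays untouched while a new subtask trains its own token, which is the property the method relies on to reduce catastrophic forgetting.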

JIN Jiali; YU Lu

School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China

Computer and Automation


image captioning; continual learning; Transformer; fusion feature; dynamic Token; regularization

Computer Engineering and Applications (《计算机工程与应用》), 2025(4)

Pages 176-191 (16 pages)

Supported by the Youth Program of the National Natural Science Foundation of China (62202331) and the Graduate Research and Innovation Practice Project of Tianjin University of Technology (YJ2246).

DOI: 10.3778/j.issn.1002-8331.2309-0403
