河北工业科技, 2025, Vol. 42, Issue 4: 314-322, 339. DOI: 10.7535/hbgykj.2025yx04002
基于多模态预训练大模型和细粒度特征增强的图像中文描述
Chinese image caption based on multimodal pre-trained network and fine-grained feature enhancement
Abstract
To tackle the poor performance and weak semantic alignment of existing image caption models on Chinese datasets, a Chinese image caption model based on a multimodal pre-trained large model and fine-grained feature enhancement was proposed. Firstly, the image encoder of CLIP was used to extract image features. Secondly, a multimodal mapping network based on the triple channel-mixing multilayer perceptron (TCM-MLP) module was employed: after expanding the image feature matrix to three times its original size in the channel dimension, spatial displacement was performed on the feature matrix, and a segmented attention mechanism was then used to fuse the feature maps of the three branches. Finally, the GPT-2 model was used to generate the descriptive text word by word in an autoregressive manner. The results demonstrate that on the AIC-ICC dataset, the model achieves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE, and METEOR scores of 0.827, 0.747, 0.677, 0.605, 0.686, and 0.591, respectively; on the Flickr8k-CN dataset, the corresponding scores are 0.710, 0.546, 0.427, 0.325, 0.515, and 0.363, both reflecting superior performance. The model effectively aligns the visual and semantic spaces and generates rich, accurate Chinese captions. It offers a novel solution to the problem of cross-modal semantic alignment and holds significant theoretical and practical value for advancing image understanding and captioning tasks in Chinese contexts.
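The abstract outlines a CLIP encoder → TCM-MLP mapping network → GPT-2 decoder pipeline but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of how such a mapping network might look: the three-fold channel expansion, a branch-specific "spatial displacement" (approximated here by a channel roll), a segmented attention that weights the three branches, and a projection to GPT-2 prefix embeddings. All class names, dimensions, and the specific displacement and attention forms are assumptions for illustration, not the paper's actual implementation.

# Hypothetical sketch of the TCM-MLP mapping network described in the abstract.
# All names, sizes, and design details are illustrative assumptions.
import torch
import torch.nn as nn

class TCMMLP(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # Expand the image feature matrix to three times its size in the channel dimension.
        self.expand = nn.Linear(clip_dim, 3 * clip_dim)
        # Per-branch channel-mixing MLPs.
        self.branch_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(clip_dim, clip_dim), nn.GELU()) for _ in range(3)]
        )
        # Segmented attention: one weight per branch, derived from the input feature.
        self.branch_attn = nn.Linear(clip_dim, 3)
        # Map the fused feature to a sequence of GPT-2 prefix embeddings.
        self.to_prefix = nn.Linear(clip_dim, prefix_len * gpt_dim)

    def forward(self, clip_feat):                                   # clip_feat: (B, clip_dim)
        b1, b2, b3 = self.expand(clip_feat).chunk(3, dim=-1)        # three branches
        branches = []
        for i, x in enumerate((b1, b2, b3)):
            # Stand-in for "spatial displacement": roll channels by a branch-specific offset.
            x = torch.roll(x, shifts=i, dims=-1)
            branches.append(self.branch_mlps[i](x))
        stacked = torch.stack(branches, dim=1)                      # (B, 3, clip_dim)
        weights = self.branch_attn(clip_feat).softmax(dim=-1)       # (B, 3)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)        # (B, clip_dim)
        prefix = self.to_prefix(fused)                              # (B, prefix_len * gpt_dim)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

In the full pipeline described by the abstract, clip_feat would come from CLIP's image encoder, and the returned prefix embeddings would be passed to GPT-2 as input embeddings so that the caption is generated token by token in an autoregressive manner.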
Key words
natural language processing / Chinese image caption / TCM-MLP multimodal mapping network / encoding and decoding / pre-trained large model
Classification
Information technology and security science
Citation
马雯悦, 王恒友, 何强, 曾宪佑. 基于多模态预训练大模型和细粒度特征增强的图像中文描述 (Chinese image caption based on multimodal pre-trained network and fine-grained feature enhancement)[J]. 河北工业科技, 2025, 42(4): 314-322, 339.
Funding
National Natural Science Foundation of China (62072024)