Application Research of Computers (计算机应用研究), 2025, Vol. 42, Issue 6: 1648-1655, 8. DOI: 10.19734/j.issn.1001-3695.2024.10.0446
ProgCoPL: Progressive Co-Prompting Learning for Vision-Language Models
Abstract
The large-scale pre-trained vision-language model CLIP aligns images and texts in a shared semantic space, demonstrating robust generalization capabilities across diverse downstream tasks. However, existing prompt learning methods often independently insert learnable prompt vectors into each layer of CLIP's visual and text encoders. This approach results in limited cross-modal interaction, with independent prompts across layers failing to effectively guide the encoders in capturing task-relevant information. To address these issues, this paper proposed ProgCoPL. This method introduced text-guided prompt vectors into the visual encoder layers and vision-guided prompt vectors into the text encoder layers, thereby enhancing cross-modal interaction and alignment. Furthermore, ProgCoPL incorporated information transmission channels between prompt vectors across layers, enabling hierarchical and progressive integration of task-specific information. Experiments on 11 datasets show that ProgCoPL efficiently adapts CLIP to downstream tasks, significantly improving its cross-dataset generalization ability. ProgCoPL outperforms existing methods in multiple generalization tests, particularly achieving notable advancements in cross-dataset scenarios.
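The abstract describes two mechanisms: cross-modally guided prompts (prompts for one encoder generated from the other modality's prompts) and transmission channels that pass prompt information between adjacent layers. The following PyTorch sketch is a minimal illustration under our own assumptions, not the authors' implementation: the module names, dimensions, and the linear-projection form of the guidance and carry channels are hypothetical, and only the text-to-vision direction is shown (the paper's vision-to-text path would be the mirror image).

```python
# Minimal sketch (assumptions, not the paper's code) of:
# (a) cross-modal guidance: vision-layer prompts are projected from the
#     same layer's text prompts, and
# (b) a cross-layer transmission channel: each layer's prompts receive a
#     residual, transformed copy of the previous layer's prompts.
import torch
import torch.nn as nn

class CrossModalPrompts(nn.Module):
    def __init__(self, depth=12, n_prompts=4, d_text=512, d_vision=768):
        super().__init__()
        # One set of learnable text prompts per encoder layer.
        self.text_prompts = nn.Parameter(
            torch.randn(depth, n_prompts, d_text) * 0.02)
        # Text -> vision projections (cross-modal guidance per layer).
        self.t2v = nn.ModuleList(
            nn.Linear(d_text, d_vision) for _ in range(depth))
        # Layer-to-layer transmission channel (progressive integration).
        self.carry = nn.ModuleList(
            nn.Linear(d_text, d_text) for _ in range(depth - 1))

    def forward(self):
        text_per_layer, vision_per_layer = [], []
        prev = None
        for l in range(self.text_prompts.shape[0]):
            p = self.text_prompts[l]
            if prev is not None:
                # Residual information flow from the previous layer's prompts.
                p = p + self.carry[l - 1](prev)
            text_per_layer.append(p)
            # Vision prompts for layer l are guided by its text prompts.
            vision_per_layer.append(self.t2v[l](p))
            prev = p
        return text_per_layer, vision_per_layer

prompts = CrossModalPrompts()
text_p, vision_p = prompts()
print(text_p[0].shape, vision_p[0].shape)  # (4, 512) and (4, 768)
```

In an actual adaptation of CLIP, the per-layer prompt tensors returned here would be concatenated to the token sequences of the frozen text and vision Transformer layers; only the prompt parameters and projections would be trained.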
Keywords
multimodal / prompt learning / vision-language model / Transformer encoder
Classification
Computer and Automation
Citation
陶俊杰, 张卫锋, 王玉霞, 缪翌, 徐领. ProgCoPL: progressive co-prompting learning for vision-language models [J]. Application Research of Computers (计算机应用研究), 2025, 42(6): 1648-1655, 8.
Funding
China Postdoctoral Science Foundation (2022M720569)
Zhejiang Provincial Natural Science Foundation (LQ21F020022)