Computer Technology and Development, 2026, Vol. 36, Issue (3): 1-10, 10. DOI: 10.20165/j.cnki.ISSN1673-629X.2025.0256
A Survey of Visual-Language Models
Abstract
In recent years, with the rapid development of multimodal learning, Visual-Language Models (VLMs) have demonstrated significant performance advantages in cross-modal tasks such as image captioning and visual question answering. By combining visual and linguistic information and pretraining on large-scale image-text pairs available from the internet, VLMs have become a research hotspot in the field. However, systematic reviews of VLMs remain scarce, especially reviews that include performance comparisons and analysis and that cover the end-to-end training process comprehensively. Therefore, we provide a comprehensive overview of the latest advancements in VLMs as of 2025, covering: classification and discussion of methods for processing raw text and image features; classification and review of mainstream modal interaction strategies; review and discussion of classic and cutting-edge model architectures; a systematic summary of popular VLMs; and benchmarking and discussion of current transfer learning methods in terms of performance and domain generalization. Three future research directions are also proposed.
Keywords
visual-language models / image-text pretraining / visual-language learning / multimodal / transfer learning
Classification
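Information Technology and Security Science
Cite this article
马翌硕, 张光南, 刘亚婷, 闫迪, 陈冬, 刘星愿, 郭帅. A Survey of Visual-Language Models [J]. Computer Technology and Development, 2026, 36(3): 1-10, 10.
Funding
Shaanxi Province Key Research and Development Program Project (2024GX-YBXM-104)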