计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2026, Vol. 20, Issue 3: 760-772 (13 pages). DOI: 10.3778/j.issn.1673-9418.2505064
Cross-Modal Knowledge Introduction and Prompt Inference Framework for Remote Sensing Visual Question Answering
Original title: 面向遥感视觉问答的跨模态知识引入与提示推理框架
Abstract
With the rapid development of remote sensing technology, remote sensing visual question answering (RSVQA), an emerging technology that combines language and visual interaction, has improved both the efficiency of interpreting remote sensing imagery and the interactivity of applications in fields such as earth observation and environmental monitoring. However, RSVQA still faces three challenges: the high complexity of remote sensing image information, the lack of remote sensing image-text alignment data, and the diversity of question phrasing. To address these challenges, this paper proposes a cross-modal knowledge introduction and prompt inference framework (CMKIP) for RSVQA. First, to handle the high complexity of remote sensing images, CMKIP builds a learnable image feature adapter for the large language model LLaMA so that it can represent complex remote sensing images. Second, to address the scarcity of remote sensing image-text alignment data, an automated data generation pipeline is constructed that generates high-quality image-text pairs from publicly available remote sensing datasets, enabling efficient injection of remote sensing domain knowledge. Finally, to cope with the diversity of question phrasing, a collaborative inference mechanism between large and small models is proposed. This mechanism uses the small model to perform knowledge base retrieval and intermediate inference correction, improving the large language model's understanding of diverse questions and its reasoning accuracy. In addition, CMKIP supports flexible replacement of the small model according to task requirements and can be applied to multiple downstream tasks in the remote sensing field. Experimental results show that CMKIP performs significantly better than existing methods on the RSVQA benchmark dataset, especially in low-sample scenarios, demonstrating its effectiveness and generalization in RSVQA tasks.
Keywords
remote sensing visual question answering / large language model / cross-modal extension / remote sensing fine-tuning instruction set / lightweight model / prompt inference
Classification
Information Technology and Security Science
Cite this article
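Dong Xin (董欣), Yu Pengfei (俞鹏飞), Gu Jingjing (顾晶晶). Cross-modal knowledge introduction and prompt inference framework for remote sensing visual question answering[J]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2026, 20(3): 760-772, 13.
Funding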
This work was supported by the National Natural Science Foundation of China (62072235).