Robot (机器人), 2025, Vol. 47, Issue 3: 416-426, 11. DOI: 10.13973/j.cnki.robot.240283
Multimodal Drone Cross-view Geo-localization Based on Vision-language Model
Abstract
Cross-view geo-localization for drones achieves autonomous positioning in satellite-denied conditions by matching onboard images with geo-referenced images, and its primary challenge lies in the significant appearance differences across views. Existing methods predominantly focus on local feature extraction while lacking in-depth exploration of contextual correlations and global semantics. To address this problem, a multimodal drone cross-view geo-localization framework based on a vision-language model is proposed in this paper. Leveraging the CLIP (contrastive language-image pre-training) model, a view text description generation module is constructed, which utilizes image-level visual concepts learned from large-scale datasets as external knowledge to guide the feature extraction process. A hybrid vision transformer (ViT) architecture is adopted as the backbone network, enabling the model to capture local features and global contextual characteristics simultaneously during image feature extraction. Furthermore, a mutual learning loss supervised by logit score-normalized Kullback-Leibler (KL) divergence is introduced to optimize the training process and to enhance the model's ability to learn inter-view correlations. Experimental results demonstrate that, under the guidance of the text descriptions generated by the CLIP model, the proposed model learns deep semantic information more effectively, thereby better addressing challenges such as viewpoint variations and temporal discrepancies in cross-view geo-localization.
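As a rough illustration of the view text description idea, the sketch below scores a fixed bank of scene concepts against an input image with CLIP and returns the best-matching concepts as external textual guidance. This is a minimal sketch based only on the abstract: the concept list, the top-k selection, and the function name `describe_view` are illustrative assumptions, not the paper's actual module design.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

# Hypothetical concept bank; the paper's actual vocabulary is not given here.
CONCEPTS = ["a parking lot", "a playground", "a crossroads", "a building roof"]

@torch.no_grad()
def describe_view(image, model, preprocess, device="cuda", top_k=3):
    """Score a fixed concept bank with CLIP and return the top-k concepts
    (and their embeddings) as an external description of the input view."""
    tokens = clip.tokenize(CONCEPTS).to(device)
    image_in = preprocess(image).unsqueeze(0).to(device)  # image: a PIL image

    image_feat = model.encode_image(image_in)
    text_feat = model.encode_text(tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    scores = (image_feat @ text_feat.t()).squeeze(0)  # cosine similarities
    top = scores.topk(top_k).indices
    # Labels for inspection plus embeddings usable as guidance features.
    return [CONCEPTS[i] for i in top], text_feat[top]

# Usage: model, preprocess = clip.load("ViT-B/32", device="cuda")
```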
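The mutual learning loss can likewise be sketched. Assuming "logit score normalization" means per-sample standardization of each view's classification logits before comparing their softened distributions (one plausible reading of the abstract, not a confirmed detail of the paper), a symmetric KL formulation might look like this:

```python
import torch
import torch.nn.functional as F

def normalized_kl_mutual_loss(drone_logits, sat_logits, tau=1.0, eps=1e-6):
    """Symmetric KL mutual-learning loss on score-normalized logits.

    Hypothetical sketch: each view's logits are standardized (zero mean,
    unit variance per sample) before softening, so the drone and satellite
    branches supervise each other in a scale-invariant way.
    """
    def norm(z):
        return (z - z.mean(dim=-1, keepdim=True)) / (z.std(dim=-1, keepdim=True) + eps)

    p = F.log_softmax(norm(drone_logits) / tau, dim=-1)
    q = F.log_softmax(norm(sat_logits) / tau, dim=-1)
    # F.kl_div(log_probs, probs) computes KL(probs || log_probs);
    # averaging both directions makes the supervision mutual.
    kl_pq = F.kl_div(p, q.exp(), reduction="batchmean")
    kl_qp = F.kl_div(q, p.exp(), reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp) * tau ** 2  # tau**2 keeps gradient scale stable
```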
Keywords: cross-view geo-localization / vision-language model / multimodal / image matching / drone
Citation: CHEN Peng, CHEN Xu, LUO Wen, LIN Bin. Multimodal Drone Cross-view Geo-localization Based on Vision-language Model [J]. Robot, 2025, 47(3): 416-426, 11.
Funding: National Natural Science Foundation of China (U20A20201).