Journal of South China University of Technology (Natural Science Edition), 2026, Vol. 54, Issue 2: 1-15, 15. DOI: 10.12141/j.issn.1000-565X.250152
基于多模态场景记忆与指令提示的目标导航方法
Target Navigation Method Based on Multimodal Scene Memory and Instruction Prompting
Abstract
Target navigation requires a robot to autonomously plan a path and accurately reach a specified target location in a working environment, based on a natural language instruction or an object category. Existing approaches to this task fall primarily into two categories: end-to-end learning methods and planning-based methods. While end-to-end methods can directly learn a mapping from perception to action, they often exhibit limited generalization capability and poor interpretability. Conversely, planning-based methods offer better generalization and interpretability to some extent; however, they are often not optimized for known environments, fail to exploit prompt information embedded in natural language instructions, struggle to achieve precise docking at a specified distance from the target, and generally suffer from low execution efficiency. To overcome these limitations, this paper proposes a novel target navigation method named MEMO-Nav, which leverages multimodal scene memory and instruction prompting to improve navigation performance in known environments. The proposed framework adopts a hierarchical architecture. A high-level planning layer maintains a multimodal scene memory to record environmental information and uses a Large Language Model (LLM) to parse target and prompt information from natural language instructions; the memory and the parsed information are then combined to enable efficient waypoint selection and navigation planning. A low-level execution layer handles fundamental navigation functions, including robot localization and movement, and integrates an object detection model with a depth camera to achieve accurate target positioning. Together, these two layers form a complete target navigation system that enables the robot to locate the target and dock at a specified distance from it based on natural language instructions. Extensive experiments conducted on the GAZEBO simulation platform and in real-world settings demonstrate that the proposed method significantly outperforms existing approaches in known environments across key metrics, including navigation efficiency, success rate, and docking distance accuracy. In summary, the proposed method offers a feasible, efficient, interpretable, and precise solution for mobile robot target navigation in practical scenarios.
Keywords
mobile robot / target navigation / path planning / large language model / multimodal
Classification
Information technology and security science
Cite this article
DONG Min (董敏), LAI Youcheng (赖酉城), BI Sheng (毕盛). Target Navigation Method Based on Multimodal Scene Memory and Instruction Prompting [J]. Journal of South China University of Technology (Natural Science Edition), 2026, 54(2): 1-15, 15.
Funding
Supported by the Natural Science Foundation of Guangdong Province (2022B1515020015)