A multi-agent path planning algorithm based on game theory and reinforcement learning
This study addresses the slow speed and low efficiency common to path planning algorithms for multiple agents in a plane. The multi-agent path planning problem is formulated as a non-zero-sum stochastic game, and the multi-agent reinforcement learning algorithm win or learn fast-policy hill-climbing (WoLF-PHC) is used to obtain a Nash equilibrium strategy that yields conflict-free optimal path decisions for each agent. A fast adaptive WoLF-PHC (FA-WoLF-PHC) algorithm is proposed, which constructs an objective function and uses gradient descent to adaptively update the learning rate. FA-WoLF-PHC is applied in two game scenarios, a coin-guessing (matching pennies) game and a custom payoff matrix game, and is compared with the policy hill-climbing (PHC) and WoLF-PHC algorithms. The results show that FA-WoLF-PHC learns faster than WoLF-PHC and effectively reduces the oscillation that appears during learning with the WoLF-PHC and PHC algorithms. In the multi-agent path planning problem, FA-WoLF-PHC improves learning speed by 16.01% over WoLF-PHC. When the environmental grid map is enlarged to 6×6 and the number of agents is increased to 3, the average time needed by the FA-WoLF-PHC, WoLF-PSP, and physarum polycephalum-artificial potential field Sarsa (PP-AP Sarsa) algorithms to learn the final strategy over 10 experiments is 16.30, 20.59, and 17.72 s, respectively. In the multi-agent path planning problem, FA-WoLF-PHC obtains a Nash equilibrium strategy for each agent and learns significantly faster than the WoLF-PSP and PP-AP Sarsa algorithms. Overall, FA-WoLF-PHC quickly obtains Nash strategies in common game scenarios, generates conflict-free optimal paths for multiple agents in multi-agent path planning, and shows a significant improvement in learning speed over the other algorithms.
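For context on the base update rule the abstract refers to, the following is a minimal tabular sketch of a standard WoLF-PHC learner (policy hill-climbing with win-or-learn-fast learning rates). The class name WoLFPHCAgent, all hyperparameter values, and the epsilon-greedy exploration are illustrative assumptions rather than settings from the paper, and the FA-WoLF-PHC adaptive learning-rate update (gradient descent on an objective function) is not reproduced because its details are not given in this abstract.

```python
import numpy as np

class WoLFPHCAgent:
    """Minimal tabular WoLF-PHC learner (illustrative sketch, not the paper's code)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 delta_win=0.01, delta_lose=0.04, epsilon=0.1, seed=0):
        self.nA = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.delta_win, self.delta_lose = delta_win, delta_lose
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.Q = np.zeros((n_states, n_actions))
        self.pi = np.full((n_states, n_actions), 1.0 / n_actions)      # current mixed policy
        self.pi_avg = np.full((n_states, n_actions), 1.0 / n_actions)  # running average policy
        self.visits = np.zeros(n_states)                               # state visit counts

    def act(self, s):
        # epsilon-greedy sampling from the mixed policy keeps some exploration
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.nA))
        return int(self.rng.choice(self.nA, p=self.pi[s]))

    def update(self, s, a, r, s_next):
        # 1) ordinary Q-learning update of the action values
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

        # 2) running average of the policy in this state
        self.visits[s] += 1
        self.pi_avg[s] += (self.pi[s] - self.pi_avg[s]) / self.visits[s]

        # 3) "win or learn fast": small step while winning, large step while losing
        winning = self.pi[s] @ self.Q[s] > self.pi_avg[s] @ self.Q[s]
        delta = self.delta_win if winning else self.delta_lose

        # 4) hill-climb the policy toward the greedy action by at most delta,
        #    taking probability mass from the non-greedy actions
        best = int(self.Q[s].argmax())
        step = np.minimum(self.pi[s], delta / (self.nA - 1))
        step[best] = 0.0
        self.pi[s] -= step
        self.pi[s, best] += step.sum()
```

In FA-WoLF-PHC as described in the abstract, the fixed pair (delta_win, delta_lose) would instead be adapted during learning by gradient descent on an objective function; that mechanism is specific to the paper and is deliberately left out of this sketch.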
熊文博;郭磊;焦彤宇
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Computer and Automation
artificial intelligence; game theory; dynamic programming; Nash equilibrium strategy; reinforcement learning; multi-agent path planning
《深圳大学学报(理工版)》(Journal of Shenzhen University Science and Engineering), 2024 (003)
Pages 274-282 (9 pages)
National Natural Science Foundation of China (61105103)