工程科学与技术 (Advanced Engineering Sciences), 2026, Vol. 58, Issue 2: 69-83, 15. DOI: 10.12454/j.jsuese.202400419
Hierarchical Deep Reinforcement Learning Algorithm for Air Combat Based on Time Scale Separation Theory
Abstract
Objective Six-degree-of-freedom (6-DoF) unmanned aerial vehicle (UAV) air combat scenarios present substantial challenges for strategy learning when reinforcement learning methods are applied. These challenges stem from high-dimensional state spaces, continuously coupled action domains, and strongly nonlinear flight dynamics. Conventional end-to-end deep reinforcement learning (DRL) approaches struggle to achieve rapid convergence, to identify effective maneuver strategies, and to generalize learned policies beyond narrowly constrained conditions. In addition, reward functions often rely on handcrafted rules derived from human expertise, which do not ensure that higher reward values correspond to genuinely effective combat strategies. This study addresses these limitations by introducing a hierarchical framework based on time scale separation theory. The proposed framework employs a two-stage training procedure that accounts for differences in how flight parameters influence state variables across multiple time scales, improving learning efficiency, enhancing strategy quality, and increasing generalization capability in complex and diverse combat environments.

Methods A novel algorithm, termed TTS‒PPO, was developed; the name denotes a Two-stage training framework leveraging Time-Scale separation within Proximal Policy Optimization. The method partitioned the 6-DoF UAV air combat decision-making process into short-cycle and long-cycle segments, reflecting differences in how control inputs influence state variables across distinct time scales, and a time-division framework was established on this basis. The short-cycle component addressed rapid rotational and attitude adjustments. Instead of allowing the DRL procedure to directly manage these fine-grained actions, a proportional-integral-derivative (PID) controller was employed to output real-time joystick commands. This configuration allowed classical low-level stability and attitude control to be handled independently, which reduced the complexity encountered by the DRL policy at the higher strategic level. With low-level stability assured, the DRL agent focused on tactical and strategic decision-making. The long-cycle component used Proximal Policy Optimization (PPO) to manage trajectory planning and tactical maneuvers; the long-cycle PPO agent decoupled strategic decision-making from low-level actuation by issuing high-level commands that guided the PID-driven short-cycle layer. This hierarchical decomposition allowed learning to proceed more efficiently: the long-cycle agent faced a reduced problem space and concentrated on discovering effective combat strategies without being burdened by the complexities of rapid stabilization maneuvers. Time scale separation was further implemented within the state space. Environmental states were divided into long-cycle and short-cycle groups: the long-cycle states captured slowly evolving features such as relative positions, energy conditions, and global situational parameters, whereas the short-cycle states encompassed rapidly changing variables such as angular rates and orientation deviations. Aligning state variables with their corresponding time scales accelerated learning and improved policy robustness. A relative situation transformation module was introduced to refine and compress the state representation, ensuring that the agent received relevant information at appropriate decision intervals and minimizing computational complexity at each step.
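The control hierarchy described above can be summarized in a minimal sketch, assuming an inner PID loop at 50 Hz and one long-cycle PPO decision per second; the class names, gains, frequency ratio, and toy roll dynamics below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the time-scale-separated control hierarchy.
# All names, gains, and the 50:1 frequency ratio are assumptions.
import numpy as np

class PID:
    """Short-cycle attitude controller tracking a commanded angle."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Normalized joystick/control-surface command in [-1, 1].
        return float(np.clip(self.kp * error + self.ki * self.integral
                             + self.kd * derivative, -1.0, 1.0))

def long_cycle_policy(long_state):
    """Stand-in for the PPO policy: maps slowly varying situational
    features (relative position, energy state) to a high-level command,
    here a desired roll angle in radians."""
    return float(np.tanh(long_state.sum()) * np.pi / 4)

SHORT_DT = 0.02          # 50 Hz inner PID loop (assumed)
STEPS_PER_DECISION = 50  # one long-cycle decision per second (assumed)

roll_pid = PID(kp=2.0, ki=0.1, kd=0.3, dt=SHORT_DT)
roll, roll_cmd = 0.0, 0.0
long_state = np.zeros(8)  # stand-in for the compressed relative situation

for t in range(500):
    if t % STEPS_PER_DECISION == 0:
        # Long cycle: PPO issues a new attitude target from long-cycle states.
        roll_cmd = long_cycle_policy(long_state)
    # Short cycle: PID tracks the target and outputs the stick command.
    aileron = roll_pid.step(roll_cmd, roll)
    roll += 1.5 * aileron * SHORT_DT  # toy first-order roll dynamics
```

The design point reflected here is that the PPO policy only ever observes the slowly varying long-cycle state and emits attitude targets, while all high-rate stabilization remains inside the PID loop.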
A two-stage training strategy was employed. In the first stage, single-step rewards designed for specific subtasks, such as pursuit or strike, were introduced at a lower decision frequency to assist the agent during the initial "cold start" period. This incremental guidance supported the stabilization of fundamental behavioral patterns and facilitated the acquisition of essential tactical principles; during this phase, the agent overcame early-stage instability, which resulted in more reliable initial policies. In the second stage, single-step rewards were removed and only sparse terminal rewards were retained, while the decision frequency was increased. In the absence of frequent intermediate rewards, the policy emphasized long-term outcomes rather than short-term objectives, and the higher decision frequency enabled more refined tactical adjustments and encouraged the emergence of maneuvers that improved overall performance. The gradual transition from a guided, intermediate-reward regime to a sparse-reward, high-frequency regime allowed the policy to progress from basic stability toward advanced strategic competence. To evaluate the proposed hierarchical DRL algorithm founded on time scale separation theory, a simulation environment was constructed using an open-source F-16 UAV model coupled with the JSBSim flight dynamics engine. This configuration provided realistic 6-DoF conditions and supported one-on-one close-range air combat simulations. Ablation experiments were conducted to assess the contribution of individual components within the TTS‒PPO framework. One configuration trained the agent against a non-maneuvering opponent flying a linear trajectory, which served as a controlled baseline for determining whether the learned policy could scale from simple engagements to more complex combat scenarios.
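The staged reward design admits a compact sketch. The shaping terms, thresholds, stage-switch point, and decision periods below are illustrative assumptions; the paper specifies only that stage one adds dense subtask rewards at a low decision frequency and stage two retains only sparse terminal rewards at a higher frequency.

```python
# Hedged sketch of the two-stage reward schedule; all numeric values and
# names here are illustrative assumptions, not the paper's settings.

STAGE_SWITCH_EPISODE = 10_000  # assumed episode at which stage 2 begins

def decision_period_s(stage: int) -> float:
    """Stage 1 decides less often; stage 2 raises the decision frequency."""
    return 1.0 if stage == 1 else 0.2

def step_reward(stage: int, aspect_angle_deg: float,
                in_strike_envelope: bool, done: bool, won: bool) -> float:
    reward = 0.0
    if stage == 1:
        # Dense single-step guidance for pursuit/strike subtasks,
        # easing the "cold start" period.
        reward += 0.01 * max(0.0, 1.0 - aspect_angle_deg / 180.0)
        if in_strike_envelope:
            reward += 0.1
    # Sparse terminal reward: the only signal retained in stage 2.
    if done:
        reward += 1.0 if won else -1.0
    return reward
```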
Results and Discussions The results demonstrated that the TTS‒PPO approach, which incorporated hierarchical decomposition and time scale separation, achieved faster convergence and better final performance metrics than baseline end-to-end DRL methods lacking time scale separation or a two-stage training procedure. Assigning state variables to short-cycle and long-cycle categories, together with hierarchical action decomposition, significantly reduced overall problem complexity. Training convergence speed improved by approximately 67%, which reduced computational costs and enabled more frequent iterative policy refinements. With this enhanced efficiency, the DRL agent discovered more stable and effective combat strategies within fewer training episodes. Generalization performance was evaluated by testing agents trained under different variants of the approach across various initial conditions, velocities, and adversary tactics. Comparisons were conducted among three agent types: an agent trained with PPO on full-state inputs without time scale division (FS‒PPO), an agent using time scale-separated states with a single-stage training approach (TS‒PPO), and the two-stage time scale-separated TTS‒PPO. The TTS‒PPO agent outperformed both the FS‒PPO and TS‒PPO agents in pairwise confrontations, indicating that combining time scale separation with two-stage training not only enhanced learning speed but also enabled the agent to acquire more generalizable combat principles rather than narrowly optimizing for a specific scenario. Further validation involved testing the TTS‒PPO-trained agent against rule-based expert opponents, and the policy derived from TTS‒PPO successfully defeated these expert systems. Even when training was conducted exclusively against a simple linear adversary, the learned policy surpassed expert-level strategies, confirming that hierarchical time scale separation and the two-stage training design facilitated the development of adaptable policies with robust tactical proficiency. The ability to transfer from minimal training complexity to outperforming expert opponents highlighted the scalability and versatility of the learned strategies.
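The pairwise comparison protocol can be illustrated with a short sketch; run_engagement is a hypothetical stand-in (the stub below returns a random outcome rather than reproducing the reported results), and the episode count and initial-condition ranges are assumptions.

```python
# Illustrative round-robin comparison of the three ablation variants.
import itertools
import random

def run_engagement(agent_a: str, agent_b: str, init: dict) -> bool:
    """Hypothetical placeholder for a full 1v1 simulation; returns True
    if agent_a wins. A real harness would roll out both policies in the
    JSBSim environment from the given initial condition."""
    return random.random() < 0.5  # stub outcome, not actual results

agents = ["FS-PPO", "TS-PPO", "TTS-PPO"]
wins = {a: 0 for a in agents}
for a, b in itertools.permutations(agents, 2):
    for _ in range(100):  # randomized initial conditions per ordered pairing
        init = {"altitude_m": random.uniform(3000, 8000),
                "speed_mps": random.uniform(150, 300),
                "bearing_deg": random.uniform(0.0, 360.0)}
        if run_engagement(a, b, init):
            wins[a] += 1
print(wins)  # aggregate win counts per attacking agent
```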
Conclusions Accordingly, the hierarchical DRL algorithm, grounded in time scale separation theory and employing a two-stage training strategy, addresses significant challenges associated with applying DRL to 6-DoF UAV air combat tasks. By decomposing decision-making into short-cycle and long-cycle phases, introducing a PID-controlled low-level stabilization layer, and separating state variables according to their respective time scales, the method substantially improved training efficiency and enhanced both the robustness and generalization capability of the resulting policies. The hierarchical framework enabled the agent to focus on strategic maneuvers at the long-cycle level, while the short-cycle PID layer managed rapid stabilization tasks. Time scale-aware state representations and a staged training procedure guided the policy from basic stability to advanced tactical competence. The observed increases in convergence speed and the ability to handle a range of adversarial conditions highlight the value of applying time scale separation principles in challenging reinforcement learning domains. The TTS‒PPO framework can serve as a reference for other complex reinforcement learning problems characterized by distinct time scale dynamics, fostering more efficient, generalizable, and strategically effective decision-making in advanced autonomous systems.
Keywords: time-scale separation / proportional-integral-derivative (PID) / proximal policy optimization (PPO) / two-stage training / two-stage time-scale state-separated proximal policy optimization (TTS‒PPO)

Classification: Information Technology and Security Science
Citation: 谭泰, 江泰民, 黎博文, 李杰, 李辉, 化晨昊. 基于时间尺度分离理论的空战深度强化学习分层算法 (Hierarchical Deep Reinforcement Learning Algorithm for Air Combat Based on Time Scale Separation Theory) [J]. 工程科学与技术, 2026, 58(2): 69-83, 15.
Funding: National Natural Science Foundation of China, Joint Fund Project (U20A20161)