Abstract: Aiming at the terminal defense problem of aircraft, this paper proposes a method that simultaneously achieves terminal defense and seizes the dominant position. The method employs a λ-return based reinforcement learning algorithm, which can be applied to a flight assistance decision-making system to improve the pilot's survivability. First, we model the environment to simulate the interaction between air-to-air missiles and aircraft. Subsequently, we propose a λ-return based approach to improve the deep Q-network (DQN), advantage actor-critic (A2C), and proximal policy optimization (PPO) algorithms used to train manoeuvre strategies. The method employs an action space containing nine manoeuvres and defines the off-target distance at the end of the episode as a sparse reward for algorithm training. Simulation results show that the convergence speed of all three improved algorithms increases significantly when the λ-return method is used. Moreover, the effect of the value of λ on the convergence speed is verified by ablation experiments. To address illegal actions arising during training, we also design a backtracking-based illegal-action masking mechanism, which improves the data generation efficiency of the environment model and promotes effective algorithm training.
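The abstract does not include code, so the following is only a minimal sketch of the standard λ-return idea the method builds on, not the paper's implementation. It computes λ-returns backward over one finished episode from rewards and critic value estimates; the helper name compute_lambda_returns, the toy episode, and the all-zero value estimates are illustrative assumptions. The toy rewards mimic the sparse setting described above, where only the end-of-episode off-target distance yields a nonzero reward.

```python
import numpy as np

def compute_lambda_returns(rewards, values, gamma=0.99, lam=0.9):
    """Backward-recursive lambda-returns for one finished episode (sketch).

    rewards: r_1..r_T received after each step.
    values:  V(s_1)..V(s_T) plus a final bootstrap value (0 for a terminal state).
    Uses the recursion G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = values[T]  # bootstrap from the state reached after the last step
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * g)
        returns[t] = g
    return returns

# Hypothetical sparse-reward episode: zero reward everywhere except the final step.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.zeros(len(rewards) + 1)  # crude critic estimates, all zero here
print(compute_lambda_returns(rewards, values, gamma=0.99, lam=0.9))
```

With λ between 0 and 1, the final sparse reward is propagated to earlier steps more strongly than with a one-step target (λ = 0), which is consistent with the reported faster convergence of the λ-return variants of DQN, A2C, and PPO.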