This paper comprehensively explores the impulsive on-orbit inspection game problem using reinforcement learning and game training methods. The purpose of the spacecraft is to inspect the entire surface of a non-cooperative target with active maneuverability under front lighting. First, the impulsive orbital game problem is formulated as a turn-based sequential game. Second, several typical relative orbit transfers are encapsulated into modules to construct a parameterized action space containing discrete modules and continuous parameters, and the multi-pass deep Q-network (MPDQN) algorithm is used to implement autonomous decision-making. Then, a curriculum learning method is used to gradually increase the difficulty of the training scenario, and a backtracking proportional self-play training framework builds a pool of opponents to enhance the agent's ability to defeat inconsistent strategies. The behavior variations of the agents during training indicate that the intelligent game system gradually evolves towards an equilibrium situation, and the restraint relations between the agents show that they steadily improve their strategies. Finally, the influence of various factors on the game results is tested.
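As an illustration of the kind of parameterized action space described above, the sketch below shows one way a multi-pass deep Q-network can score discrete orbit-transfer modules together with their continuous parameters. The state dimension, number of modules, parameter sizes, and network widths are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch of multi-pass Q-value evaluation for a parameterized
# action space (discrete orbit-transfer modules + continuous parameters).
# All dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_MODULES, PARAM_DIM = 12, 4, 2   # assumed sizes

class ParamNet(nn.Module):
    """Predicts the continuous parameters of every discrete module."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, NUM_MODULES * PARAM_DIM))
    def forward(self, s):
        return self.net(s)

class QNet(nn.Module):
    """Scores (state, parameter vector) pairs for all discrete modules."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + NUM_MODULES * PARAM_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, NUM_MODULES))
    def forward(self, s, params):
        return self.net(torch.cat([s, params], dim=-1))

def multi_pass_q(state, param_net, q_net):
    """One multi-pass evaluation: module k only sees its own parameters."""
    params = param_net(state)                            # (B, K*P)
    q_values = torch.zeros(state.shape[0], NUM_MODULES)
    for k in range(NUM_MODULES):
        masked = torch.zeros_like(params)
        sl = slice(k * PARAM_DIM, (k + 1) * PARAM_DIM)
        masked[:, sl] = params[:, sl]                    # zero out other modules
        q_values[:, k] = q_net(state, masked)[:, k]      # keep only Q for module k
    return q_values, params

if __name__ == "__main__":
    s = torch.randn(1, STATE_DIM)
    q, p = multi_pass_q(s, ParamNet(), QNet())
    best = q.argmax(dim=-1)                              # chosen transfer module
    print(best.item(), p.view(NUM_MODULES, PARAM_DIM)[best].tolist())
```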
To address the problem of manoeuvring decision-making in UAV air combat, this study establishes a one-to-one air combat model, defines missile attack areas, and uses the stochastic-policy Soft Actor-Critic (SAC) algorithm from deep reinforcement learning to construct a decision model that realizes the manoeuvring process. The complexity of the proposed algorithm is calculated, and the stability of the closed-loop air combat decision-making system controlled by the neural network is analysed with a Lyapunov function. The study formulates the UAV air combat process as a game and proposes a Parallel Self-Play training SAC algorithm (PSP-SAC) to improve the generalisation performance of UAV control decisions. Simulation results show that the proposed algorithm realizes sample sharing and policy sharing across multiple combat environments and significantly improves the generalisation ability of the model compared to independent training.
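The sample-sharing and policy-sharing idea behind the parallel self-play training can be sketched as several combat environments feeding one replay buffer while a single shared policy is updated from the pooled experience. The environment and policy interfaces below (reset, step, act, update) are hypothetical placeholders, not the paper's PSP-SAC implementation.

```python
# Sketch of sample sharing and policy sharing across parallel self-play
# environments; env/policy interfaces are placeholders for illustration.
import random
from collections import deque

class SharedReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):          # (state, action, reward, next_state, done)
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def collect_episode(env, policy, buffer):
    """Both sides of the engagement are driven by the same shared policy."""
    state, done = env.reset(), False
    while not done:
        action = policy.act(state)
        next_state, reward, done = env.step(action)
        buffer.push((state, action, reward, next_state, done))
        state = next_state

def train(envs, policy, steps=1000, batch_size=256):
    buffer = SharedReplayBuffer()
    for _ in range(steps):
        for env in envs:                 # parallel in spirit; sequential here
            collect_episode(env, policy, buffer)
        batch = buffer.sample(batch_size)
        policy.update(batch)             # e.g. one SAC gradient step
```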
A promising approach to learning to play board games is to use reinforcement learning algorithms that can learn a game position evaluation function. In this paper we examine and compare three different methods for generating training games: 1) learning by self-play, 2) learning by playing against an expert program, and 3) learning from viewing experts play against each other. Although the third option generates high-quality games from the start, compared to the initial random games generated by self-play, its drawback is that the learning program is never allowed to test the moves it prefers. Since our expert program uses an evaluation function similar to that of the learning program, we also examine whether it is helpful to learn directly from the board evaluations given by the expert. We compare these methods using temporal difference learning with neural networks on the game of backgammon.
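A minimal sketch of the temporal-difference update underlying such a learned position-evaluation function is shown below, assuming a TD-Gammon-style feature encoding; the network sizes, learning rate, and TD(0) form are illustrative choices rather than the authors' exact settings.

```python
# Sketch of a TD(0) update for a neural position-evaluation function.
# The 198-unit input encoding mirrors TD-Gammon-style raw-board features
# (an assumption); positions are expected as (1, 198) tensors.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(198, 80), nn.Sigmoid(),
                          nn.Linear(80, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.1)

def td0_update(position, next_position, reward, terminal):
    """Move V(position) toward the bootstrapped target from the next position."""
    v = value_net(position)
    with torch.no_grad():
        target = torch.tensor([[float(reward)]]) if terminal else value_net(next_position)
    loss = (target - v).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```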
Solving the optimization problem of approaching a Nash equilibrium plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm that combines Monte Carlo tree search with NFSP, called MC-NFSP, to improve performance in real-time zero-sum imperfect-information games. Through experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash equilibrium in games with large search depth, whereas NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP), which uses an asynchronous, parallel architecture to collect game experience and improve both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and first-person shooter (FPS) games demonstrate the effectiveness of our algorithms.
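The core of NFSP can be sketched as an agent holding two memories: an RL buffer that trains a best-response policy and a supervised buffer that records the agent's own greedy actions to fit an average policy. The sketch below assumes generic best_response and average_policy objects, uses a plain list where NFSP proper uses reservoir sampling, and omits the Monte Carlo tree search component added by MC-NFSP.

```python
# Sketch of the two-memory structure used by Neural Fictitious Self-Play.
# Policy objects and their act/update/fit methods are placeholders.
import random
from collections import deque

ETA = 0.1                                    # anticipatory parameter (assumed value)

class NFSPAgent:
    def __init__(self, best_response, average_policy):
        self.best_response = best_response   # e.g. a DQN-style policy
        self.average_policy = average_policy # supervised average-policy network
        self.rl_memory = deque(maxlen=200_000)
        self.sl_memory = []                  # stands in for a reservoir buffer

    def act(self, state):
        if random.random() < ETA:
            action = self.best_response.act(state)
            self.sl_memory.append((state, action))   # record own greedy behaviour
        else:
            action = self.average_policy.act(state)
        return action

    def observe(self, transition):
        self.rl_memory.append(transition)

    def learn(self):
        self.best_response.update(self.rl_memory)    # off-policy RL step
        self.average_policy.fit(self.sl_memory)      # imitate own best responses
```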
The high maneuverability of modern fighters in close air combat imposes significant cognitive demands on pilots, making rapid, accurate decision-making challenging. While reinforcement learning (RL) has shown promise in this domain, existing methods often lack strategic depth and generalization in complex, high-dimensional environments. To address these limitations, this paper proposes an optimized self-play method enhanced by advances in fighter modeling, neural network design, and algorithmic frameworks. The study employs a six-degree-of-freedom (6-DOF) F-16 fighter model based on open-source aerodynamic data, featuring airborne equipment and a realistic visual simulation platform, unlike traditional 3-DOF models. To capture temporal dynamics, Long Short-Term Memory (LSTM) layers are integrated into the neural network, complemented by delayed input stacking. The RL environment incorporates expert strategies, curiosity-driven rewards, and curriculum learning to improve adaptability and strategic decision-making. Experimental results demonstrate that the proposed approach achieves a winning rate exceeding 90% against classical single-agent methods. Additionally, using an enhanced 3D visual platform, we conducted human-agent confrontation experiments in which the agent attained an average winning rate of over 75%. The agent's maneuver trajectories closely align with human pilot strategies, showing its potential for decision-making and pilot training applications. This study highlights the effectiveness of integrating advanced modeling and self-play techniques to develop robust air combat decision-making systems.
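One simple way to realize delayed input stacking for an LSTM-based policy is to keep a sliding window of recent observations and feed the whole window to the recurrent network at every decision step, as sketched below; the observation size, stack length, and layer widths are assumptions rather than the paper's settings.

```python
# Sketch of delayed input stacking feeding an LSTM policy head.
# All sizes are illustrative assumptions.
from collections import deque
import torch
import torch.nn as nn

OBS_DIM, STACK_LEN, HIDDEN = 30, 8, 128       # assumed sizes

class RecurrentPolicy(nn.Module):
    def __init__(self, num_actions=9):
        super().__init__()
        self.lstm = nn.LSTM(OBS_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, num_actions)
    def forward(self, obs_seq):               # (batch, STACK_LEN, OBS_DIM)
        out, _ = self.lstm(obs_seq)
        return self.head(out[:, -1])          # act on the latest hidden state

history = deque([torch.zeros(OBS_DIM)] * STACK_LEN, maxlen=STACK_LEN)

def act(policy, observation):
    history.append(observation)               # newest frame replaces the oldest
    seq = torch.stack(list(history)).unsqueeze(0)
    return policy(seq).argmax(dim=-1).item()
```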
The UAV pursuit-evasion problem focuses on the efficient tracking and capture of evading targets using unmanned aerial vehicles (UAVs), which is pivotal in public safety applications, particularly in intrusion monitoring and interception scenarios. To address the challenges of data acquisition, real-world deployment, and the limited intelligence of existing algorithms in UAV pursuit-evasion tasks, we propose an innovative swarm-intelligence-based UAV pursuit-evasion control framework, "Boids Model-based DRL Approach for Pursuit and Escape" (Boids-PE), which synergizes the strengths of swarm intelligence from bio-inspired algorithms and deep reinforcement learning (DRL). The Boids model, which simulates collective behavior through three fundamental rules, separation, alignment, and cohesion, is adopted in our work. By integrating the Boids model with the Apollonian circles algorithm, significant improvements are achieved in capturing UAVs that use simple evasion strategies. To further enhance decision-making precision, we incorporate a DRL algorithm to facilitate more accurate strategic planning, and we leverage self-play training to continuously optimize the performance of the pursuit UAVs. For experimental evaluation, we designed both one-on-one and many-to-one pursuit-evasion scenarios, customizing the state space, action space, and reward function models for each scenario. Extensive simulations, supported by the PyBullet physics engine, validate the effectiveness of the proposed method. The overall results demonstrate that Boids-PE significantly enhances the efficiency and reliability of UAV pursuit-evasion tasks, providing a practical and robust solution for real-world UAV pursuit-evasion missions.
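The three Boids rules can be written as a short velocity update for the pursuit UAVs, as sketched below; the gains, neighbourhood radius, and time step are illustrative, and the Apollonian-circles and DRL layers of Boids-PE are not shown.

```python
# Sketch of the separation / alignment / cohesion velocity update from the
# Boids model, applied to a small swarm of pursuit UAVs in the plane.
import numpy as np

def boids_step(positions, velocities, radius=5.0,
               w_sep=1.5, w_ali=1.0, w_coh=1.0, dt=0.1):
    """positions, velocities: (N, 2) arrays for N pursuit UAVs."""
    new_vel = velocities.copy()
    for i in range(len(positions)):
        offsets = positions - positions[i]
        dists = np.linalg.norm(offsets, axis=1)
        neighbours = (dists > 0) & (dists < radius)
        if not neighbours.any():
            continue
        separation = -(offsets[neighbours] / dists[neighbours, None] ** 2).sum(axis=0)
        alignment = velocities[neighbours].mean(axis=0) - velocities[i]
        cohesion = offsets[neighbours].mean(axis=0)
        new_vel[i] += dt * (w_sep * separation + w_ali * alignment + w_coh * cohesion)
    return positions + dt * new_vel, new_vel
```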
Tibetan Jiu chess, recognized as a national intangible cultural heritage, is a complex game comprising two distinct phases: the layout phase and the battle phase. Improving the performance of deep reinforcement learning (DRL) models for Tibetan Jiu chess is challenging, especially under constrained hardware resources. To address this, we propose a two-stage model called JFA, which incorporates hierarchical neural networks and knowledge-guided techniques. The model includes two sub-models: a strategic layout model (SLM) for the layout phase and a hierarchical battle model (HBM) for the battle phase. Both sub-models use similar network structures and employ parallel Monte Carlo tree search (MCTS) for independent self-play training. HBM is structured as a hierarchical neural network, with the upper network selecting movement and jump-capturing actions and the lower network handling square-capturing actions. Auxiliary agents based on human knowledge are introduced to assist SLM and HBM, simulating the entire game and providing reward signals based on square captures or victory outcomes. Additionally, within HBM, we propose two human-knowledge-based pruning methods that prune the parallel MCTS and the capture actions in the lower network. In experiments against a layout model trained with the AlphaZero method, SLM achieves a 74% win rate while reducing decision-making time to roughly 1/147 of that required by the AlphaZero model. SLM also won first place at the 2024 China National Computer Game Tournament. HBM achieves a 70% win rate against other Tibetan Jiu chess models. Used together, SLM and HBM in JFA achieve an 81% win rate, comparable to the level of a human amateur 4-dan player. These results demonstrate that JFA effectively enhances artificial intelligence (AI) performance in Tibetan Jiu chess.
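The hierarchical split between the upper and lower battle networks, together with a knowledge-guided filter, can be sketched as follows; the board interface, policy objects, and the pruning rule are hypothetical placeholders rather than the JFA networks or the paper's actual pruning criteria.

```python
# Sketch of a hierarchical action split with a knowledge-based pruning step:
# the lower policy resolves square-capture actions when any exist, otherwise
# the upper policy chooses a movement or jump-capture action.
# board and policy interfaces are placeholders for illustration.
def knowledge_prune(candidate_actions, board):
    """Drop actions that a hand-written rule marks as clearly bad (placeholder rule)."""
    return [a for a in candidate_actions if not board.loses_material(a)]

def select_action(board, upper_policy, lower_policy):
    moves = knowledge_prune(board.movement_and_jump_actions(), board)
    squares = knowledge_prune(board.square_capture_actions(), board)
    if squares:                                    # lower network handles square captures
        return lower_policy.best(board, squares)
    return upper_policy.best(board, moves)         # otherwise move or jump-capture
```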
With the breakthrough of AlphaGo, human-computer gaming AI has experienced explosive growth, attracting more and more researchers all over the world. As a recognized benchmark for testing artificial intelligence, various human-computer gaming AI systems (AIs) have been developed, such as Libratus, OpenAI Five, and AlphaStar, which beat professional human players. The rapid development of human-computer gaming AIs marks a major step for decision-making intelligence, and it seems that current techniques can handle very complex human-computer games. A natural question therefore arises: what are the remaining challenges for current techniques in human-computer gaming, and what are the future trends? To answer this question, we survey recent successful game AIs, covering board game AIs, card game AIs, first-person shooter game AIs, and real-time strategy game AIs. Through this survey, we 1) compare the main difficulties among different kinds of games and the corresponding techniques used to achieve professional human-level AIs; 2) summarize the mainstream frameworks and techniques that can be relied on for developing AIs for complex human-computer games; 3) raise the challenges or drawbacks of current techniques in the successful AIs; and 4) point out future trends in human-computer gaming AIs. Finally, we hope that this brief review can provide an introduction for beginners and inspire insights for researchers in the field of AI in human-computer gaming.
With the breakthrough of AlphaGo, deep reinforcement learning has become a recognized technique for solving sequential decision-making problems. Despite its reputation, the data inefficiency caused by its trial-and-error learning mechanism makes deep reinforcement learning difficult to apply in a wide range of areas. Many methods have been developed for sample-efficient deep reinforcement learning, such as environment modelling, experience transfer, and distributed modifications, among which distributed deep reinforcement learning has shown its potential in various applications, such as human-computer gaming and intelligent transportation. In this paper, we summarize the state of this exciting field by comparing classical distributed deep reinforcement learning methods and studying the components needed for efficient distributed learning, covering settings from single-player single-agent distributed deep reinforcement learning to the most complex multi-player multi-agent case. Furthermore, we review recently released toolboxes that help realize distributed deep reinforcement learning without many modifications to their non-distributed versions. After analysing their strengths and weaknesses, we develop and release a multi-player multi-agent distributed deep reinforcement learning toolbox, which is further validated on Wargame, a complex environment, demonstrating its usability for multi-player multi-agent distributed deep reinforcement learning in complex games. Finally, we point out challenges and future trends, hoping that this brief review can provide a guide or a spark for researchers interested in distributed deep reinforcement learning.
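Most distributed deep reinforcement learning systems of the kind surveyed here build on some variant of the actor-learner pattern, which can be sketched as follows; the environment and policy factories and their act/update/state/load_state methods are hypothetical placeholders, and real toolboxes add batching, prioritization, and fault tolerance on top of this skeleton.

```python
# Skeleton of the actor-learner pattern: actor processes roll out the current
# policy and push transitions to a shared queue, while one learner consumes
# them and broadcasts updated parameters. env_fn/policy_fn are placeholders.
import multiprocessing as mp

def actor(rollout_queue, param_conn, env_fn, policy_fn):
    env, policy = env_fn(), policy_fn()
    while True:
        if param_conn.poll():
            policy.load_state(param_conn.recv())     # pick up the newest weights
        state, done = env.reset(), False
        while not done:
            action = policy.act(state)
            next_state, reward, done = env.step(action)
            rollout_queue.put((state, action, reward, next_state, done))
            state = next_state

def learner(rollout_queue, param_conns, policy_fn, batch_size=256):
    policy, batch = policy_fn(), []
    while True:
        batch.append(rollout_queue.get())
        if len(batch) >= batch_size:
            policy.update(batch)                     # one gradient step
            for conn in param_conns:
                conn.send(policy.state())            # broadcast new weights
            batch.clear()

def launch(num_actors, env_fn, policy_fn):
    queue, conns, procs = mp.Queue(maxsize=10_000), [], []
    for _ in range(num_actors):
        parent, child = mp.Pipe()
        conns.append(parent)
        procs.append(mp.Process(target=actor, args=(queue, child, env_fn, policy_fn)))
    procs.append(mp.Process(target=learner, args=(queue, conns, policy_fn)))
    for p in procs:
        p.start()
    return procs
```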
Funding (UAV air combat manoeuvring decision study): National Natural Science Foundation of China (Grant No. 62003267); Fundamental Research Funds for the Central Universities (Grant No. G2022KY0602); Technology on Electromagnetic Space Operations and Applications Laboratory (Grant No. 2022ZX0090); Key Core Technology Research Plan of Xi'an (Grant No. 21RGZN0016).
Funding (Neural Fictitious Self-Play study): National Key Research and Development Program of China (Grant No. 2017YFB1002503); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (Grant No. 2018AAA0100902), China.
Funding (close air combat self-play study): co-supported by the National Natural Science Foundation of China (Grant No. 91852115).
Funding (Tibetan Jiu chess study): supported by the National Natural Science Foundation of China (Grant Nos. 62276285 and 62236011).
Funding (human-computer gaming AI survey): National Natural Science Foundation of China (Grant No. 61906197).
Funding (distributed deep reinforcement learning survey): supported by the Open Fund/Postdoctoral Fund of the Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China (Grant No. CASIA-KFKTXDA27040809).