The soft continuum arm is widely used in industrial production and daily life because of its superior safety and flexibility. Reinforcement learning is a powerful technique for soft arm continuous control problems, since it can learn an effective control policy without a known system model. However, it often suffers from high sample complexity and requires large amounts of training data, which limits its effectiveness in soft arm control. To overcome this issue, this paper proposes an improved policy gradient method, policy gradient integrating long- and short-term rewards (PGLS). The short-term rewards provide more dynamics-aware exploration directions for policy learning and improve the exploration efficiency of the algorithm. PGLS can be integrated into existing policy gradient algorithms, such as deep deterministic policy gradient (DDPG). The overall control framework is implemented and demonstrated in a dynamics simulation environment. Simulation results show that this approach can effectively control the soft arm to reach and track targets. Compared with DDPG and other model-free reinforcement learning algorithms, the proposed PGLS algorithm considerably improves convergence speed and performance. In addition, a fluid-driven soft manipulator is designed and fabricated, which can be used to verify the proposed PGLS algorithm in future real-world experiments.
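The abstract does not give the PGLS update rule, so the following is only an illustration of what "integrating long- and short-term rewards" could look like, not the paper's formulation. It blends a full-episode discounted return with a short-horizon one. The names `blended_return`, `horizon`, and `beta` are hypothetical.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards[t] * gamma**t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def blended_return(
    rewards: Sequence[float],
    gamma: float = 0.99,
    horizon: int = 3,   # hypothetical: how many steps count as "short-term"
    beta: float = 0.5,  # hypothetical: weight given to the short-term signal
) -> float:
    """Illustrative mix of a long-term and a short-term return (an assumption)."""
    long_term = discounted_return(rewards, gamma)
    short_term = discounted_return(rewards[:horizon], gamma)
    return (1.0 - beta) * long_term + beta * short_term
```

With `beta = 0` this reduces to the ordinary discounted return; larger values put more weight on the first `horizon` steps.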
The deep deterministic policy gradient (DDPG) algorithm is an off-policy method that combines two mainstream reinforcement learning approaches, one based on value iteration and one based on policy iteration. Using the DDPG algorithm, agents can explore and summarize the environment to make autonomous decisions in continuous state and action spaces. In this paper, a cooperative defense method based on DDPG for swarms of unmanned aerial vehicles (UAVs) is developed and validated, and it shows promising practical value for defense. We address the sparse reward problem of reinforcement learning in a long-term task by constructing the reward function for UAV swarms, and we optimize the learning process of the artificial neural network based on the DDPG algorithm to reduce oscillation during learning. The experimental results show that the DDPG algorithm can guide the UAV swarm to perform the defense task efficiently, meeting the swarm's requirements for decentralization and autonomy and promoting the intelligent development of UAV swarms and their decision-making processes.
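For orientation, standard DDPG rests on two pieces: a bootstrapped critic target and slowly moving target networks. The sketch below shows both in plain Python. It illustrates textbook DDPG only; the abstract does not describe the reward function or the specific stabilization changes used in the paper. Parameters are represented as flat lists of floats for simplicity, which is an assumption of this example.

```python
from typing import List, Sequence


def td_target(reward: float, next_q: float, done: bool, gamma: float = 0.99) -> float:
    """Critic target y = r + gamma * Q'(s', mu'(s')), with no bootstrap at episode end.

    next_q stands for the target critic's value at the target actor's action.
    """
    return reward + (0.0 if done else gamma * next_q)


def soft_update(
    target_params: Sequence[float],
    online_params: Sequence[float],
    tau: float = 0.005,
) -> List[float]:
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

A small `tau` makes the target networks trail the online networks, which is the usual way DDPG damps oscillation in the critic targets.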
In the complex and variable deep-sea environment, compensation control of ship motion ensures the safety and efficiency of equipment installation and transportation in offshore wind farms. However, the ship motion posture compensation control system is severely affected by uncertainties, which significantly degrade the accuracy of compensation control. In this paper, we propose a ship three-degree-of-freedom (3-DoF) motion posture stabilization control method based on the DTW-LSTM-MATD3 algorithm. We use the multi-agent twin delayed deep deterministic policy gradient (MATD3) to control a platform with six electric cylinders to achieve stable control. Because random noise affects the ship's motion posture, we use a dynamic time warping (DTW) algorithm to distinguish between high-frequency noise and low-frequency tracking signals. Further, we embed a long short-term memory (LSTM) network into the MATD3 network to better align the Critic network's training with the true Q-value. We use a combined reward function to enhance the agent's exploration capability in complex dynamic environments. Finally, the method was verified under level-six abrupt sea conditions with high-frequency noise and under real abrupt sea conditions, and a generalization test was also carried out. Simulation results show that the proposed DTW-LSTM-MATD3 method has strong compensation control capability.
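The abstract does not say how DTW is used to separate noise from the tracking signal. As a point of reference, here is a minimal sketch of the classic DTW distance between two one-dimensional sequences, using absolute difference as the local cost. The function name `dtw_distance` and that cost choice are assumptions of this example.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance via an O(len(a) * len(b)) table."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched with an earlier b
                cost[i][j - 1],      # b[j-1] also matched with an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

Because the alignment may stretch either sequence, a signal and a time-warped copy of it score a distance of zero, whereas a fixed offset between the two accumulates cost at every step.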
Funding (soft continuum arm PGLS study): partially supported by the National Key Research and Development Project Monitoring and Prevention of Major Natural Disasters Special Program (Grant No. 2020YFC1512202) and the Anhui University Cooperative Innovation Project (Grant No. GXXT-2019-003).
Funding (UAV swarm cooperative defense study): supported by the Key Research and Development Program of Shaanxi (2022GY-089) and the Natural Science Basic Research Program of Shaanxi (2022JQ-593).
Funding (ship motion compensation DTW-LSTM-MATD3 study): supported by the National Natural Science Foundation of China (No. 52105466).