Abstract: Q-learning is a popular temporal-difference reinforcement learning algorithm which often explicitly stores state values using lookup tables. This implementation has been proven to converge to the optimal solution, but it is often beneficial to use a function-approximation system, such as deep neural networks, to estimate state values. It has been previously observed that Q-learning can be unstable when using value function approximation or when operating in a stochastic environment. This instability can adversely affect the algorithm's ability to maximize its returns. In this paper, we present a new algorithm called Multi Q-learning to attempt to overcome the instability seen in Q-learning. We test our algorithm on a 4 × 4 grid-world with different stochastic reward functions using various deep neural networks and convolutional networks. Our results show that in most cases, Multi Q-learning outperforms Q-learning, achieving average returns up to 2.5 times higher than Q-learning and having a standard deviation of state values as low as 0.58.
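The abstract does not spell out the Multi Q-learning update rule. A minimal tabular sketch of one plausible reading, in which several independent Q estimates are kept and averaged when forming the bootstrap target (an assumption in the spirit of ensemble and Double Q-learning, not the paper's confirmed rule), might look like this; the 4 × 4 grid-world sizes follow the abstract, while the number of estimators and the hyperparameters are placeholders:

```python
import numpy as np

# Hypothetical Multi Q-learning sketch: several independent Q-tables on a
# 4 x 4 grid-world. Averaging the remaining estimators to build the target
# is an assumption, not the paper's confirmed update rule.
N_ESTIMATORS = 4
N_STATES, N_ACTIONS = 16, 4          # 4 x 4 grid; actions: up/down/left/right
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

rng = np.random.default_rng(0)
q_tables = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_ESTIMATORS)]

def select_action(state):
    """Epsilon-greedy over the mean of all Q estimates."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    mean_q = np.mean([q[state] for q in q_tables], axis=0)
    return int(np.argmax(mean_q))

def update(state, action, reward, next_state):
    """Update one randomly chosen estimator toward a target built from
    the average of the remaining estimators."""
    i = int(rng.integers(N_ESTIMATORS))
    others = [q for j, q in enumerate(q_tables) if j != i]
    target = reward + GAMMA * np.max(np.mean([q[next_state] for q in others], axis=0))
    q_tables[i][state, action] += ALPHA * (target - q_tables[i][state, action])
```

Acting on the mean of several estimators while bootstrapping each one from the others is a common way such ensembles damp the overestimation that destabilizes plain Q-learning, which is consistent with the stability motivation stated in the abstract.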
Funding: Supported by the National Natural Science Foundation of China (61771095, 62031007).
Abstract: The dwell scheduling problem for a multifunctional radar system is cast as a corresponding optimization problem. To solve it, the dwell scheduling process in a scheduling interval (SI) is formulated as a Markov decision process (MDP), where the state, action, and reward are specified for this dwell scheduling problem. Specifically, the action is defined as scheduling the task on the left side, the right side, or in the middle of the radar idle timeline, which effectively reduces the action space and accelerates the convergence of training. Through the above process, a model-free reinforcement learning framework is established. Then, an adaptive dwell scheduling method based on Q-learning is proposed, where the converged Q-value table obtained after training is used to guide the scheduling process. Simulation results demonstrate that, compared with existing dwell scheduling algorithms, the proposed one achieves better scheduling performance when the urgency criterion, the importance criterion, and the desired execution time criterion are considered comprehensively. The average running time shows that the proposed algorithm has real-time performance.
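A hedged sketch of the tabular loop this abstract describes, with the three placement actions (left edge, right edge, or middle of the radar idle timeline) made explicit; the state encoding, reward, and hyperparameters below are placeholders rather than the paper's definitions:

```python
from collections import defaultdict
import random

# Sketch of tabular Q-learning for dwell scheduling. The action places the
# current task on the LEFT or RIGHT edge of the idle timeline, or in the
# MIDDLE of it, as the abstract describes. State and reward are assumed.
ACTIONS = ("left", "right", "middle")
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state):
    """Epsilon-greedy action selection over the three placement choices."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(q_table[state], key=q_table[state].get)

def q_update(state, action, reward, next_state, done):
    """Standard one-step Q-learning update."""
    best_next = 0.0 if done else max(q_table[next_state].values())
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])

# After training converges, the table is used greedily (EPSILON = 0) to place
# each task in the scheduling interval, as the abstract indicates.
```

Restricting the action set to three placement choices, rather than every feasible start time, is what keeps the table small and the training fast, which matches the abstract's claim about reduced action space and faster convergence.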
Abstract: To address the shortcomings of traditional network anomaly diagnosis algorithms in unsupervised settings, such as low accuracy in anomaly localization and anomaly data classification, a wireless network anomaly diagnosis method based on an improved Q-learning algorithm is designed. First, the wireless network's data flow is collected via ADU (Asynchronous Data Unit) units and packet features are extracted. Then, a Q-learning model is built to explore the balance point between state values and reward values, and the SA (Simulated Annealing) algorithm is used to identify the next-time-step state accurately from a global perspective. Finally, the joint distribution probability of the training samples is determined to improve the approximation quality of the output values and achieve a balance between exploration and cost. Test results show that the improved Q-learning algorithm achieves an average network anomaly localization accuracy of 99.4%, and it also outperforms three traditional network anomaly diagnosis methods in classification accuracy and classification efficiency across different types of network anomalies.
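The abstract couples Q-learning with simulated annealing (SA) to obtain a global view of the next state. One plausible, hedged reading is an SA-style acceptance rule layered over the Q-values, where a worse candidate is still accepted with a temperature-controlled probability so the search stays global early on and becomes greedy as the temperature cools; the temperature schedule and interfaces below are illustrative assumptions, not the paper's exact design:

```python
import math
import random

# SA-flavoured selection over Q-values (assumed coupling, for illustration).
T0, COOLING = 1.0, 0.995  # initial temperature and geometric cooling factor

def sa_select(q_values, temperature):
    """Pick an index into q_values: the greedy choice is always acceptable,
    a random candidate is accepted with probability exp(delta / T)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    candidate = random.randrange(len(q_values))
    delta = q_values[candidate] - q_values[best]
    if delta >= 0 or random.random() < math.exp(delta / max(temperature, 1e-8)):
        return candidate
    return best

# Inside the training loop (sketch):
#   action = sa_select(q_table[state], temperature)
#   ... apply the usual Q-learning update ...
#   temperature *= COOLING   # anneal toward greedy behaviour
```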
Abstract: As the demand for agent path planning in complex dynamic environments grows, the limitations of the traditional Q-Learning algorithm in convergence speed, obstacle avoidance efficiency, and global optimization capability become increasingly apparent. To address these shortcomings, this paper proposes an improved method that combines a dynamic learning rate, an adaptive exploration rate, and Monte Carlo Tree Search (MCTS). First, exponentially decaying learning and exploration rates are introduced to balance the algorithm's exploration capability in early training against its policy stability in later stages. Second, MCTS is combined with Q-Learning, using the global search capability of MCTS to optimize the Q-value update process. In addition, a heuristic function is incorporated into the reward mechanism to guide the agent toward the goal more efficiently. Experimental results show that the improved algorithm significantly outperforms the traditional algorithm in average step count, convergence speed, and stability, providing an efficient and robust solution for agent path planning in complex environments.
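The exponentially decaying learning rate and exploration rate are concrete enough to sketch directly; the initial values, floors, and decay constant below are illustrative assumptions rather than the paper's tuned settings:

```python
import math

# Exponentially decaying schedules for the learning rate (alpha) and the
# exploration rate (epsilon), as described in the abstract. All constants
# here are placeholder values for illustration.
ALPHA_0, EPSILON_0 = 0.5, 1.0        # initial learning / exploration rates
ALPHA_MIN, EPSILON_MIN = 0.01, 0.05  # floors so the agent never stops adapting
DECAY = 0.001                        # decay constant per episode

def schedules(episode):
    """Return (alpha, epsilon) for the given episode index."""
    alpha = max(ALPHA_MIN, ALPHA_0 * math.exp(-DECAY * episode))
    epsilon = max(EPSILON_MIN, EPSILON_0 * math.exp(-DECAY * episode))
    return alpha, epsilon

# Early episodes: large alpha/epsilon -> broad exploration and fast updates.
# Late episodes:  small alpha/epsilon -> stable, nearly greedy policy.
```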
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2016YFC1400200; in part by the Basic Research Program of Science and Technology of Shenzhen, China under Grant No. JCYJ20190809161805508; in part by the Fundamental Research Funds for the Central Universities of China under Grant No. 20720200092; in part by the Xiamen University's Honors Program for Undergraduates in Marine Sciences under Grant No. 22320152201106; and in part by the National Natural Science Foundation of China under Grants No. 41476026, 41976178, and 61801139.
Abstract: Routing plays a critical role in data transmission for underwater acoustic sensor networks (UWSNs) in the internet of underwater things (IoUT). Traditional routing methods suffer from high end-to-end delay, limited bandwidth, and high energy consumption. With the development of artificial intelligence and machine learning algorithms, many researchers apply these new methods to improve the quality of routing. In this paper, we propose a Q-learning-based multi-hop cooperative routing protocol (QMCR) for UWSNs. Our protocol automatically chooses the nodes with the maximum Q-value as forwarders based on distance information. Moreover, we combine cooperative communications with the Q-learning algorithm to reduce network energy consumption and improve communication efficiency. Experimental results show that the running time of QMCR is less than one-tenth of that of the artificial fish-swarm algorithm (AFSA), while the routing energy consumption is kept at the same level. Owing to its extremely fast speed, QMCR is a promising routing design for UWSNs, especially when facing the highly dynamic underwater acoustic channels of real ocean environments.
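A hedged sketch of the forwarder-selection step the abstract describes, choosing the neighbour with the maximum learned Q-value; the neighbour representation and the distance/energy reward shaping are assumptions for illustration, not the protocol's exact definitions:

```python
# Forwarder selection in the spirit of QMCR (assumed interfaces).
def select_forwarder(q_values, candidate_neighbors):
    """q_values: dict mapping neighbour id -> learned Q-value.
    Returns the candidate neighbour with the maximum Q-value."""
    return max(candidate_neighbors, key=lambda n: q_values.get(n, float("-inf")))

def reward(distance_to_sink_before, distance_to_sink_after, energy_cost):
    """Assumed reward shaping: favour progress toward the sink (from distance
    information) while penalising transmission energy."""
    return (distance_to_sink_before - distance_to_sink_after) - 0.1 * energy_cost
```

Because the per-hop decision reduces to a single table lookup and maximum, the running-time advantage over population-based searches such as AFSA reported in the abstract is plausible by construction.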