Funding: Supported by the National Natural Science Foundation of China (No. 41971356, 41671400, 41701446), the National Key Research and Development Program of China (No. 2017YFB0503600, 2018YFB0505500), and the Hubei Province Natural Science Foundation of China (No. 2017CFB277).
Abstract: With the complexity of the composition process and the rapid growth of candidate services, realizing optimal or near-optimal service composition is an urgent problem. Currently, the static service composition chain is rigid and cannot easily adapt to the dynamic Web environment. To address these challenges, the geographic information service composition (GISC) problem is modeled as a sequential decision-making task. The Markov decision process (MDP), as a universal model for agent planning problems, is used to describe the GISC problem. Then, to achieve self-adaptivity and optimization in a dynamic environment, a novel approach that integrates Monte Carlo tree search (MCTS) and a temporal-difference (TD) learning algorithm is proposed. The concrete services for each abstract service are determined at runtime with optimal policies and adaptive capability, based on the environment and the status of component services. A simulation experiment demonstrates the effectiveness and efficiency of the approach in terms of learning quality and performance.
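A minimal sketch of the core idea under an assumed toy workflow (the candidate lists, QoS rewards, and hyperparameters below are illustrative, not the paper's): UCB-style selection over candidate concrete services, with TD(0) backups of state values guiding later simulations.

```python
import math
import random

# Hypothetical workflow: each abstract service has several candidate concrete
# services; each candidate's reward stands in for its QoS score (assumption).
CANDIDATES = {0: [0.6, 0.8, 0.5], 1: [0.7, 0.4], 2: [0.9, 0.3, 0.6]}
N_STEPS = len(CANDIDATES)

V = {s: 0.0 for s in range(N_STEPS + 1)}     # TD-learned state values
visits, q_sum = {}, {}                        # MCTS statistics per (state, action)
ALPHA, GAMMA, C_UCB = 0.1, 0.95, 1.4

def reward(state, action):
    # QoS of the chosen concrete service, possibly noisy at runtime.
    return CANDIDATES[state][action] + random.gauss(0.0, 0.05)

def select_action(state):
    # UCB1 over the candidate concrete services of the current abstract service.
    total = sum(visits.get((state, a), 0) for a in range(len(CANDIDATES[state]))) + 1
    def ucb(a):
        n = visits.get((state, a), 0)
        if n == 0:
            return float("inf")
        return q_sum[(state, a)] / n + C_UCB * math.sqrt(math.log(total) / n)
    return max(range(len(CANDIDATES[state])), key=ucb)

def simulate():
    # One search episode through the abstract workflow with TD(0) backups.
    state = 0
    while state < N_STEPS:
        action = select_action(state)
        r = reward(state, action)
        nxt = state + 1
        V[state] += ALPHA * (r + GAMMA * V[nxt] - V[state])   # TD(0) update
        visits[(state, action)] = visits.get((state, action), 0) + 1
        q_sum[(state, action)] = q_sum.get((state, action), 0.0) + r + GAMMA * V[nxt]
        state = nxt

for _ in range(500):
    simulate()

# Greedy composition after search: pick the best-rated concrete service per step.
plan = [max(range(len(CANDIDATES[s])),
            key=lambda a: q_sum.get((s, a), 0.0) / max(visits.get((s, a), 1), 1))
        for s in range(N_STEPS)]
print("selected concrete services:", plan, "state values:", V)
```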
Abstract: A promising approach to learning to play board games is to use reinforcement learning algorithms that learn a game position evaluation function. In this paper we examine and compare three different methods for generating training games: 1) learning by self-play, 2) learning by playing against an expert program, and 3) learning from viewing experts play against each other. Although the third option generates high-quality games from the start, compared with the initially random games generated by self-play, its drawback is that the learning program is never allowed to test the moves it prefers. Since our expert program uses an evaluation function similar to that of the learning program, we also examine whether it is helpful to learn directly from the board evaluations given by the expert. We compare these methods using temporal-difference learning with neural networks on the game of backgammon.
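A hedged sketch of the learning core shared by the three training schemes: a TD(λ) weight update for a tiny sigmoid evaluator over board features. The feature size, learning rate, and random training data are placeholders, not the paper's backgammon setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in evaluator: one logistic unit over hand-crafted board features
# (hypothetical feature size; a real backgammon evaluator uses a larger network).
N_FEATURES = 24
w = rng.normal(scale=0.01, size=N_FEATURES)

def evaluate(features):
    """Predicted probability that the side to move eventually wins."""
    return 1.0 / (1.0 + np.exp(-features @ w))

def td_lambda_update(positions, final_result, alpha=0.05, lam=0.7):
    """TD(lambda) over one game: positions is a list of feature vectors,
    final_result is 1.0 if the learner won, else 0.0."""
    global w
    trace = np.zeros_like(w)
    for t in range(len(positions)):
        x = positions[t]
        v = evaluate(x)
        grad = v * (1.0 - v) * x           # gradient of the sigmoid output w.r.t. w
        trace = lam * trace + grad         # eligibility trace over the game
        v_next = final_result if t == len(positions) - 1 else evaluate(positions[t + 1])
        w += alpha * (v_next - v) * trace  # TD error drives the weight update

# Toy usage: the training games could come from self-play, from playing an expert,
# or from watching two experts; here random feature sequences stand in for them.
for _ in range(100):
    game = [rng.random(N_FEATURES) for _ in range(30)]
    td_lambda_update(game, final_result=float(rng.random() < 0.5))
print("sample evaluation:", evaluate(rng.random(N_FEATURES)))
```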
Funding: Supported by the National Natural Science Foundation of China (61773170, 62173151) and the Natural Science Foundation of Guangdong Province (2023A1515010949, 2021A1515011850).
Abstract: Differential signals are key in control engineering because they anticipate the future behavior of process variables and are therefore critical in formulating control laws such as proportional-integral-derivative (PID). The practical challenge, however, is to extract such signals from noisy measurements; this difficulty was first addressed by J. Han in the form of linear and nonlinear tracking differentiators (TD). While improvements were made, the TD did not completely resolve the conflict between noise sensitivity and the accuracy and timeliness of differentiation. The two approaches proposed in this paper start with the basic linear TD but apply an iterative learning mechanism to the historical data in a moving window (MW), forming two new iterative learning tracking differentiators (IL-TD): a parallel IL-TD, which uses an iterative ladder network structure implementable in analog circuits, and a serial IL-TD, which is implementable digitally on any computer platform. Both algorithms are validated in simulations, which show that the two proposed IL-TDs have better tracking differentiation and de-noising performance than the existing linear TD.
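For reference, a minimal sketch of the basic linear tracking differentiator that both IL-TDs start from (Euler-discretized; the gain and the toy signal are illustrative, and the moving-window iterative-learning refinement itself is not reproduced here).

```python
import numpy as np

def linear_td(v, dt=0.01, r=20.0):
    """Basic discrete-time linear tracking differentiator (Euler discretization).
    x1 tracks the measured signal v; x2 is the estimated derivative.
    The gain r trades tracking speed against noise sensitivity."""
    x1, x2 = float(v[0]), 0.0
    x1_hist, x2_hist = [], []
    for vk in v:
        dx1 = x2
        dx2 = -r * r * (x1 - vk) - 2.0 * r * x2   # critically damped second-order filter
        x1 += dt * dx1
        x2 += dt * dx2
        x1_hist.append(x1)
        x2_hist.append(x2)
    return np.array(x1_hist), np.array(x2_hist)

# Toy usage: sine wave plus measurement noise; the true derivative is 2*pi*cos(2*pi*t).
t = np.arange(0.0, 2.0, 0.01)
v = np.sin(2 * np.pi * t) + 0.01 * np.random.randn(t.size)
x1_est, dv_est = linear_td(v)
print("derivative estimate at t=1.0:", dv_est[100], "true value:", 2 * np.pi * np.cos(2 * np.pi))
```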
Abstract: To meet the needs of practical large-scale Markov systems, this paper studies simulation-based learning optimization for Markov decision processes (MDP). Starting from the defining equation, a unified temporal-difference formula for the performance potential under both the average-reward and discounted-reward criteria is established. A neural network is used to represent the estimate of the performance potential, from which a parametric TD(0) learning formula and algorithm are derived for approximate policy evaluation. Then, based on the approximated performance potential, approximate policy iteration is used to realize a unified neuro-dynamic programming (NDP) optimization method under both criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the proposed neural policy iteration algorithm works under both criteria and verifies that the average-reward problem is the limit of the discounted problem as the discount factor approaches zero.
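A simplified sketch of the simulation-based TD(0) evaluation of the performance potential, using linear (one-hot) approximation instead of the paper's neural network; the toy MDP, step sizes, and policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MDP used only for illustration (hypothetical, not the paper's example).
N_STATES, N_ACTIONS = 5, 2
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # transition probabilities
R = rng.random((N_STATES, N_ACTIONS))                              # rewards

theta = np.zeros(N_STATES)   # linear (one-hot) approximation of the performance potential
eta = 0.0                    # running estimate of the average reward
ALPHA, BETA = 0.05, 0.01

def features(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def td0_potential(policy, steps=20000, discount=None):
    """Simulation-based TD(0) evaluation of the performance potential.
    discount=None gives the average-reward criterion (differential TD);
    a value in (0, 1) gives the discounted criterion."""
    global theta, eta
    s = 0
    for _ in range(steps):
        a = policy[s]
        r = R[s, a]
        s_next = rng.choice(N_STATES, p=P[s, a])
        if discount is None:
            delta = r - eta + theta @ features(s_next) - theta @ features(s)
            eta += BETA * delta                      # track the average reward
        else:
            delta = r + discount * (theta @ features(s_next)) - theta @ features(s)
        theta += ALPHA * delta * features(s)         # gradient step on the potential
        s = s_next

policy = rng.integers(N_ACTIONS, size=N_STATES)      # a fixed policy to evaluate
td0_potential(policy)
print("estimated average reward:", eta, "potential estimates:", theta)
```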
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62272037 and 61872039, the Beijing Natural Science Foundation under Grant No. 4162040, the Aeronautical Science Foundation of China under Grant No. 2016ZD74004, the Fundamental Research Funds for the Central Universities of China under Grant No. FRF-GF-19-B19, and the Australian Research Council Discovery Project under Grant No. DP210102447.
Abstract: Partition testing is one of the most fundamental and widely used software testing techniques. It first divides the input domain of the program under test into a set of disjoint partitions and then creates test cases based on these partitions. Motivated by the theory of software cybernetics, several strategies have been proposed to dynamically select partitions based on feedback gained during testing. The basic intuition of these strategies is to assign higher probabilities to partitions with higher fault-detection potential, judged and updated mainly according to previous test results. Such a feedback-driven mechanism can be considered a learning process: it makes decisions based on observations acquired during test execution. Accordingly, advanced learning techniques can be leveraged to empower smart partition selection, with the purpose of further improving the effectiveness and efficiency of partition testing. In this paper, we leverage reinforcement learning to enhance state-of-the-art adaptive partition testing techniques. Two algorithms, namely RLAPT_Q and RLAPT_S, have been developed to implement the proposed approach. Empirical studies have been conducted to evaluate the performance of the proposed approach on seven object programs with 26 faults. The experimental results show that our approach outperforms existing partition testing techniques in terms of fault-detection capability as well as overall testing time. Our study demonstrates the applicability and effectiveness of reinforcement learning in advancing the performance of software testing.
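A hedged sketch of the feedback-driven idea behind reinforcement-learning-based partition selection (a bandit-style simplification with made-up failure rates, not the paper's RLAPT_Q or RLAPT_S algorithms).

```python
import random

# Hypothetical setup: 4 input partitions with unknown failure rates (assumption,
# not the paper's subject programs).
FAILURE_RATES = [0.01, 0.05, 0.002, 0.08]
N_PARTITIONS = len(FAILURE_RATES)

Q = [0.0] * N_PARTITIONS      # estimated fault-detection potential per partition
ALPHA, EPSILON = 0.1, 0.2

def run_test_case(partition):
    """Execute one randomly generated test case from the partition; return True
    on failure (simulated here by the partition's hidden failure rate)."""
    return random.random() < FAILURE_RATES[partition]

def select_partition():
    # Epsilon-greedy: mostly exploit the partition with the highest estimate.
    if random.random() < EPSILON:
        return random.randrange(N_PARTITIONS)
    return max(range(N_PARTITIONS), key=lambda p: Q[p])

failures = 0
for _ in range(2000):
    p = select_partition()
    reward = 1.0 if run_test_case(p) else 0.0
    failures += int(reward)
    # Feedback-driven update: the result of the previous execution drives the
    # next partition choice, as described in the abstract above.
    Q[p] += ALPHA * (reward - Q[p])

print("failures found:", failures, "learned partition values:", Q)
```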
Abstract: Learning-based algorithms attract great attention in the autonomous driving control field, especially for decision-making, to meet the challenge of long-tail extreme scenarios, where traditional methods show poor adaptability even with significant effort. To improve autonomous driving performance in extreme scenarios, specifically consecutive sharp turns, decision-making policies based on three deep reinforcement learning algorithms, i.e., Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC), are proposed in this study. The role of the observation variables in agent training is discussed by comparing the driving stability, average speed, and computational effort of the proposed algorithms on curves with various curvatures. In addition, a novel reward-setting method that combines the states of the environment and the vehicle is proposed to solve the sparse reward problem in reward-guided algorithms. Simulation results on a road with consecutive sharp turns show that vehicles based on the DDPG, SAC, and TD3 algorithms take 367.2, 359.6, and 302.1 s, respectively, to finish the task, which matches the training results and verifies the role of the observation variables in improving agent quality.
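A hedged sketch of a dense reward combining environment and vehicle states to counter reward sparsity in sharp turns; the terms, weights, and penalty values are illustrative assumptions rather than the paper's exact reward design.

```python
import math

def shaped_reward(lateral_offset, heading_error, speed, target_speed,
                  off_track, w_lat=1.0, w_head=0.5, w_speed=0.3):
    """Hypothetical dense reward combining environment state (lateral offset from
    the lane center, track departure) and vehicle state (heading error, speed).
    All weights and terms are illustrative assumptions."""
    if off_track:
        return -10.0                                    # terminal penalty for leaving the road
    r_lat = -w_lat * abs(lateral_offset)                # stay near the lane center
    r_head = -w_head * abs(heading_error)               # stay aligned with the road
    r_speed = -w_speed * abs(speed - target_speed)      # keep a sensible speed in the turn
    return r_lat + r_head + r_speed + 0.1               # small living bonus each step

# Example: slightly off-center, small heading error, a bit below the target speed.
print(shaped_reward(lateral_offset=0.4, heading_error=math.radians(5),
                    speed=12.0, target_speed=15.0, off_track=False))
```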
Funding: Supported by the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia (Project No. MoE-IF-UJ-R2-22-04220773-1).
Abstract: Domain randomization is a widely adopted technique in deep reinforcement learning (DRL) to improve agent generalization by exposing policies to diverse environmental conditions. This paper investigates the impact of different reset strategies (normal, non-randomized, and randomized) on agent performance using the Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) algorithms within the CarRacing-v2 environment. Two experimental setups were conducted: an extended training regime with DDPG for 1000 steps per episode across 1000 episodes, and a fast execution setup comparing DDPG and TD3 for 30 episodes with 50 steps per episode under constrained computational resources. A step-based reward scaling mechanism was applied under the randomized reset condition to promote broader state exploration. Experimental results show that randomized resets significantly enhance learning efficiency and generalization, with DDPG demonstrating superior performance across all reset strategies. In particular, DDPG combined with randomized resets achieves the highest smoothed rewards (reaching approximately 15), the best stability, and the fastest convergence. These differences are statistically significant, as confirmed by t-tests: DDPG outperforms TD3 under randomized (t = −101.91, p < 0.0001), normal (t = −21.59, p < 0.0001), and non-randomized (t = −62.46, p < 0.0001) reset conditions. The findings underscore the critical role of reset strategy and reward shaping in enhancing the robustness and adaptability of DRL agents in continuous control tasks, particularly where computational efficiency and training stability are crucial.
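A hedged sketch of a randomized-reset wrapper with step-based reward scaling around a Gymnasium environment; the scaling rule and seed handling are assumptions, not the paper's exact setup.

```python
import random
import gymnasium as gym

class RandomizedResetWrapper(gym.Wrapper):
    """Hypothetical wrapper illustrating a 'randomized reset' strategy: every
    episode starts from a freshly seeded track, and rewards are scaled with the
    step index to encourage longer, more exploratory episodes (the exact scaling
    used in the paper is an assumption here)."""

    def __init__(self, env, scale_per_step=0.01):
        super().__init__(env)
        self.scale_per_step = scale_per_step
        self.step_count = 0

    def reset(self, **kwargs):
        self.step_count = 0
        # Randomized reset: a new random seed gives a different track layout.
        kwargs["seed"] = random.randrange(10**6)
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.step_count += 1
        # Step-based reward scaling: later steps in an episode weigh slightly more.
        reward *= 1.0 + self.scale_per_step * self.step_count
        return obs, reward, terminated, truncated, info

# Usage sketch (requires gymnasium[box2d]); a non-randomized variant would simply
# pass a fixed seed in reset() instead of a fresh random one.
env = RandomizedResetWrapper(gym.make("CarRacing-v2", continuous=True))
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("scaled reward after one step:", reward)
```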
Abstract: To address the frequent communication and excessive energy consumption of existing cooperative learning algorithms, this paper builds a task model around target tracking and proposes a sensor node task scheduling algorithm based on Q-learning and TD error (QT). Specifically, the sensor node task scheduling problem is mapped to a learning problem solvable by Q-learning, a cooperation mechanism between neighboring nodes is established, and basic learning elements such as the delayed reward and the state space are defined. In the cooperation mechanism, QT lets a sensor node use the individual and group TD errors to dynamically change its own learning rate, balancing its own interests against those of the group. In addition, QT uses the Metropolis criterion to increase the exploration probability in the early stage of learning, optimizing task selection. Experimental results show that QT can schedule tasks dynamically according to the current environment; compared with other task scheduling algorithms, QT improves per-unit performance by 17.26% with reasonable energy consumption.
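A hedged sketch of the two mechanisms described above, namely a learning rate modulated by individual versus group TD errors and Metropolis-criterion exploration; the toy states, tasks, rewards, and neighbor statistic are illustrative assumptions, not the paper's QT algorithm.

```python
import math
import random

# Hypothetical toy setting: one sensor node chooses among 3 tasks (e.g. sense,
# transmit, sleep); rewards, states, and neighbor statistics are illustrative only.
N_STATES, N_TASKS = 4, 3
Q = [[0.0] * N_TASKS for _ in range(N_STATES)]
GAMMA, BASE_ALPHA, TEMPERATURE = 0.9, 0.2, 1.0

def metropolis_explore(q_best, q_candidate, t):
    """Metropolis criterion: accept a worse task with a probability that decays
    as training progresses, keeping exploration high early on."""
    if q_candidate >= q_best:
        return True
    temp = TEMPERATURE / (1.0 + 0.01 * t)
    return random.random() < math.exp((q_candidate - q_best) / max(temp, 1e-6))

def choose_task(state, t):
    best = max(range(N_TASKS), key=lambda a: Q[state][a])
    candidate = random.randrange(N_TASKS)
    return candidate if metropolis_explore(Q[state][best], Q[state][candidate], t) else best

def qt_update(state, action, reward, next_state, neighbor_td_mean):
    # Individual TD error of this node for the executed task.
    td = reward + GAMMA * max(Q[next_state]) - Q[state][action]
    # Learning rate grows when the node's own error dominates the group's error,
    # so the node adapts faster; it shrinks when the group is already well adapted.
    alpha = BASE_ALPHA * abs(td) / (abs(td) + abs(neighbor_td_mean) + 1e-6)
    Q[state][action] += alpha * td
    return td

# Toy training loop with a made-up reward model and a fake neighbor statistic.
state, neighbor_td = 0, 0.5
for t in range(1000):
    action = choose_task(state, t)
    reward = random.random() - 0.2 * action          # hypothetical energy-aware reward
    next_state = random.randrange(N_STATES)
    neighbor_td = 0.9 * neighbor_td + 0.1 * abs(qt_update(state, action, reward, next_state, neighbor_td))
    state = next_state

print("learned Q-table:", Q)
```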