Journal Articles: 5 results found
1. CONTINUOUS TIME MARKOV DECISION PROGRAMMING WITH AVERAGE REWARD CRITERION AND UNBOUNDED REWARD RATE
Author: Zheng Shaohui (郑少慧). Acta Mathematicae Applicatae Sinica, SCIE/CSCD, 1991, No. 1, pp. 6-16 (11 pages).
This paper deals with continuous time Markov decision programming (CTMDP for short) with an unbounded reward rate. The economic criterion is the long-run average reward. For models with a countable state space and compact metric action sets, we present a set of sufficient conditions that ensure the existence of stationary optimal policies.
Keywords: continuous time Markov decision programming, average reward criterion, unbounded reward rate, CTMDP
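For reference, the long-run average reward criterion this paper optimizes can be written as below. This is a standard formulation sketched from the abstract; the symbols (J, r, x_t, a_t) are assumed notation, not quoted from the paper.

```latex
% Long-run average reward of policy \pi from initial state i, where r is
% the (possibly unbounded) reward rate and (x_t, a_t) the state-action
% trajectory. Notation assumed, not the paper's own.
J(\pi, i) = \liminf_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_i^{\pi}\!\left[\int_0^T r(x_t, a_t)\, dt\right]
% A stationary policy \pi^* is average-optimal when
% J(\pi^*, i) = \sup_{\pi} J(\pi, i) for every state i.
```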
2. Incremental Multi-Step R-Learning
Authors: Hu Guanghua (胡光华), Wu Cangpu (吴沧浦). Journal of Beijing Institute of Technology, EI/CAS, 1999, No. 3, pp. 245-250 (6 pages).
Aim: To investigate a model-free multi-step average-reward reinforcement learning algorithm. Methods: By combining R-learning with temporal difference learning (TD(λ) learning) for average-reward problems, a novel incremental algorithm, called R(λ) learning, was proposed. Results and Conclusion: The proposed algorithm is a natural extension of Q(λ) learning, the multi-step discounted-reward reinforcement learning algorithm, to the average-reward case. Simulation results show that R(λ) learning with intermediate λ values yields a significant performance improvement over simple R-learning.
Keywords: reinforcement learning, average reward, R-learning, Markov decision processes, temporal difference learning
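To make the R(λ) idea concrete, here is a minimal tabular sketch combining Schwartz-style R-learning with eligibility traces. The environment interface (reset, step, actions) and the Watkins-style trace cutting are illustrative assumptions; the paper's exact update schedule may differ.

```python
import random
from collections import defaultdict

def r_lambda_learning(env, steps=50_000, alpha=0.1, beta=0.01,
                      epsilon=0.1, lam=0.5):
    """Sketch of average-reward R(lambda) learning (assumed interfaces)."""
    Q = defaultdict(float)    # relative action values
    e = defaultdict(float)    # eligibility traces
    rho = 0.0                 # estimate of the long-run average reward
    s = env.reset()
    for _ in range(steps):
        greedy_a = max(env.actions, key=lambda x: Q[s, x])
        best_here = Q[s, greedy_a]            # max_a Q(s, a) before updates
        a = random.choice(env.actions) if random.random() < epsilon else greedy_a
        s2, r = env.step(a)
        best_next = max(Q[s2, a2] for a2 in env.actions)
        delta = r - rho + best_next - Q[s, a]  # average-reward TD error
        e[s, a] += 1.0                         # accumulating trace
        for key in list(e):
            Q[key] += alpha * delta * e[key]
            # Watkins-style: decay traces, but cut them after exploration
            e[key] = lam * e[key] if a == greedy_a else 0.0
        if a == greedy_a:
            # update rho only on greedy transitions, as in R-learning
            rho += beta * (r + best_next - best_here - rho)
        s = s2
    return Q, rho
```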
3. Q-greedyUCB: a New Exploration Policy to Learn Resource-Efficient Scheduling
Authors: Yu Zhao, Joohyun Lee, Wei Chen. China Communications, SCIE/CSCD, 2021, No. 6, pp. 12-23 (12 pages).
This paper proposes a reinforcement learning (RL) algorithm to find an optimal scheduling policy that minimizes delay under a given energy constraint in a communication system where environment parameters, such as traffic arrival rates, are not known in advance and can change over time. For this purpose, the problem is formulated as an infinite-horizon constrained Markov decision process (CMDP). To handle the constrained optimization problem, we first adopt the Lagrangian relaxation technique. Then, we propose Q-greedyUCB, a variant of Q-learning that combines ε-greedy and Upper Confidence Bound (UCB) algorithms to solve this constrained MDP. We mathematically prove that the Q-greedyUCB algorithm converges to an optimal solution. Simulation results also show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with ε-greedy, R-learning, and the Average-payoff RL (ARL) algorithm in terms of cumulative regret. We also show that our algorithm can learn and adapt to changes in the environment, obtaining an optimal scheduling strategy under a given power constraint for the new environment.
Keywords: reinforcement learning for average rewards, infinite-horizon Markov decision process, upper confidence bound, queue scheduling
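A rough sketch of the two ingredients the abstract names, Lagrangian relaxation plus UCB exploration inside an average-reward Q-learning loop, might look as follows. The environment interface, the fixed multiplier eta, and the bonus constant c are assumptions for illustration, not the paper's exact algorithm.

```python
import math
from collections import defaultdict

def q_greedy_ucb(env, steps=100_000, alpha=0.1, beta=0.01, eta=1.0, c=2.0):
    """Sketch in the spirit of Q-greedyUCB (assumed interfaces/constants)."""
    Q = defaultdict(float)   # relative action values
    N = defaultdict(int)     # visit counts for the UCB bonus
    rho = 0.0                # average (Lagrangian) reward estimate
    s = env.reset()
    for t in range(1, steps + 1):
        # UCB score: value estimate plus an optimism bonus that shrinks
        # with visits, replacing undirected epsilon-greedy noise
        a = max(env.actions,
                key=lambda a2: Q[s, a2]
                + c * math.sqrt(math.log(t) / (N[s, a2] + 1)))
        s2, delay_cost, energy_cost = env.step(a)
        # Lagrangian relaxation: fold the energy constraint into the reward
        r = -(delay_cost + eta * energy_cost)
        N[s, a] += 1
        best_here = max(Q[s, a2] for a2 in env.actions)
        best_next = max(Q[s2, a2] for a2 in env.actions)
        Q[s, a] += alpha * (r - rho + best_next - Q[s, a])
        rho += beta * (r + best_next - best_here - rho)
        s = s2
    return Q, rho
```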
4. Inspection strategies for quality products with rewards in a multi-stage production
Authors: R. Satheesh Kumar, A. Nagarajan. Journal of Control and Decision, EI, 2023, No. 4, pp. 596-609 (14 pages).
In a multi-stage manufacturing system, defective components are generated by deteriorating machine parts and failures to install the feed load. In these circumstances, the system requires inspection counters to distinguish imperfect items and takes a few discrete decisions to produce impeccable items. Moreover, prioritising employee appreciation and reward schemes is an important policy for improving productivity. Here we model the multi-stage manufacturing system as an M/PH/1 queue, and rewards are given for using certain inspection strategies to produce quality items. A matrix analytical method is proposed to describe a continuous-time Markov process in which reward points are assigned to the inspection strategy in each state of the system. By constructing the value functions of this dynamic programming model, we derive the optimal policy and the optimal long-run average reward of the entire system. In addition, we obtain the percentage of time spent in each system state for the probability of conformity and non-conformity of the product over the long term. The results of our computational experiments and case study suggest that the average reward increases due to the actions taken at each decision epoch for rework and disposal of non-conforming items.
Keywords: sequential process, infinitesimal operator, Markov decision processes, value function, optimal policy, optimal average reward
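The abstract's dynamic-programming step, deriving the optimal policy and long-run average reward from value functions, can be illustrated with generic relative value iteration on a finite uniformized chain. This is a stand-in sketch under those assumptions, not the paper's matrix analytical method for the M/PH/1 model.

```python
import numpy as np

def relative_value_iteration(P, r, tol=1e-8, max_iter=10_000):
    """Relative value iteration for a finite average-reward MDP.

    P[a] is the transition matrix under action a, r[a][s] the expected
    reward; a uniformized (discrete-time) chain is assumed.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    h = np.zeros(n_states)                    # relative value function
    for _ in range(max_iter):
        # one-step Bellman backup for every action, then maximize
        q = np.array([r[a] + P[a] @ h for a in range(n_actions)])
        h_new = q.max(axis=0)
        g = h_new[0]                          # gain estimate at reference state 0
        h_new -= g                            # keep values relative
        if np.max(np.abs(h_new - h)) < tol:
            break
        h = h_new
    policy = q.argmax(axis=0)                 # greedy policy from the backup
    return g, h, policy
```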
5. STRONG N-DISCOUNT AND FINITE-HORIZON OPTIMALITY FOR CONTINUOUS-TIME MARKOV DECISION PROCESSES (cited by 1)
Authors: ZHU Quanxin, GUO Xianping. Journal of Systems Science & Complexity, SCIE/EI/CSCD, 2014, No. 5, pp. 1045-1063 (19 pages).
This paper studies the strong n (n = -1, 0)-discount and finite-horizon criteria for continuous-time Markov decision processes in Polish spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Under mild conditions, the authors prove the existence of strong n (n = -1, 0)-discount optimal stationary policies by developing two equivalence relations: one between the standard expected average reward and strong -1-discount optimality, and the other between the bias and strong 0-discount optimality. The authors also prove the existence of an optimal policy for a finite-horizon control problem by developing an interesting characterization of a canonical triplet.
Keywords: continuous-time Markov decision process, expected average reward criterion, finite-horizon optimality, Polish space, strong n-discount optimality
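For context, strong n-discount optimality is usually defined through the vanishing-discount limit below. The notation is the standard one from the sensitive-discount literature, assumed here rather than quoted from the paper.

```latex
% alpha-discounted value of policy \pi from state x (notation assumed):
%   V_\alpha(\pi, x) = \mathbb{E}_x^{\pi} \int_0^\infty
%                        e^{-\alpha t} r(x_t, a_t)\, dt.
% A policy \pi^* is strong n-discount optimal (n = -1, 0) if, for every
% policy \pi and every state x,
\liminf_{\alpha \downarrow 0} \alpha^{-n}
  \bigl(V_\alpha(\pi^*, x) - V_\alpha(\pi, x)\bigr) \ge 0 .
% n = -1 recovers expected-average-reward (gain) optimality; n = 0
% refines it to bias optimality, matching the paper's two equivalences.
```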