Funding: This paper was prepared with the support of the National Youth Science Foundation.
Abstract: This paper deals with continuous-time Markov decision programming (CTMDP for short) with an unbounded reward rate. The economic criterion is the long-run average reward. For models with a countable state space and compact metric action sets, we present a set of sufficient conditions that ensure the existence of stationary optimal policies.
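For reference, the long-run average reward criterion named in this abstract is conventionally written as follows; the notation below is chosen for illustration and need not match the paper's own symbols.

```latex
% Long-run expected average reward of a policy \pi starting from state i
% (illustrative notation; the paper's own symbols may differ):
\[
  J(i,\pi) \;=\; \liminf_{T\to\infty} \frac{1}{T}\,
  \mathbb{E}^{\pi}_{i}\!\left[\int_{0}^{T} r\bigl(x(t),a(t)\bigr)\,dt\right],
\]
% where x(t) is the controlled state process, a(t) the action prescribed by \pi,
% and r the (possibly unbounded) reward rate. A stationary policy \pi^* is
% average-reward optimal if J(i,\pi^*) \ge J(i,\pi) for every state i and policy \pi.
```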
Abstract: Aim: To investigate a model-free, multi-step, average-reward reinforcement learning algorithm. Methods: By combining the R-learning algorithm with temporal-difference learning (TD(λ) learning) for average-reward problems, a novel incremental algorithm, called R(λ) learning, was proposed. Results and Conclusion: The proposed algorithm is a natural extension of Q(λ) learning, the multi-step discounted-reward reinforcement learning algorithm, to the average-reward case. Simulation results show that R(λ) learning with intermediate λ values achieves a significant performance improvement over simple R-learning.
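The abstract does not spell out the update rules of R(λ) learning. The sketch below shows one common way to combine the R-learning average-reward update with TD(λ)-style eligibility traces in a tabular setting; the environment interface (reset/step) and all hyperparameters are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

def r_lambda_learning(env, n_states, n_actions, episodes=500,
                      alpha=0.1, beta=0.01, lam=0.5, epsilon=0.1):
    """One common tabular way to combine R-learning with TD(lambda)-style
    eligibility traces. The abstract does not give the exact update rules,
    so this is an illustrative sketch, not the authors' algorithm; the
    env.reset()/env.step() interface is also an assumption."""
    Q = np.zeros((n_states, n_actions))      # relative action values
    rho = 0.0                                # running estimate of the average reward
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # eligibility traces
        s = env.reset()
        done = False
        while not done:
            greedy_a = int(np.argmax(Q[s]))
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else greedy_a
            s2, r, done = env.step(a)
            # R-learning TD error: reward measured relative to the average reward rho
            delta = r - rho + np.max(Q[s2]) - Q[s, a]
            e[s, a] += 1.0                   # accumulating trace
            Q += alpha * delta * e           # multi-step update through the traces
            if a == greedy_a:
                rho += beta * delta          # update rho only on greedy steps
                e *= lam                     # decay traces (no discount in the average-reward case)
            else:
                e[:] = 0.0                   # cut traces after exploratory actions (Watkins-style)
            s = s2
    return Q, rho
```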
Funding: This work was supported by the research fund of Hanyang University (HY-2019-N) and by the National Key Research & Development Program under Grant 2018YFA0701601.
Abstract: This paper proposes a reinforcement learning (RL) algorithm to find an optimal scheduling policy that minimizes the delay under a given energy constraint in a communication system where environment parameters, such as traffic arrival rates, are not known in advance and can change over time. For this purpose, the problem is formulated as an infinite-horizon Constrained Markov Decision Process (CMDP). To handle the constrained optimization problem, we first adopt the Lagrangian relaxation technique. Then, we propose a variant of Q-learning, Q-greedyUCB, that combines the ε-greedy and Upper Confidence Bound (UCB) algorithms to solve this constrained MDP. We mathematically prove that the Q-greedyUCB algorithm converges to an optimal solution. Simulation results also show that Q-greedyUCB finds an optimal scheduling strategy and is more efficient than Q-learning with ε-greedy, R-learning, and the Average-payoff RL (ARL) algorithm in terms of cumulative regret. We also show that our algorithm can learn and adapt to changes in the environment, so as to obtain an optimal scheduling strategy under a given power constraint in the new environment.
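To make the two ingredients named in this abstract concrete, here is a minimal sketch of Q-learning driven by a Lagrangian-relaxed reward and a UCB-plus-ε-greedy action choice. It is not the authors' Q-greedyUCB algorithm: the exact bonus term, update rule and average-reward handling are not given in the abstract, a discount factor is used only to keep the sketch simple, and the environment interface (step returning next state, delay cost and energy used) is an assumption.

```python
import numpy as np

def q_greedy_ucb_sketch(env, n_states, n_actions, lagrange_mult=1.0,
                        steps=100_000, alpha=0.1, gamma=0.99, c=2.0,
                        epsilon=0.05):
    """Illustrative sketch only: Q-learning on a Lagrangian-relaxed reward with
    a mixed epsilon-greedy / UCB action choice, loosely following the ideas
    described in the abstract (not the authors' exact Q-greedyUCB)."""
    Q = np.zeros((n_states, n_actions))
    counts = np.ones((n_states, n_actions))        # visit counts for the UCB bonus
    s = env.reset()
    for t in range(1, steps + 1):
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)       # epsilon-greedy exploration
        else:
            bonus = c * np.sqrt(np.log(t) / counts[s])
            a = int(np.argmax(Q[s] + bonus))       # UCB-guided greedy choice
        s2, delay_cost, energy_used = env.step(a)  # assumed interface
        # Lagrangian relaxation: fold the energy constraint into the reward
        reward = -delay_cost - lagrange_mult * energy_used
        # discounted update used here purely for simplicity of the sketch
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s2]) - Q[s, a])
        counts[s, a] += 1
        s = s2
    return Q
```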
Abstract: In a multi-stage manufacturing system, defective components are generated due to deteriorating machine parts and failure to install the feed load. In these circumstances, the system requires inspection counters to distinguish imperfect items and must take a number of discrete decisions to produce flawless items; at the same time, prioritising employee appreciation and rewards is one of the important policies for improving productivity. Here we model the multi-stage manufacturing system as an M/PH/1 queue, and rewards are given for using certain inspection strategies to produce quality items. A matrix-analytic method is proposed to describe a continuous-time Markov process in which reward points are assigned to the inspection strategy in each state of the system. By constructing the value functions of this dynamic programming model, we derive the optimal policy and the optimal long-run average reward of the entire system. In addition, we obtain the percentage of time spent in each system state for the probability of conformity and non-conformity of the product over the long term. The results of our computational experiments and case study suggest that the average reward increases due to the actions taken at each decision epoch for rework and disposal of non-conforming items.
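The abstract states that the optimal policy and long-run average reward are obtained from the value functions of a dynamic programming model. One standard way to compute these quantities for a finite, uniformized continuous-time Markov decision model is relative value iteration, sketched below; the transition and reward arrays are placeholders standing in for the M/PH/1 inspection model, not the paper's actual matrix-analytic construction.

```python
import numpy as np

def relative_value_iteration(P, r, tol=1e-8, max_iter=10_000):
    """Relative value iteration for a finite average-reward MDP.
    P[a] is the transition matrix under action a (e.g. from a uniformized CTMDP),
    r[a] the reward vector under action a; both are placeholders here.
    Returns the optimal long-run average reward g and a greedy policy."""
    n_actions, n_states = r.shape
    h = np.zeros(n_states)                       # relative value function
    for _ in range(max_iter):
        # one-step lookahead for every action: q[a, s] = r(s, a) + sum_j P(j|s,a) h(j)
        q = r + np.einsum('aij,j->ai', P, h)
        h_new = q.max(axis=0)
        g = h_new[0]                             # gain estimate from a reference state
        h_new = h_new - g                        # subtract the gain to keep values bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    policy = q.argmax(axis=0)                    # greedy policy w.r.t. the final values
    return g, policy
```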
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61374080 and 61374067, the Natural Science Foundation of Zhejiang Province under Grant No. LY12F03010, the Natural Science Foundation of Ningbo under Grant No. 2012A610032, and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: This paper studies the strong n (n = −1, 0)-discount and finite-horizon criteria for continuous-time Markov decision processes in Polish spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. Under mild conditions, the authors prove the existence of strong n (n = −1, 0)-discount optimal stationary policies by developing two equivalence relations: one between the standard expected average reward and strong −1-discount optimality, and the other between the bias and strong 0-discount optimality. The authors also prove the existence of an optimal policy for a finite-horizon control problem by developing an interesting characterization of a canonical triplet.
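The strong n-discount criteria are not defined in the abstract. For orientation, a conventional way of writing the underlying sensitive-discount notions (in the spirit of Veinott's n-discount optimality) is given below; the notation is illustrative, and the paper's precise "strong" definitions should be taken from the paper itself.

```latex
% alpha-discounted value of policy \pi starting from state i (illustrative notation):
\[
  V_\alpha(i,\pi) \;=\; \mathbb{E}^{\pi}_{i}\!\left[\int_{0}^{\infty}
     e^{-\alpha t}\, r\bigl(x(t),a(t)\bigr)\,dt\right], \qquad \alpha > 0 .
\]
% A policy \pi^* is n-discount optimal (n = -1, 0) if, for every state i and every policy \pi,
\[
  \liminf_{\alpha \downarrow 0}\; \alpha^{-n}
  \bigl[\, V_\alpha(i,\pi^*) - V_\alpha(i,\pi) \,\bigr] \;\ge\; 0 .
\]
% The case n = -1 is tied to the expected average reward and n = 0 to the bias,
% which is the pair of equivalences the paper develops for its stronger criteria.
```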