Abstract: The process of making decisions is something humans do inherently and routinely, to the extent that it appears commonplace. However, in order to achieve good overall performance, decisions must take into account both the outcomes of past decisions and the opportunities of future ones. Reinforcement learning, which is fundamental to sequential decision-making, consists of the following components: (1) a set of decision epochs; (2) a set of environment states; (3) a set of available actions for transitioning between states; and (4) state-action dependent immediate rewards for each action. At each decision epoch, the environment state provides the decision maker with a set of available actions from which to choose. As a result of selecting a particular action in that state, the environment generates an immediate reward for the decision maker and shifts to a different state and decision epoch. The ultimate goal of the decision maker is to maximize the total reward over a sequence of time steps. This paper focuses on an archetypal example of reinforcement learning, the stochastic multi-armed bandit problem. After introducing the dilemma, I briefly cover the most common methods used to solve it, namely the UCB and ε_n-greedy algorithms. I also introduce my own greedy implementation, the strict-greedy algorithm, which more tightly follows the greedy pattern in algorithm design, and show that it performs comparably to the two accepted algorithms.
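As background for the algorithms this abstract names, the following is a minimal sketch of UCB1 on a stochastic multi-armed bandit. The Bernoulli arm means, the horizon, and the `ucb1` helper are illustrative assumptions; the author's strict-greedy variant is not reproduced here.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 run on a Bernoulli bandit with the given true arm means."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms          # pulls per arm
    totals = [0.0] * n_arms        # cumulative reward per arm
    reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # play each arm once to initialize
        else:
            # choose the arm maximizing empirical mean + exploration bonus
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if rng.random() < means[arm] else 0.0   # Bernoulli reward draw
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward

print(ucb1([0.2, 0.5, 0.8], horizon=10_000))
```

An ε_n-greedy agent differs only in the selection step: with a probability that decays over time it picks a random arm, and otherwise it picks the arm with the highest empirical mean.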
Abstract: Artificial intelligence has permeated all aspects of our lives today. However, to make AI behave like real AI, the critical bottleneck lies in the speed of computing. Quantum computers employ the peculiar properties of quantum states, such as superposition, entanglement, and interference, to process information in ways that classical computers cannot. As a new paradigm of computation, quantum computers are capable of performing tasks intractable for classical processors, thus providing a quantum leap in AI research and making the development of real AI a possibility. In this regard, quantum machine learning not only enhances classical machine learning approaches but, more importantly, provides an avenue to explore new machine learning models that have no classical counterparts. Qubit-based quantum computers cannot naturally represent the continuous variables commonly used in machine learning, since the measurement outputs of qubit-based circuits are generally discrete. Therefore, a continuous-variable (CV) quantum architecture based on a photonic quantum computing model is selected for our study. In this work, we employ machine learning and optimization to create photonic quantum circuits that can solve the contextual multi-armed bandit problem, a problem in the domain of reinforcement learning, which demonstrates that quantum reinforcement learning algorithms can be learned by a quantum device.
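The photonic circuits themselves are not sketched here. As a purely classical illustration of the contextual multi-armed bandit problem that those circuits are trained to solve, below is a minimal ε-greedy agent with a linear reward model; the context dimension, number of arms, noise level, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, dim, steps, eps, lr = 4, 3, 5_000, 0.1, 0.05

# Hypothetical ground truth: expected reward of arm a in context x is w_true[a] @ x.
w_true = rng.normal(size=(n_arms, dim))
w_hat = np.zeros((n_arms, dim))            # agent's estimates, learned online

total = 0.0
for _ in range(steps):
    x = rng.normal(size=dim)                       # observe a context
    if rng.random() < eps:
        a = int(rng.integers(n_arms))              # explore
    else:
        a = int(np.argmax(w_hat @ x))              # exploit current estimates
    r = w_true[a] @ x + rng.normal(scale=0.1)      # noisy reward for chosen arm
    w_hat[a] += lr * (r - w_hat[a] @ x) * x        # SGD step on squared error
    total += r

print(f"average reward: {total / steps:.3f}")
```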
Funding: Supported by the Natural Science Foundation of Guangdong Province (No. 2023A1515011667), the Science and Technology Major Project of Shenzhen (No. KJZD20230923114809020), the Key Basic Research Foundation of Shenzhen (No. JCYJ20220818100205012), the Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515120020), the Shenzhen Science and Technology Program (No. RCBS20221008093331068), and the Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone Project (No. HZQSWS-KCCYB-2024016).
Abstract: The Thresholding Bandit (TB) problem is a popular sequential decision-making problem that aims at identifying the systems whose means are greater than a given threshold. Instead of working on an upper bound of a loss function, our approach stands out from conventional practice by directly minimizing the loss itself. Leveraging large deviation theory, we first provide an asymptotically optimal allocation rule for the TB problem, and then propose a parameter-free Large Deviation (LD) algorithm to make the allocation rule implementable. A central limit theorem-based Large Deviation (CLD) algorithm is further proposed as a supplement to improve computational efficiency using a normal approximation. Extensive experiments are conducted to validate the superiority of our algorithms over existing methods, and to demonstrate their broader applicability to more general distributions and various kinds of loss functions.
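To make the problem setting concrete, here is a small sketch of the TB task with Bernoulli systems under a fixed sampling budget. The allocation rule below is a simple APT-style heuristic chosen for illustration only, not the paper's large-deviation rule, and the arm means, threshold, and budget are invented.

```python
import math
import random

def threshold_bandit(means, threshold, budget, seed=0):
    """Flag the arms whose empirical mean exceeds `threshold` after `budget` pulls.

    Allocation: keep sampling the arm whose empirical mean is least confidently
    separated from the threshold (an APT-style heuristic, not the paper's
    large-deviation allocation).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k

    def pull(a):
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli system
        counts[a] += 1
        sums[a] += r

    for a in range(k):                 # initialize with one pull per arm
        pull(a)
    for _ in range(budget - k):
        # smaller score = closer to the threshold and/or fewer samples so far
        a = min(range(k),
                key=lambda i: math.sqrt(counts[i]) * abs(sums[i] / counts[i] - threshold))
        pull(a)
    return [sums[a] / counts[a] > threshold for a in range(k)]

print(threshold_bandit([0.3, 0.45, 0.55, 0.8], threshold=0.5, budget=5_000))
```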
Funding: Supported by the National Natural Science Foundation of China (Nos. 71701209, 71771216), the Shaanxi Natural Science Foundation (No. 2019JQ-250), and the China Post-doctoral Fund (No. 2019M653962).
Abstract: In order to cope with the increasing threat of ballistic missiles (BMs) within a shorter reaction time, the shooting policy of a layered defense system needs to be optimized. The main decision-making problem of shooting optimization is how to choose the next BM to be shot at, according to the previous engagements and their results, so as to maximize the expected return from BMs killed or to minimize the cost of BM penetration. Motivated by this, this study aims to determine an optimal shooting policy for a two-layer missile defense (TLMD) system. This paper considers a scenario in which the TLMD system wishes to shoot at a collection of BMs one at a time, and to maximize the return obtained from BMs killed before the system's demise. To provide a policy analysis tool, this paper develops a general model for shooting decision-making in which the shooting engagements are described as a discounted-reward Markov decision process. The proposed index shooting policy is a strategy that can effectively balance the shooting returns against the risk that the defense mission fails. The numerical results show that the index policy outperforms a range of competitors, especially in terms of the mean return and the mean number of BMs killed.
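The decision problem can be simulated compactly. The sketch below ranks targets by a simple myopic index (expected immediate return of engaging a BM) and plays them in that order until the system's demise; it is not the paper's index policy, and all kill probabilities, target values, the demise probability, and the discount factor are invented for illustration.

```python
import random

# Illustrative data, not from the paper: each remaining BM has a kill
# probability for the defense system and a value (return) if it is killed.
bms = [
    {"name": "BM-1", "p_kill": 0.85, "value": 3.0},
    {"name": "BM-2", "p_kill": 0.60, "value": 5.0},
    {"name": "BM-3", "p_kill": 0.40, "value": 8.0},
]
gamma = 0.95          # discount factor of the Markov decision process
p_demise = 0.10       # chance the defense system is destroyed after each engagement

def simulate(index, seed=0):
    """Engage BMs one at a time in decreasing order of the given index."""
    rng = random.Random(seed)
    order = sorted(bms, key=index, reverse=True)
    total, discount = 0.0, 1.0
    for bm in order:
        if rng.random() < bm["p_kill"]:
            total += discount * bm["value"]     # reward for a kill
        discount *= gamma
        if rng.random() < p_demise:             # system demise ends the episode
            break
    return total

# Myopic index: expected immediate return of engaging the BM.
runs = 10_000
avg = sum(simulate(lambda b: b["p_kill"] * b["value"], seed=i) for i in range(runs)) / runs
print(f"mean return under the myopic index policy: {avg:.2f}")
```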