Journal Articles
370 articles found
1. Optimal Policies for Quantum Markov Decision Processes (Cited by 2)
Authors: Ming-Sheng Ying, Yuan Feng, Sheng-Gang Ying. International Journal of Automation and Computing (EI, CSCD), 2021, Issue 3, pp. 410-421 (12 pages)
Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
Keywords: quantum Markov decision processes, quantum machine learning, reinforcement learning, dynamic programming, decision making
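For readers new to the classical backbone this paper extends, here is a minimal finite-horizon backward-induction sketch in Python; the toy transition and reward numbers are invented, and the paper's quantum extension replaces these classical objects:

```python
import numpy as np

def finite_horizon_dp(P, R, T):
    """Backward induction for a finite-horizon MDP.

    P[a][s, s'] : transition probability under action a
    R[a][s]     : expected one-step reward of action a in state s
    T           : horizon length
    Returns the value table V[t, s] and a greedy policy pi[t, s].
    """
    A, S = len(P), P[0].shape[0]
    V = np.zeros((T + 1, S))            # V[T] = 0 (no terminal reward)
    pi = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = np.stack([R[a] + P[a] @ V[t + 1] for a in range(A)])  # Q[a, s]
        V[t] = Q.max(axis=0)
        pi[t] = Q.argmax(axis=0)
    return V, pi

# Toy 2-state, 2-action example (all numbers illustrative)
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.6, 0.4]])]
R = [np.array([1.0, 0.0]), np.array([0.5, 0.8])]
V, pi = finite_horizon_dp(P, R, T=5)
print(V[0], pi[0])
```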
2. Solving Markov Decision Processes with Downside Risk Adjustment (Cited by 1)
Authors: Abhijit Gosavi, Anish Parulekar. International Journal of Automation and Computing (EI, CSCD), 2016, Issue 3, pp. 235-245 (11 pages)
Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risks into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above mean rewards is usually beneficial and should not be penalized/avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risks of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions where the risk is zero or near zero, but the average rewards are also rather low. Hence, in this paper, we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risks, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
Keywords: downside risk, Markov decision processes, reinforcement learning, dynamic programming, targets, thresholds
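The objective described above, expected reward minus a scalar times the downside risk, can be estimated by simulation. A minimal sketch, where the episode simulator, the candidate policies, and all numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def downside_risk_score(simulate_episode, policy, target, lam, n=10_000):
    """Monte Carlo estimate of E[G] - lam * P(G < target), the
    downside-risk-adjusted objective described in the abstract.
    simulate_episode(policy) must return one total reward G."""
    G = np.array([simulate_episode(policy) for _ in range(n)])
    return G.mean() - lam * (G < target).mean()

# Hypothetical 'episode': total reward ~ Normal whose mean depends on policy
simulate = lambda policy: rng.normal(loc=policy, scale=1.0)
for policy in (0.5, 1.0, 1.5):          # compare three candidate policies
    print(policy, downside_risk_score(simulate, policy, target=0.0, lam=2.0))
```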
3. Approximate Dynamic Programming for Stochastic Resource Allocation Problems (Cited by 4)
Authors: Ali Forootani, Raffaele Iervolino, Massimo Tipaldi, Joshua Neilson. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2020, Issue 4, pp. 975-990 (16 pages)
A stochastic resource allocation model, based on the principles of Markov decision processes (MDPs), is proposed in this paper. In particular, a general-purpose framework is developed, which takes into account resource requests for both instant and future needs. The considered framework can handle two types of reservations (i.e., specified and unspecified time-interval reservation requests), and implement an overbooking business strategy to further increase business revenues. The resulting dynamic pricing problems can be regarded as sequential decision-making problems under uncertainty, which are solved by means of stochastic dynamic programming (DP) based algorithms. In this regard, Bellman's backward principle of optimality is exploited in order to provide all the implementation mechanisms for the proposed reservation pricing algorithm. The curse of dimensionality, the inevitable issue of DP for both instant resource requests and future resource reservations, arises. In particular, an approximate dynamic programming (ADP) technique based on linear function approximations is applied to solve such scalability issues. Several examples are provided to show the effectiveness of the proposed approach.
Keywords: approximate dynamic programming (ADP), dynamic programming (DP), Markov decision processes (MDPs), resource allocation problem
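A minimal sketch of the linear-architecture idea behind such ADP methods: the value function is approximated as a weighted sum of features, and the weights are adjusted from sampled transitions (the feature vectors and learning rate here are illustrative, not from the paper):

```python
import numpy as np

def td_linear_update(w, phi_s, phi_next, reward, gamma=0.95, lr=0.01):
    """One stochastic-approximation step for a linear value model
    V(s) ~ w . phi(s), as in ADP with linear function approximation."""
    td_error = reward + gamma * w @ phi_next - w @ phi_s
    return w + lr * td_error * phi_s

w = np.zeros(3)
# Hypothetical transition: features of the current/next resource state, reward 1.0
w = td_linear_update(w, np.array([1.0, 0.2, 0.0]), np.array([0.0, 1.0, 0.5]), 1.0)
print(w)
```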
4. Performance Potential-based Neuro-dynamic Programming for SMDPs (Cited by 10)
Authors: TANG Hao, YUAN Ji-Bin, LU Yang, CHENG Wen-Juan. Acta Automatica Sinica (EI, CSCD, Peking University Core), 2005, Issue 4, pp. 642-645 (4 pages)
An alpha-uniformized Markov chain is defined via the concept of the equivalent infinitesimal generator for a semi-Markov decision process (SMDP) with both average and discounted criteria. According to the relations between their performance measures and performance potentials, the optimization of an SMDP can be realized by simulating the chain. For the critic model of neuro-dynamic programming (NDP), a neuro-policy iteration (NPI) algorithm is presented, and a performance error bound is derived in the presence of approximation error and improvement error at each iteration step. The obtained results may be extended to Markov systems and have broad applicability. Finally, a numerical example is provided.
Keywords: decision processes, SMDP, performance potential, neuro-dynamics, Markov, optimization design
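For orientation, here is a standard uniformization construction of this kind, written in generic notation that may differ from the paper's:

```latex
% alpha-uniformization sketch (generic notation; the paper's construction may differ)
\[
  P_\alpha \;=\; I + \frac{1}{\alpha}\,Q,
  \qquad \alpha \;\ge\; \sup_i |q_{ii}|,
\]
% where Q = (q_{ij}) is the (equivalent infinitesimal) generator. The discrete-time
% chain P_alpha has the same stationary behavior as the continuous-time chain with
% generator Q, so performance measures and potentials of the SMDP can be estimated
% by simulating P_alpha.
```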
5. Double-Factored Decision Theory for Markov Decision Processes with Multiple Scenarios of the Parameters
Author: Cheng-Jun Hou. Journal of the Operations Research Society of China, 2025, Issue 2, pp. 484-514 (31 pages)
The double-factored decision theory for Markov decision processes with multiple scenarios of the parameters is proposed in this article. We introduce the scenario belief to describe the probability distribution of scenarios in the system, and the scenario expectation to formulate the expected total discounted reward of a policy. We establish a new framework named the double-factored Markov decision process (DFMDP), in which the physical state and the scenario belief are shown to be the double factors serving as sufficient statistics for the history of the decision process. Four classes of policies for finite-horizon DFMDPs are studied, and it is shown that there exists a double-factored Markovian deterministic policy which is optimal among all policies. We also formulate infinite-horizon DFMDPs and present their optimality equation. An exact solution method named double-factored backward induction for finite-horizon DFMDPs is proposed. It is used to find optimal policies for the numerical examples and is compared with policies derived by other methods from the related literature.
Keywords: dynamic programming, Markov decision process, parameter uncertainty, multiple scenarios of the parameters, double-factored Markov decision process
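The scenario belief described above is a posterior over candidate parameter scenarios, so one natural update after each observed transition is Bayesian. A minimal sketch with two invented scenarios; the data layout `P_list[k][a]` is an assumption for illustration:

```python
import numpy as np

def update_scenario_belief(belief, P_list, s, a, s_next):
    """Bayesian update of the scenario belief after observing (s, a, s').
    P_list[k][a][s, s'] : transition matrix of scenario k (hypothetical layout)."""
    likelihood = np.array([P[a][s, s_next] for P in P_list])
    post = belief * likelihood
    return post / post.sum()

# Two hypothetical scenarios of a 2-state, 1-action MDP
P0 = [np.array([[0.9, 0.1], [0.1, 0.9]])]
P1 = [np.array([[0.5, 0.5], [0.5, 0.5]])]
b = np.array([0.5, 0.5])
b = update_scenario_belief(b, [P0, P1], s=0, a=0, s_next=0)
print(b)   # belief shifts toward scenario 0 after a 'sticky' transition
```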
6. Condition-based maintenance via Markov decision processes: A review
Authors: Xiujie ZHAO, Piao CHEN, Loon Ching TANG. Frontiers of Engineering Management, 2025, Issue 2, pp. 330-342 (13 pages)
The optimization of condition-based maintenance (CBM) poses challenges due to the rapid advancement of monitoring technologies. Traditional CBM research has mainly relied on theory-driven approaches, which have led to the development of several effective maintenance models characterized by wide applicability and attractiveness. However, when the system reliability model becomes complex, such methods may run into intractable cost models. The Markov decision process (MDP), a classic framework for sequential decision making, has drawn increasing attention for CBM optimization due to its appealing tractability and pragmatic applicability across different problems. This paper presents a review of research that optimizes CBM policies using MDPs, with a focus on mathematical modeling and optimization methods. We have organized the review around several key components that are subject to similar mathematical modeling constraints, including system complexity, the availability of system conditions, and the diverse criteria of decision-makers. Growing interest has led to the optimization of CBM for systems with increasing numbers of components and sensors. The review then turns to joint optimization problems involving CBM. Finally, as an important extension to traditional MDPs, reinforcement learning (RL) based methods are also reviewed as ways to optimize CBM policies. This paper provides significant background for researchers and practitioners working in reliability and maintenance management, and discusses possible future research directions.
Keywords: reliability, degradation modeling, dynamic programming, reinforcement learning, sequential decision problems
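As a concrete instance of the CBM-as-MDP modeling this review surveys, here is a toy value iteration over degradation states with "wait" and "replace" actions; all transition probabilities and costs are invented:

```python
import numpy as np

# Degradation states 0 (new) .. 3 (failed); actions: 0 = do nothing, 1 = replace.
P_wait = np.array([[0.7, 0.3, 0.0, 0.0],
                   [0.0, 0.7, 0.3, 0.0],
                   [0.0, 0.0, 0.7, 0.3],
                   [0.0, 0.0, 0.0, 1.0]])
P_repl = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))   # replacement renews the system
cost = {0: np.array([0.0, 0.0, 0.0, 50.0]),      # failure cost in state 3
        1: np.full(4, 10.0)}                     # fixed replacement cost
gamma, V = 0.95, np.zeros(4)
for _ in range(500):                              # value iteration
    V = np.minimum(cost[0] + gamma * P_wait @ V, cost[1] + gamma * P_repl @ V)
policy = np.argmin([cost[0] + gamma * P_wait @ V,
                    cost[1] + gamma * P_repl @ V], axis=0)
print(V, policy)   # typically: replace once degradation passes a threshold
```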
7. AquaTree: Deep Reinforcement Learning-Driven Monte Carlo Tree Search for Underwater Image Enhancement
Authors: Chao Li, Jianing Wang, Caichang Ding, Zhiwei Ye. Computers, Materials & Continua, 2026, Issue 3, pp. 1444-1464 (21 pages)
Underwater images frequently suffer from chromatic distortion, blurred details, and low contrast, posing significant challenges for enhancement. This paper introduces AquaTree, a novel underwater image enhancement (UIE) method that reformulates the task as a Markov decision process (MDP) through the integration of Monte Carlo tree search (MCTS) and deep reinforcement learning (DRL). The framework employs an action space of 25 enhancement operators, strategically grouped for basic attribute adjustment, color component balance, correction, and deblurring. Exploration within MCTS is guided by a dual-branch convolutional network, enabling intelligent sequential operator selection. Our core contributions include: (1) a multimodal state representation combining CIELab color histograms with deep perceptual features, (2) a dual-objective reward mechanism optimizing chromatic fidelity and perceptual consistency, and (3) an alternating training strategy co-optimizing enhancement sequences and network parameters. We further propose two inference schemes: an MCTS-based approach prioritizing accuracy at higher computational cost, and an efficient network policy enabling real-time processing with minimal quality loss. Comprehensive evaluations on the UIEB dataset, together with color correction and haze removal comparisons on the U45 dataset, demonstrate AquaTree's superiority, significantly outperforming nine state-of-the-art methods across five established underwater image quality metrics.
Keywords: underwater image enhancement (UIE), Monte Carlo tree search (MCTS), deep reinforcement learning (DRL), Markov decision process (MDP)
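To illustrate how a policy network can guide MCTS exploration over enhancement operators, here is a generic AlphaZero-style PUCT selection rule in Python; AquaTree's exact exploration formula may differ, and the operator labels are invented:

```python
import math

def puct_select(children, c_puct=1.5):
    """Select a child by the PUCT rule used when a policy network guides MCTS.
    Each child: dict with visit count N, total value W, and network prior P."""
    total_N = sum(ch["N"] for ch in children) + 1
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] else 0.0          # exploitation term
        u = c_puct * ch["P"] * math.sqrt(total_N) / (1 + ch["N"])  # exploration
        return q + u
    return max(children, key=score)

children = [{"N": 3, "W": 2.1, "P": 0.5},   # e.g., a white-balance operator
            {"N": 0, "W": 0.0, "P": 0.3},   # e.g., a deblurring operator
            {"N": 5, "W": 2.0, "P": 0.2}]
print(puct_select(children))
```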
8. Singularly Perturbed Markov Decision Processes with Inclusion of Transient States (Cited by 1)
Authors: R.H. Liu, Q. Zhang, G. Yin. Journal of Systems Science & Complexity (SCIE, EI, CSCD), 2001, Issue 2, pp. 199-211 (13 pages)
This paper is concerned with continuous-time Markov decision processes (MDPs) having weak and strong interactions. Using a hierarchical approach, the state space of the underlying Markov chain can be decomposed into several groups of recurrent states and a group of transient states, resulting in a singularly perturbed MDP formulation. Instead of solving the original problem directly, a limit problem that is much simpler to handle is derived. On the basis of the optimal solution of the limit problem, nearly optimal decisions are constructed for the original problem. The asymptotic optimality of the constructed control is obtained, and the rate of convergence is ascertained.
Keywords: Markov decision process, dynamic programming, asymptotically optimal control
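The weak/strong interaction structure is commonly written as a two-time-scale generator; a sketch in generic notation (the paper's precise formulation may differ):

```latex
% Two-time-scale generator sketch (generic notation)
\[
  Q^{\varepsilon} \;=\; \frac{1}{\varepsilon}\,\widetilde{Q} \;+\; \widehat{Q},
\]
% where \widetilde{Q} governs the fast (strong) interactions within each group of
% recurrent states and \widehat{Q} the slow (weak) interactions among groups.
% As \varepsilon -> 0, the limit problem aggregates each recurrent group into a
% single state; its optimal policy is nearly optimal for Q^{\varepsilon}, with an
% error that vanishes at a quantifiable rate.
```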
9. Deep Reinforcement Learning for Zero-Shot Coverage Path Planning With Mobile Robots
Authors: José Pedro Carvalho, A. Pedro Aguiar. IEEE/CAA Journal of Automatica Sinica, 2025, Issue 8, pp. 1594-1609 (16 pages)
The ability of mobile robots to plan and execute a path is foundational to various path-planning challenges, particularly coverage path planning. While this task has typically been tackled with classical algorithms, these often struggle with flexibility and adaptability in unknown environments. On the other hand, recent advances in reinforcement learning offer promising approaches, yet a significant gap remains in the literature when it comes to generalization over a large number of parameters. This paper presents a unified, generalized framework for coverage path planning that leverages value-based deep reinforcement learning techniques. The novelty of the framework comes from the design of an observation space that accommodates different map sizes, an action masking scheme that guarantees safety and robustness while also serving as a learning-from-demonstration technique during training, and a unique reward function that yields size-invariant value functions. These are coupled with a curriculum learning-based training strategy and parametric environment randomization, enabling the agent to tackle complete or partial coverage path planning with perfect or incomplete knowledge while generalizing to different map sizes, configurations, sensor payloads, and sub-tasks. Our empirical results show that the algorithm performs at a near-optimal level in zero-shot scenarios in environments that follow a distribution similar to that seen during training, outperforming a greedy heuristic sixfold. Furthermore, in out-of-distribution environments, our method surpasses existing state-of-the-art algorithms in most zero-shot and all few-shot scenarios, paving the way for generalizable and adaptable path-planning algorithms.
Keywords: autonomous robots, coverage path planning, deep reinforcement learning, mobile robot, partially observable Markov decision processes, path planning, zero-shot generalization
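The action masking scheme mentioned above is commonly implemented in value-based DRL by assigning invalid actions a Q-value of minus infinity before the argmax; a minimal PyTorch sketch, with mask semantics invented for illustration:

```python
import torch

def masked_greedy_action(q_values, valid_mask):
    """Action masking for value-based DRL: invalid actions get -inf so they
    can never be selected, one common way to enforce safety constraints."""
    masked_q = q_values.masked_fill(~valid_mask, float("-inf"))
    return masked_q.argmax(dim=-1)

q = torch.tensor([[0.3, 1.2, -0.4, 0.9]])
mask = torch.tensor([[True, False, True, True]])   # action 1 would hit an obstacle
print(masked_greedy_action(q, mask))               # -> tensor([3])
```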
10. Bearings-Only Target Motion Analysis via Deep Reinforcement Learning
Authors: Chengyi Zhou, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong. IEEE/CAA Journal of Automatica Sinica, 2025, Issue 6, pp. 1298-1300 (3 pages)
Dear Editor, This letter introduces a novel approach to the bearings-only target motion analysis (BO-TMA) problem by incorporating deep reinforcement learning (DRL) techniques. Conventional methods often exhibit biases and struggle to achieve accurate results, especially when confronted with high levels of noise. In this letter, we formulate the BO-TMA problem as a Markov decision process (MDP) and solve it within a DRL framework. Simulation results demonstrate that the proposed DRL-based estimator achieves reduced bias and lower errors compared to existing estimators.
Keywords: deep reinforcement learning, estimator, Markov decision process (MDP), errors, bias, bearings-only target motion analysis
11. Intelligent Scheduling of Virtual Power Plants Based on Deep Reinforcement Learning
Authors: Shaowei He, Wenchao Cui, Gang Li, Hairun Xu, Xiang Chen, Yu Tai. Computers, Materials & Continua, 2025, Issue 7, pp. 861-886 (26 pages)
The virtual power plant (VPP), as an innovative power management architecture, achieves flexible dispatch and resource optimization of power systems by integrating distributed energy resources. However, due to significant differences in the operational costs and flexibility of various types of generation resources, as well as the volatility and uncertainty of renewable energy sources (such as wind and solar power) and the complex variability of load demand, the scheduling optimization of virtual power plants has become a critical issue to be addressed. To solve this, this paper proposes an intelligent scheduling method for virtual power plants based on deep reinforcement learning (DRL), utilizing deep Q-networks (DQN) for real-time optimized scheduling of dynamic peaking units (DPU) and stable baseload units (SBU) in the virtual power plant. By modeling the scheduling problem as a Markov decision process (MDP) and designing an optimization objective function that integrates both performance and cost, the scheduling efficiency and economic performance of the virtual power plant are significantly improved. Simulation results show that, compared with traditional scheduling methods and other deep reinforcement learning algorithms, the proposed method demonstrates significant advantages in key performance indicators: response time is shortened by up to 34%, task success rate is increased by up to 46%, and costs are reduced by approximately 26%. Experimental results verify the efficiency and scalability of the method under complex load environments and renewable energy volatility, providing strong technical support for the intelligent scheduling of virtual power plants.
Keywords: deep reinforcement learning, deep Q-network, virtual power plant, intelligent scheduling, Markov decision process
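For reference, here is a minimal DQN target computation with a frozen target network, the core update such a scheduler trains on; the network sizes and the 4-state/3-action dimensions are placeholders, not the paper's design:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, n_state, n_action):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(),
                               nn.Linear(64, n_action))
    def forward(self, x):
        return self.f(x)

q, q_target = QNet(4, 3), QNet(4, 3)
q_target.load_state_dict(q.state_dict())           # frozen target network
s, a = torch.randn(32, 4), torch.randint(0, 3, (32, 1))
r, s2, gamma = torch.randn(32, 1), torch.randn(32, 4), 0.99
with torch.no_grad():
    y = r + gamma * q_target(s2).max(dim=1, keepdim=True).values  # TD target
loss = nn.functional.mse_loss(q(s).gather(1, a), y)
loss.backward()                                    # one gradient step of DQN
```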
12. Deep Reinforcement Learning-based Multi-Objective Scheduling for Distributed Heterogeneous Hybrid Flow Shops with Blocking Constraints
Authors: Xueyan Sun, Weiming Shen, Jiaxin Fan, Birgit Vogel-Heuser, Fandi Bi, Chunjiang Zhang. Engineering, 2025, Issue 3, pp. 278-291 (14 pages)
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) designed to minimize total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies significantly improve on the basic PPO, and that the proposed IPPO outperforms state-of-the-art scheduling methods in both convergence and solution quality.
Keywords: multi-objective Markov decision process, multi-agent deep reinforcement learning, proximal policy optimization, distributed hybrid flow-shop scheduling, blocking constraints
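The vector reward with dynamic weights can be reduced to a scalar return per decision step; a minimal sketch where the weight-sampling scheme (Dirichlet per episode) and the cost numbers are assumptions for illustration:

```python
import numpy as np

def scalarize(reward_vec, weights):
    """Scalarize a vector reward (e.g., tardiness, energy) with dynamic,
    normalized weights, as in the multi-objective MDP described above."""
    w = np.asarray(weights, dtype=float)
    return float(np.asarray(reward_vec) @ (w / w.sum()))

# Weights can be re-drawn per episode so the policy covers varied trade-offs
rng = np.random.default_rng(1)
for _ in range(3):
    w = rng.dirichlet([1.0, 1.0])           # random trade-off between objectives
    print(w, scalarize([-5.0, -2.0], w))    # negative tardiness / energy costs
```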
13. A Green Mobile Edge Computing Task Offloading Strategy Based on MDP and Q-learning (Cited by 1)
Authors: 赵宏伟, 吕盛凱, 庞芷茜, 马子涵, 李雨. Journal of Henan Polytechnic University (Natural Science Edition) (Peking University Core), 2025, Issue 5, pp. 9-16 (8 pages)
Objective: To achieve carbon neutrality in manufacturing-oriented industrial internet enterprises such as automobile and air-conditioner makers, edge computing task offloading is applied to the task offloading of production equipment, so as to reduce the central load on servers and cut the energy consumption and carbon emissions of data centers. Methods: A green edge computing task offloading strategy based on the Markov decision process (MDP) and Q-learning is proposed. The strategy takes into account constraints such as computation frequency, transmission power, and carbon emissions. Based on a cloud-edge-end collaborative computing model, the carbon emission optimization problem is transformed into a mixed-integer linear programming model, which is solved via MDP and Q-learning; the convergence performance, carbon emissions, and total latency are compared against a random allocation algorithm, the Q-learning algorithm, and the SARSA (state-action-reward-state-action) algorithm. Results: Compared with existing offloading strategies, the task scheduling of the new strategy converges 5% and 2% faster than SARSA and Q-learning, respectively; the system's carbon emission cost is reduced by 8% and 22% relative to Q-learning and SARSA, respectively; in terms of the number of terminals, the new strategy uses 6% and 7% fewer terminals than Q-learning and SARSA, respectively; and the total system computation latency is clearly lower than that of the other algorithms, reduced by 27%, 14%, and 22% relative to random allocation, Q-learning, and SARSA, respectively. Conclusion: The strategy can reasonably optimize the offloading of computation tasks and resource allocation, balance latency against energy consumption, and reduce system carbon emissions.
Keywords: carbon emissions, edge computing, reinforcement learning, Markov decision process, task offloading
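The core of such a strategy is the tabular Q-learning update over offloading decisions; a minimal sketch in which the state/action sizes and the reward (latency, energy, and carbon penalty folded into one number) are invented:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update used by this kind of offloading strategy."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

# Hypothetical toy setting: 4 system states, 2 actions (local vs. offload)
Q = np.zeros((4, 2))
Q = q_learning_step(Q, s=0, a=1, r=-1.3, s_next=2)
print(Q)
```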
14. Reinforcement Learning-Based Joint Task Offloading and Migration Schemes Optimization in Mobility-Aware MEC Network (Cited by 12)
Authors: Dongyu Wang, Xinqiao Tian, Haoran Cui, Zhaolin Liu. China Communications (SCIE, CSCD), 2020, Issue 8, pp. 31-44 (14 pages)
Intelligent edge computing uses edge devices of the Internet of Things (IoT) for data collection, calculation, and intelligent analysis, so that data can be analyzed nearby and feedback given in a timely manner. Because of the mobility of mobile equipment (MEs), if MEs move beyond the reach of the small cell networks (SCNs), the offloaded tasks cannot be returned to the MEs successfully. As a result, migration incurs additional costs. In this paper, joint task offloading and migration schemes based on reinforcement learning (RL) in a mobility-aware mobile edge computing (MEC) network are proposed to obtain the maximum system revenue. Firstly, the joint optimization problem of maximizing the total revenue of MEs is put forward, in view of mobility-aware MEs. Secondly, considering time-varying computation tasks and resource conditions, the mixed-integer non-linear programming (MINLP) problem is described as a Markov decision process (MDP). Then we propose a novel reinforcement learning-based optimization framework to work out the problem, instead of traditional methods. Finally, simulation results show that the proposed schemes can significantly raise the total revenue of MEs.
Keywords: MEC, computation offloading, mobility-aware, migration scheme, Markov decision process, reinforcement learning
15. Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations (Cited by 15)
Author: Dimitri P. Bertsekas. IEEE/CAA Journal of Automatica Sinica (EI, CSCD), 2019, Issue 1, pp. 1-31 (31 pages)
In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
Keywords: reinforcement learning, dynamic programming, Markovian decision problems, aggregation, feature-based architectures, policy iteration, deep neural networks, rollout algorithms
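A minimal sketch of hard aggregation, the simplest member of the family surveyed here: states sharing a feature value form one aggregate state, the small aggregate problem is solved, and its values are mapped back to the original states (all numbers invented):

```python
import numpy as np

def aggregate_values(V_aggregate, feature_of_state):
    """V(s) is approximated by the value of s's aggregate (feature) class."""
    return np.array([V_aggregate[feature_of_state[s]]
                     for s in range(len(feature_of_state))])

feature_of_state = [0, 0, 1, 1, 2]      # 5 original states -> 3 feature classes
V_agg = np.array([1.0, 0.4, -0.2])      # solved on the small aggregate MDP
print(aggregate_values(V_agg, feature_of_state))
```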
16. A Heterogeneous Information Fusion Deep Reinforcement Learning for Intelligent Frequency Selection of HF Communication (Cited by 7)
Authors: Xin Liu, Yuhua Xu, Yunpeng Cheng, Yangyang Li, Lei Zhao, Xiaobo Zhang. China Communications (SCIE, CSCD), 2018, Issue 9, pp. 73-84 (12 pages)
High-frequency (HF) communication is one of the essential communication methods for military and emergency applications. However, the selection of the communication frequency channel is always a difficult problem due to the crowded spectrum, the time-varying channels, and malicious intelligent jamming. The existing frequency hopping, automatic link establishment, and some new anti-jamming technologies cannot completely solve the above problems. In this article, we adopt deep reinforcement learning to solve this intractable challenge. First, the combination of the spectrum state and the channel gain state is defined as the complex environmental state, and the Markov property of the defined state is analyzed and proved. Then, considering that the spectrum state and channel gain state are heterogeneous information, a new deep Q-network (DQN) framework is designed, which contains multiple sub-networks to process different kinds of information. Finally, aiming to improve the learning speed and efficiency, the optimization targets of the corresponding sub-networks are reasonably designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is designed for the specific frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction, jamming avoidance, and frequency channel selection.
Keywords: HF communication, anti-jamming, intelligent frequency selection, Markov decision process, deep reinforcement learning
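A generic sketch of the multi-sub-network idea in PyTorch: one branch per kind of information, fused before the Q-value head; the layer sizes and input dimensions are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoBranchDQN(nn.Module):
    """Separate sub-networks for heterogeneous inputs (spectrum state vs.
    channel-gain state), fused before the Q-value head."""
    def __init__(self, n_spec, n_gain, n_channels):
        super().__init__()
        self.spec_branch = nn.Sequential(nn.Linear(n_spec, 32), nn.ReLU())
        self.gain_branch = nn.Sequential(nn.Linear(n_gain, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, n_channels)
    def forward(self, spec, gain):
        z = torch.cat([self.spec_branch(spec), self.gain_branch(gain)], dim=-1)
        return self.head(z)                     # Q-value per candidate channel

net = TwoBranchDQN(n_spec=64, n_gain=8, n_channels=8)
q = net(torch.randn(1, 64), torch.randn(1, 8))
print(q.argmax(dim=-1))                         # greedy frequency selection
```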
17. A guidance method for coplanar orbital interception based on reinforcement learning (Cited by 7)
Authors: ZENG Xin, ZHU Yanwei, YANG Leping, ZHANG Chengming. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2021, Issue 4, pp. 927-938 (12 pages)
This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model, and then a well-designed RL algorithm, experience-based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through the optimal control model, the proposed algorithm not only resolves the convergence problem of common RL algorithms, but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves efficiency over traditional optimization methods by roughly two orders of magnitude.
Keywords: orbital interception, reinforcement learning (RL), Markov decision process (MDP), deep neural network (DNN)
18. Airport gate assignment problem with deep reinforcement learning (Cited by 3)
Authors: Zhao Jiaming, Wu Wenjun, Liu Zhiming, Han Changhao, Zhang Xuanyi, Zhang Yanhua. High Technology Letters (EI, CAS), 2020, Issue 1, pp. 102-107 (6 pages)
With the rapid development of air transportation in recent years, airport operations have attracted a lot of attention. Among them, the airport gate assignment problem (AGAP) has become a research hotspot. However, real-time AGAP algorithms remain an open issue. In this study, a deep reinforcement learning based AGAP (DRL-AGAP) is proposed. The optimization objective is to maximize the rate of flights assigned to fixed gates. The real-time AGAP is modeled as a Markov decision process (MDP), with the state space, action space, values, and rewards defined. The DRL-AGAP algorithm is evaluated via simulation and compared with the flight pre-assignment results of the optimization software Gurobi and a greedy algorithm. Simulation results show that the performance of the proposed DRL-AGAP algorithm is close to that of the pre-assignment obtained by the Gurobi optimization solver. Meanwhile, real-time assignment capability is ensured by the proposed DRL-AGAP algorithm due to its dynamic modeling and lower complexity.
Keywords: airport gate assignment problem (AGAP), deep reinforcement learning (DRL), Markov decision process (MDP)
19. Multi-task Coalition Parallel Formation Strategy Based on Reinforcement Learning (Cited by 6)
Authors: JIANG Jian-Guo, SU Zhao-Pin, QI Mei-Bin, ZHANG Guo-Fu. Acta Automatica Sinica (EI, CSCD, Peking University Core), 2008, Issue 3, pp. 349-352 (4 pages)
Agent coalition is an important manner of agent coordination and cooperation. By forming a coalition, agents can enhance their ability to solve problems and obtain more utility. In this paper, a novel multi-task coalition parallel formation strategy is presented, and it is theoretically proven that the process of multi-task coalition formation is a Markov decision process. Moreover, reinforcement learning is used to solve the agents' behavior strategy in parallel multi-task coalition formation, and the formation process is described. In multi-task-oriented domains, the strategy can form multi-task coalitions effectively and in parallel.
Keywords: reinforcement learning, multi-task coalition, parallel formation, Markov decision process
20. Recorded recurrent deep reinforcement learning guidance laws for intercepting endoatmospheric maneuvering missiles (Cited by 1)
Authors: Xiaoqi Qiu, Peng Lai, Changsheng Gao, Wuxing Jing. Defence Technology (SCIE, EI, CAS, CSCD), 2024, Issue 1, pp. 457-470 (14 pages)
This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to address the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles under uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent's policy network to alleviate the bottleneck of traditional deep reinforcement learning methods in dealing with POMDPs. The measurements from the interceptor's seeker during each guidance cycle are combined into one sequence as the input to the policy network, since the detection frequency of an interceptor is usually higher than its guidance frequency. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partial observability that this RNN layer introduces inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability. The test results confirm the advantages of the RRTD3-based guidance laws over some conventional guidance laws.
Keywords: endoatmospheric interception, missile guidance, reinforcement learning, Markov decision process, recurrent neural networks
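A minimal sketch of a recurrent actor whose GRU hidden state is returned to the caller so it can be recorded alongside the transition, in the spirit of the "recorded recurrent" mechanism; the dimensions and the tanh action squashing are assumptions:

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Policy with a GRU layer for POMDPs; the hidden state returned here is
    what gets stored in the replay buffer so off-policy training can restore
    it later, instead of re-unrolling from the episode start."""
    def __init__(self, obs_dim, act_dim, hid=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, act_dim)
    def forward(self, obs_seq, h0):
        y, h1 = self.gru(obs_seq, h0)           # obs_seq: (B, T, obs_dim)
        return torch.tanh(self.out(y[:, -1])), h1

actor = RecurrentActor(obs_dim=6, act_dim=2)
h = torch.zeros(1, 1, 64)                       # hidden state at episode start
seq = torch.randn(1, 5, 6)                      # one guidance cycle of seeker data
action, h = actor(seq, h)                       # record (seq, action, h) in replay
print(action)
```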