Optimal policies in Markov decision problems may be quite sensitive to transition probabilities. In practice, some transition probabilities may be uncertain. The goals of the present study are to find the robust range for a given optimal policy and to obtain value intervals of the exact transition probabilities. Our research contributes to the analysis of Markov decision processes (MDPs) with uncertain transition probabilities. We first propose a method for estimating unknown transition probabilities based on maximum likelihood. Since the estimates may be far from accurate, and the highest expected total reward of the MDP may be sensitive to these transition probabilities, we analyze the robustness of an optimal policy and propose an approach for robustness analysis. After defining a robust optimal policy with uncertain transition probabilities represented as sets of numbers, we formulate a model to obtain the optimal policy. Finally, we define the value intervals of the exact transition probabilities and construct models to determine their lower and upper bounds. Numerical examples are given to show the practicability of our methods.
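The maximum-likelihood step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that transitions are observed as (state, action, next state) triples, in which case the maximum-likelihood estimate of each transition probability is the relative frequency of the observed next state. The function name and data layout are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def estimate_transitions(trajectories):
    """Estimate P(next_state | state, action) by relative frequencies.

    `trajectories` is a list of lists of (state, action, next_state) triples.
    This layout is an assumption made for illustration only.
    """
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            counts[(state, action)][next_state] += 1

    probabilities = {}
    for state_action, counter in counts.items():
        total = sum(counter.values())
        # The multinomial maximum-likelihood estimate is count / total.
        probabilities[state_action] = {
            next_state: n / total for next_state, n in counter.items()
        }
    return probabilities


if __name__ == "__main__":
    observed = [[(0, "a", 1), (0, "a", 1), (0, "a", 0), (1, "b", 0)]]
    print(estimate_transitions(observed))
```

With few observations per state-action pair these estimates can be far from the true values, which is the motivation the abstract gives for the robustness analysis.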
A Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of the MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
Markov decision processes (MDPs) and their variants are widely studied in the theory of control for stochastic discrete-event systems driven by Markov chains. Much of the literature focusses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risk into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean reward is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risk can lead to poor solutions in which the risk is zero or near zero but the average reward is also rather low. In this paper, we therefore study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risk outright, we reduce it only at the cost of a slightly lowered average reward. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
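The objective described in this abstract (expected reward minus a scalar times the downside risk) can be estimated from sampled rewards. The sketch below is only an empirical estimator under the assumption that a list of observed rewards is available. The names `target` and `theta` are hypothetical labels for the pre-specified target and the scalar weight, and this is not the paper's dynamic programming or reinforcement learning algorithm.

```python
def risk_adjusted_score(rewards, target, theta):
    """Sample estimate of E[reward] - theta * P(reward < target)."""
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # Downside risk: the fraction of samples falling below the target.
    downside_risk = sum(1 for r in rewards if r < target) / n
    return mean_reward - theta * downside_risk


if __name__ == "__main__":
    samples = [10, 20, 30, 40]
    print(risk_adjusted_score(samples, target=25, theta=4))
```

A larger `theta` puts more weight on avoiding shortfalls below the target, while `theta = 0` recovers the risk-neutral average reward.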
This paper studies the limit average variance criterion for continuous-time Markov decision processes in Polish spaces. Based on two approaches, this paper proves not only the existence of solutions to the variance minimization optimality equation and the existence of a variance minimal policy that is canonical, but also the existence of solutions to the two variance minimization optimality inequalities and the existence of a variance minimal policy which may not be canonical. An example is given to illustrate all of our conditions.
In recent years, ride-on-demand (RoD) services such as Uber and Didi have become increasingly popular. Unlike traditional taxi services, RoD services adopt dynamic pricing mechanisms to manipulate supply and demand on the road, and such mechanisms improve service capacity and quality. Seeking-route recommendation has been widely studied for taxi services. In RoD services, the dynamic price is a new and accurate indicator of the supply and demand condition, but it has rarely been studied as a clue for drivers seeking passengers. In this paper, we propose to incorporate the impact of dynamic prices as a key factor in recommending seeking routes to drivers. We first show the importance of and need for doing so by analyzing real service data. We then design a Markov decision process (MDP) model based on passenger order and car GPS trajectory datasets, and take dynamic prices into account in designing rewards. Results show that our model not only guides drivers to locations with higher prices, but also significantly improves driver revenue. Compared with the drivers' situation before using the model, the maximum increase in yield after using it can reach 28%.
Decision-making is the process of choosing between two or more options in order to take the most appropriate and successful course of action, here with the aim of achieving sustainable mangrove management. However, the distinctiveness of mangroves as an ecosystem, and the attendant socio-economic and governance ramifications, make decision making in this setting relatively distinct from other decision-making processes. The purpose of this research was therefore to evaluate the role that community engagement plays in the decision-making process as it relates to the establishment of governance norms for sustainable mangrove management in Lamu County. A correlational research design was applied, and the researchers employed a mixed-methods approach. The target population was 296 respondents. The research used questionnaires and interviews to collect data. Descriptive statistical techniques were used to analyze the data that were gathered. The findings indicated that awareness of governance standards is beneficial during the decision-making process. In addition, the findings showed that respondents had the impression that the decision-making process was not conducted properly. On the other hand, the participants pointed out positive aspects of the decision-making process and agreed that the participation of both genders was essential for the sustainable management of mangroves. Based on these data, it appears that full community engagement in decision-making is necessary for sustainable management of mangrove forests.
The design process of the built environment relies on the collaborative effort of all parties involved in the project. During the design phase, owners, end users, and their representatives are expected to make the most critical design and budgetary decisions, shaping the essential traits of the project; hence the need to create and integrate mechanisms that support the decision-making process. Design decisions should not be based on assumptions, past experiences, or imagination. One example of the numerous problems that result from uninformed design decisions is "change orders", that is, deviations from the original scope of work, which lead to an increase in the overall cost and to changes in the construction schedule of the project. The long-term aim of this inquiry is to understand the user's behavior and to establish evidence-based control measures, which are actions and processes that can be implemented in practice to decrease the volume and frequency of change orders. The current study developed a foundation for further examination by proposing potential control measures, such as integrating Virtual Reality (VR), and testing their efficiency. The specific aim was to examine the effect of different visualization methods (i.e., VR vs. construction drawings) on: (1) how well the subjects understand the information presented about the future/planned environment; (2) the subjects' perceived confidence in what the future environment will look like; (3) the likelihood of changing the built environment; (4) design review time; and (5) accuracy in reviewing and understanding the design.
Cooperative multi-UAV search requires jointly optimizing wide-area coverage, rapid target discovery, and endurance under sensing and motion constraints. Resolving this coupling enables scalable coordination with high data efficiency and mission reliability. We formulate this problem as a discounted Markov decision process on an occupancy grid with a cellwise Bayesian belief update, yielding a Markov state that couples agent poses with a probabilistic target field. On this belief–MDP we introduce a segment-conditioned latent-intent framework, in which a discrete intent head selects a latent skill every K steps and an intra-segment GRU policy generates per-step control conditioned on the fixed intent; both components are trained end-to-end with proximal updates under a centralized critic. On the 50×50 grid, coverage and discovery convergence times are reduced by up to 48% and 40% relative to a flat actor-critic benchmark, and the aggregated convergence metric improves by about 12% compared with a state-of-the-art hierarchical method. Qualitative analyses further reveal stable spatial sectorization, low path overlap, and fuel-aware patrolling, indicating that segment-conditioned latent intents provide an effective and scalable mechanism for coordinated multi-UAV search.
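A cellwise Bayesian belief update of the kind named in this abstract can be sketched for a single grid cell. The sketch assumes a binary detector with a known detection probability and false-alarm probability. These two parameters and the function name are hypothetical, since the abstract does not specify the sensor model.

```python
def update_cell_belief(prior, detected, p_detect=0.9, p_false_alarm=0.1):
    """Bayes update of the probability that a target occupies one cell.

    `p_detect` is P(detection | target present) and `p_false_alarm` is
    P(detection | target absent); both are assumed values for illustration.
    """
    if detected:
        likelihood_present, likelihood_absent = p_detect, p_false_alarm
    else:
        likelihood_present, likelihood_absent = 1 - p_detect, 1 - p_false_alarm
    numerator = likelihood_present * prior
    evidence = numerator + likelihood_absent * (1 - prior)
    return numerator / evidence


if __name__ == "__main__":
    belief = 0.5
    for observation in (True, True, False):
        belief = update_cell_belief(belief, observation)
        print(round(belief, 4))
```

Applying this update independently to every cell an agent observes gives a probabilistic target field that can be combined with agent poses to form the state.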
Underwater images frequently suffer from chromatic distortion, blurred details, and low contrast, posing significant challenges for enhancement. This paper introduces AquaTree, a novel underwater image enhancement (UIE) method that reformulates the task as a Markov decision process (MDP) through the integration of Monte Carlo Tree Search (MCTS) and deep reinforcement learning (DRL). The framework employs an action space of 25 enhancement operators, grouped for basic attribute adjustment, color component balance, correction, and deblurring. Exploration within MCTS is guided by a dual-branch convolutional network, enabling intelligent sequential operator selection. Our core contributions include: (1) a multimodal state representation combining CIELab color histograms with deep perceptual features, (2) a dual-objective reward mechanism optimizing chromatic fidelity and perceptual consistency, and (3) an alternating training strategy co-optimizing enhancement sequences and network parameters. We further propose two inference schemes: an MCTS-based approach prioritizing accuracy at higher computational cost, and an efficient network policy enabling real-time processing with minimal quality loss. Comprehensive evaluations on the UIEB dataset, together with color-correction and haze-removal comparisons on the U45 dataset, demonstrate AquaTree’s superiority: it significantly outperforms nine state-of-the-art methods across five established underwater image quality metrics.
A network selection optimization algorithm based on the Markov decision process (MDP) is proposed so that mobile terminals can always connect to the best wireless network in a heterogeneous network environment. Considering the different types of service requirements, the MDP model and its reward function are constructed based on the quality of service (QoS) attribute parameters of the mobile users, and the network attribute weights are calculated using the analytic hierarchy process (AHP). The network handoff decision condition is designed according to the different types of user services and the time-varying characteristics of the network, and the MDP model is solved using a combined genetic algorithm and simulated annealing (GA-SA) method, so that users can seamlessly switch to the network with the best long-term expected reward value. Simulation results show that the proposed algorithm has good convergence performance and can guarantee that users with different service types obtain satisfactory expected total reward values with a low number of network handoffs.
Alunite is the most important non-bauxite resource for alumina. Various methods have been proposed and patented for processing alunite, but none has been applied at industrial scale, and no technical, operational, or economic data are available to evaluate the methods. In addition, selecting the right approach for alunite beneficiation requires introducing a wide range of criteria and careful analysis of alternatives. In this research, after studying the existing processes, 13 methods were considered and evaluated against 14 technical, economic, and environmental criteria. Owing to the multiplicity of processing methods and attributes, Multi-Attribute Decision Making methods were employed to examine the appropriateness of the choices. The Delphi Analytical Hierarchy Process (DAHP) was used for weighting the selection criteria, and the Fuzzy TOPSIS approach was used to determine the most profitable candidates. Among the 13 methods studied, the Spanish, Svoronos, and Hazan methods were recognized as the best choices, in that order.
Decisions in practice are often hierarchical because of the hierarchical structure of organizations. In this paper, we propose a two-level hierarchical Markov decision model that considers the interactions of agents at different levels and the different time scales of the levels. A backward induction algorithm is given for the model to find the optimal policy of the finite-stage hierarchical decision problem. The proposed model and its algorithm are illustrated with an example of a two-level hierarchical decision problem in infrastructure maintenance. The optimal policy for the example is computed, and the impact of interactions between levels on decision making is analyzed.
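Backward induction, the solution technique named in this abstract, can be sketched for an ordinary single-level finite-horizon MDP. The two-level hierarchical model itself is not specified in the abstract, so the sketch below covers only the standard case. The maintenance-style states, actions, probabilities, and rewards are hypothetical illustration values.

```python
def backward_induction(states, actions, P, R, horizon):
    """Solve a finite-horizon MDP by backward induction.

    P[s][a] maps next states to probabilities and R[s][a] is the immediate
    reward. Returns the stage-0 values and one decision rule per stage.
    """
    values = {s: 0.0 for s in states}  # terminal values assumed to be zero
    policy = []
    for _ in range(horizon):
        new_values, decision_rule = {}, {}
        for s in states:
            best_action, best_value = None, float("-inf")
            for a in actions:
                expected = sum(p * values[s2] for s2, p in P[s][a].items())
                q = R[s][a] + expected
                if q > best_value:
                    best_action, best_value = a, q
            new_values[s], decision_rule[s] = best_value, best_action
        values = new_values
        policy.insert(0, decision_rule)  # earlier stages go to the front
    return values, policy


if __name__ == "__main__":
    states, actions = ["good", "bad"], ["wait", "repair"]
    P = {
        "good": {"wait": {"good": 0.8, "bad": 0.2}, "repair": {"good": 1.0}},
        "bad": {"wait": {"bad": 1.0}, "repair": {"good": 1.0}},
    }
    R = {
        "good": {"wait": 10.0, "repair": 5.0},
        "bad": {"wait": 0.0, "repair": -5.0},
    }
    values, policy = backward_induction(states, actions, P, R, horizon=2)
    print(values)
    print(policy)
```

In this toy instance the rule differs by stage: repairing a bad asset pays off only while there is a remaining stage in which to collect the benefit.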
An AI-aided simulation system embedded in a model-based, aspiration-led decision support system, NY-IEDSS, is reported. NY-IEDSS is designed for the mid-term strategic development study of the Nanyang Region in Henan, China, and is moving beyond its prototype stage under the guidance of the decision maker (the end user). The integration of the simulation model system, decision analysis, and an expert system for decision support in the system implementation is reviewed. The intent of the paper is to provide insight into how system capability and acceptability can be enhanced by this integration. Moreover, emphasis is placed on problem orientation in applying the method.
Dear Editor, This letter introduces a novel approach to address the bearings-only target motion analysis (BO-TMA) problem by incorporating deep reinforcement learning (DRL) techniques. Conventional methods often exhibit biases and struggle to achieve accurate results, especially when confronted with high levels of noise. In this letter, we formulate the BO-TMA problem as a Markov decision process (MDP) and process it within a DRL framework. Simulation results demonstrate that the proposed DRL-based estimator achieves reduced bias and lower errors compared to existing estimators.
A double-factored decision theory for Markov decision processes with multiple scenarios of the parameters is proposed in this article. We introduce the scenario belief to describe the probability distribution of scenarios in the system, and the scenario expectation to formulate the expected total discounted reward of a policy. We establish a new framework named the double-factored Markov decision process (DFMDP), in which the physical state and the scenario belief are shown to be the double factors serving as sufficient statistics for the history of the decision process. Four classes of policies for finite-horizon DFMDPs are studied, and it is shown that there exists a double-factored Markovian deterministic policy that is optimal among all policies. We also formulate infinite-horizon DFMDPs and present their optimality equation. An exact solution method named double-factored backward induction is proposed for finite-horizon DFMDPs. It is used to find the optimal policies for the numerical examples, which are then compared with policies derived from other methods in the related literature.
Dear Editor, This letter investigates the optimal transmission scheduling problem in remote state estimation systems over an unknown wireless channel. We propose a partially observable Markov decision process (POMDP) framework to model the sensor scheduling problem. By truncating and simplifying the POMDP problem, we establish the properties of the optimal solution under the POMDP model through a fixed-point contraction method, and show that the threshold structure of the POMDP solution is not easily attainable. Subsequently, we obtain a suboptimal solution via Q-learning. Numerical simulations are used to demonstrate the efficacy of the proposed Q-learning approach.
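The Q-learning step mentioned in this letter rests on the standard tabular update rule, sketched below. This is the textbook update only, not the letter's scheduling algorithm. The state encoding, the action names "send" and "idle", and the step-size and discount values are hypothetical.

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    Q is a dict keyed by (state, action); missing entries count as 0.0.
    """
    best_next = max(Q.get((next_state, b), 0.0) for b in actions)
    old = Q.get((state, action), 0.0)
    # Move the old estimate toward the bootstrapped one-step target.
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]


if __name__ == "__main__":
    actions = ["send", "idle"]
    Q = {}
    print(q_learning_update(Q, 0, "send", 1.0, 1, actions))
```

Because the update uses only sampled transitions, it does not require the channel statistics to be known in advance, which is what makes it applicable over an unknown channel.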
The Virtual Power Plant (VPP), an innovative power management architecture, achieves flexible dispatch and resource optimization of power systems by integrating distributed energy resources. However, owing to significant differences in the operational costs and flexibility of various types of generation resources, the volatility and uncertainty of renewable energy sources (such as wind and solar power), and the complex variability of load demand, the scheduling optimization of virtual power plants has become a critical issue. To address it, this paper proposes an intelligent scheduling method for virtual power plants based on Deep Reinforcement Learning (DRL), using Deep Q-Networks (DQN) for real-time optimized scheduling of the dynamic peaking unit (DPU) and stable baseload unit (SBU) in the virtual power plant. By modeling the scheduling problem as a Markov decision process (MDP) and designing an optimization objective function that integrates both performance and cost, the scheduling efficiency and economic performance of the virtual power plant are significantly improved. Simulation results show that, compared with traditional scheduling methods and other deep reinforcement learning algorithms, the proposed method has significant advantages in key performance indicators: response time is shortened by up to 34%, task success rate is increased by up to 46%, and costs are reduced by approximately 26%. Experimental results verify the efficiency and scalability of the method under complex load environments and volatile renewable energy, providing strong technical support for the intelligent scheduling of virtual power plants.
The ability of mobile robots to plan and execute a path is foundational to various path-planning challenges, particularly Coverage Path Planning. While this task has typically been tackled with classical algorithms, these often struggle with flexibility and adaptability in unknown environments. On the other hand, recent advances in Reinforcement Learning offer promising approaches, yet a significant gap in the literature remains when it comes to generalization over a large number of parameters. This paper presents a unified, generalized framework for coverage path planning that leverages value-based deep reinforcement learning techniques. The novelty of the framework comes from the design of an observation space that accommodates different map sizes, an action masking scheme that guarantees safety and robustness while also serving as a learning-from-demonstration technique during training, and a unique reward function that yields value functions that are size-invariant. These are coupled with a curriculum learning-based training strategy and parametric environment randomization, enabling the agent to tackle complete or partial coverage path planning with perfect or incomplete knowledge while generalizing to different map sizes, configurations, sensor payloads, and sub-tasks. Our empirical results show that the algorithm can handle zero-shot scenarios at a near-optimal level in environments that follow a distribution similar to that seen during training, outperforming a greedy heuristic sixfold. Furthermore, in out-of-distribution environments, our method surpasses existing state-of-the-art algorithms in most zero-shot and all few-shot scenarios, paving the way for generalizable and adaptable path-planning algorithms.
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) designed to minimize the total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve the decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies offer significant improvements to the basic PPO, and the proposed IPPO outperforms the state-of-the-art scheduling methods in both convergence and solution quality.
Human-natural systems (HNS) are complex, adaptive systems where human activities and natural processes are deeply intertwined. Stochastic processes, nonlinear couplings, feedback loops, and emergent phenomena collectively shape the interactions between human behaviors and natural dynamics. While environmental models in Earth system science are relatively well established (equation- or process-based), the modeling of human systems remains insufficient. Moreover, a unified framework for effectively characterizing these complex interactions is lacking, posing significant challenges for HNS modeling and decision-making. To fill these gaps, this study proposes an integrated multi-agent deep reinforcement learning (MADRL) framework that combines the Markov decision process (MDP), agent-based modeling (ABM), and deep reinforcement learning (DRL) to address modeling and decision-making challenges in HNS. The framework is structured as an MDP, defined by four core components: states of the environment (natural system), actions of agents (human system), transitions of the states (evolution of the HNS), and rewards. We introduce ABM to simulate human behaviors, decision-making, and interactions among multi-hierarchical stakeholders, including individuals, groups, communities, governments, and non-governmental organizations. Additionally, DRL is employed to tackle the high-dimensional solving challenges of the MDP. Finally, a classic case study based on the "Tragedy of the Commons" is designed, featuring multiple fishermen operating under specific decision rules around a shared fishpond resource. The results demonstrate that under purely economic incentives, fishermen tend to adopt high-intensity fishing strategies. This leads to fish populations rapidly declining from an initial 1,600 units to near zero within 20 time steps, reproducing the classic "Tragedy of the Commons" phenomenon. In contrast, introducing sustainability penalty mechanisms or cooperative mechanisms effectively guides fishermen to adjust their fishing strategies. These mechanisms promote more stable and moderate fishing behaviors, maintaining fish populations at approximately 500 units and 1,500 units, respectively, throughout the simulation period. Furthermore, by incorporating behavioral parameters (greediness factors), the model effectively captures heterogeneity in fishing propensities. High-greediness fishermen exhibit aggressive behaviors while low-greediness fishermen adopt more conservative strategies, revealing the impact of individual behavioral differences on system dynamics. These findings validate the ability of the proposed MADRL framework to capture the dynamic feedback loops between heterogeneous agents and their environment, as well as emergent nonlinear phenomena. By providing an integrated framework for analyzing and understanding these core mechanisms among multiple processes, agents, and activities of HNS, this study lays the foundation for future large-scale numerical experiments that address governance and decision-making challenges across multiple scales.
Funding: Supported by the National Natural Science Foundation of China (71571019).
Funding: Partly supported by the National Key R&D Program of China (No. 2018YFA0306701), the Australian Research Council (Nos. DP160101652 and DP180100691), the National Natural Science Foundation of China (No. 61832015), and the Key Research Program of Frontier Sciences, Chinese Academy of Sciences.
Abstract: Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risks into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above mean rewards is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risks of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions where the risk is zero or near zero, but the average rewards are also rather low. In this paper, we therefore study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target. In contrast to minimizing such risks, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation, the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
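The objective stated in this abstract (expected reward minus a scalar times the downside risk) can be written compactly. The symbols below are assumptions introduced only for illustration and do not come from the paper: R is the reward (revenue) earned under policy π, τ is the pre-specified target, and θ ≥ 0 is the scalar weight on the risk term.

```latex
\max_{\pi} \; J(\pi) \;=\; \mathbb{E}_{\pi}[R] \;-\; \theta \, \Pr\nolimits_{\pi}\!\left( R < \tau \right), \qquad \theta \ge 0 .
```

Setting θ = 0 recovers the risk-neutral criterion, while a larger θ trades average reward for a lower probability of falling below the target.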
Funding: Supported by the National Natural Science Foundation of China (10801056) and the Natural Science Foundation of Ningbo (2010A610094).
Abstract: This paper studies the limit average variance criterion for continuous-time Markov decision processes in Polish spaces. Based on two approaches, this paper proves not only the existence of solutions to the variance minimization optimality equation and the existence of a variance minimal policy that is canonical, but also the existence of solutions to the two variance minimization optimality inequalities and the existence of a variance minimal policy which may not be canonical. An example is given to illustrate all of our conditions.
Abstract: In recent years, ride-on-demand (RoD) services such as Uber and Didi have become increasingly popular. Unlike traditional taxi services, RoD services adopt dynamic pricing mechanisms to manipulate the supply and demand on the road, and such mechanisms improve service capacity and quality. Seeking-route recommendation has been widely studied for taxi services. In RoD services, the dynamic price is a new and accurate indicator of the supply and demand condition, but it has rarely been studied as a clue for drivers seeking passengers. In this paper, we propose to incorporate the impact of dynamic prices as a key factor in recommending seeking routes to drivers. We first show the importance of and need for doing so by analyzing real service data. We then design a Markov Decision Process (MDP) model based on passenger order and car GPS trajectory datasets, and take dynamic prices into account in designing rewards. Results show that our model not only guides drivers to locations with higher prices, but also significantly improves driver revenue. Compared with driver revenue before using the model, the maximum increase in yield after using it reaches 28%.
Abstract: Decision-making is the process of choosing between two or more options in order to take the most appropriate and successful course of action, here with the aim of achieving sustainable mangrove management. However, the distinctiveness of mangroves as an ecosystem, and the attendant socio-economic and governance ramifications, makes decision making in this setting relatively distinct from other decision-making processes. The purpose of this research was therefore to evaluate the role that community engagement plays in the decision-making process as it relates to the establishment of governance norms for sustainable mangrove management in Lamu County. A correlational research design was applied, and the researchers employed a mixed-methods approach. The target population was 296 respondents. The research used questionnaires and interviews to collect data. Descriptive statistics were used to inspect and analyze the data gathered. The findings indicated that awareness of governance standards is beneficial during the decision-making process. In addition, the findings showed that respondents had the impression that the decision-making process was not carried out properly. On the other hand, the participants pointed out the positive aspects of the decision-making process and agreed that the participation of both genders was essential for the sustainable management of mangroves. Based on these data, full community engagement in decision-making appears necessary for sustainable management of mangrove forests.
Abstract: The design process of the built environment relies on the collaborative effort of all parties involved in the project. During the design phase, owners, end users, and their representatives are expected to make the most critical design and budgetary decisions, shaping the essential traits of the project; hence the need to create and integrate mechanisms to support the decision-making process. Design decisions should not be based on assumptions, past experiences, or imagination. An example of the numerous problems that result from uninformed design decisions is “change orders”, known as the deviation from the original scope of work, which leads to an increase in the overall cost and to changes in the construction schedule of the project. The long-term aim of this inquiry is to understand the user’s behavior and to establish evidence-based control measures, which are actions and processes that can be implemented in practice to decrease the volume and frequency of change orders. The current study developed a foundation for further examination by proposing potential control measures, such as integrating Virtual Reality (VR), and testing their efficiency. The specific aim was to examine the effect of different visualization methods (i.e., VR vs. construction drawings) on (1) how well the subjects understand the information presented about the future/planned environment; (2) the subjects’ perceived confidence in what the future environment will look like; (3) the likelihood of changing the built environment; (4) design review time; and (5) accuracy in reviewing and understanding the design.
Abstract: Cooperative multi-UAV search requires jointly optimizing wide-area coverage, rapid target discovery, and endurance under sensing and motion constraints. Resolving this coupling enables scalable coordination with high data efficiency and mission reliability. We formulate this problem as a discounted Markov decision process on an occupancy grid with a cellwise Bayesian belief update, yielding a Markov state that couples agent poses with a probabilistic target field. On this belief-MDP we introduce a segment-conditioned latent-intent framework, in which a discrete intent head selects a latent skill every K steps and an intra-segment GRU policy generates per-step control conditioned on the fixed intent; both components are trained end-to-end with proximal updates under a centralized critic. On the 50×50 grid, coverage and discovery convergence times are reduced by up to 48% and 40% relative to a flat actor-critic benchmark, and the aggregated convergence metric improves by about 12% compared with a state-of-the-art hierarchical method. Qualitative analyses further reveal stable spatial sectorization, low path overlap, and fuel-aware patrolling, indicating that segment-conditioned latent intents provide an effective and scalable mechanism for coordinated multi-UAV search.
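As a rough illustration of what a cellwise Bayesian belief update on an occupancy grid can look like, the sketch below applies Bayes' rule independently to each observed cell. It is not the paper's implementation: the sensor model (a detection probability `p_detect` and a false-alarm probability `p_false`), the function names, and the observation format are all assumptions made for this example.

```python
def update_cell(prior, detected, p_detect=0.9, p_false=0.1):
    """Posterior probability that a cell holds a target after one sensor reading.

    prior    : current belief that the cell contains a target (0..1)
    detected : True if the sensor reported a detection in this cell
    p_detect : assumed P(detection | target present)
    p_false  : assumed P(detection | no target)
    """
    if detected:
        like_target, like_empty = p_detect, p_false
    else:
        like_target, like_empty = 1.0 - p_detect, 1.0 - p_false
    numerator = like_target * prior
    evidence = numerator + like_empty * (1.0 - prior)
    # If the evidence is zero the reading is impossible under the model; keep the prior.
    return numerator / evidence if evidence > 0 else prior


def update_belief(belief, observations, p_detect=0.9, p_false=0.1):
    """Return a new grid with every observed cell updated independently.

    belief       : list of rows, each a list of per-cell target probabilities
    observations : dict mapping (row, col) -> bool detection flag
    """
    new_belief = [row[:] for row in belief]  # copy so the input grid is untouched
    for (r, c), detected in observations.items():
        new_belief[r][c] = update_cell(belief[r][c], detected, p_detect, p_false)
    return new_belief
```

Cells that no agent observes keep their prior, which is what lets the belief grid, together with the agent poses, act as the state of the decision process.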
Funding: Supported by the Hubei Provincial Technology Innovation Special Project and the Natural Science Foundation of Hubei Province under Grants 2023BEB024 and 2024AFC066, respectively.
Abstract: Underwater images frequently suffer from chromatic distortion, blurred details, and low contrast, posing significant challenges for enhancement. This paper introduces AquaTree, a novel underwater image enhancement (UIE) method that reformulates the task as a Markov Decision Process (MDP) through the integration of Monte Carlo Tree Search (MCTS) and deep reinforcement learning (DRL). The framework employs an action space of 25 enhancement operators, strategically grouped for basic attribute adjustment, color component balance, correction, and deblurring. Exploration within MCTS is guided by a dual-branch convolutional network, enabling intelligent sequential operator selection. Our core contributions include: (1) a multimodal state representation combining CIELab color histograms with deep perceptual features, (2) a dual-objective reward mechanism optimizing chromatic fidelity and perceptual consistency, and (3) an alternating training strategy co-optimizing enhancement sequences and network parameters. We further propose two inference schemes: an MCTS-based approach prioritizing accuracy at higher computational cost, and an efficient network policy enabling real-time processing with minimal quality loss. Comprehensive evaluations on the UIEB dataset, together with color correction and haze removal comparisons on the U45 dataset, demonstrate AquaTree’s superiority, significantly outperforming nine state-of-the-art methods across five established underwater image quality metrics.
Funding: Partially supported by the National Science Foundation of China (61661025, 61661026) and the Foundation of A Hundred Youth Talents Training Program of Lanzhou Jiaotong University (152022).
Abstract: A network selection optimization algorithm based on the Markov decision process (MDP) is proposed so that mobile terminals can always connect to the best wireless network in a heterogeneous network environment. Considering the different types of service requirements, the MDP model and its reward function are constructed based on the quality of service (QoS) attribute parameters of the mobile users, and the network attribute weights are calculated by using the analytic hierarchy process (AHP). The network handoff decision condition is designed according to the different types of user services and the time-varying characteristics of the network, and the MDP model is solved by using the genetic algorithm and simulated annealing (GA-SA), so that users can seamlessly switch to the network with the best long-term expected reward value. Simulation results show that the proposed algorithm has good convergence performance and can guarantee that users with different service types obtain satisfactory expected total reward values with a low number of network handoffs.
Abstract: Alunite is the most important non-bauxite resource for alumina. Various methods have been proposed and patented for processing alunite, but none has been performed at industrial scale, and no technical, operational, or economic data are available to evaluate the methods. In addition, selecting the right approach for alunite beneficiation requires introducing a wide range of criteria and careful analysis of alternatives. In this research, after studying the existing processes, 13 methods were considered and evaluated against 14 technical, economic, and environmental criteria. Due to the multiplicity of processing methods and attributes, Multi Attribute Decision Making methods were employed in this paper to examine the appropriateness of the choices. The Delphi Analytical Hierarchy Process (DAHP) was used for weighting the selection criteria, and the Fuzzy TOPSIS approach was used to determine the most profitable candidates. Among the 13 studied methods, the Spanish, Svoronos, and Hazan methods were recognized, in that order, as the best choices.
Funding: Supported by the National Natural Science Foundation of China (70971048).
Abstract: Decisions in reality often have a hierarchical character because of the hierarchy of an organization's structure. In this paper, we propose a two-level hierarchic Markov decision model that considers the interactions of agents in different levels and the different time scales of the levels. A backward induction algorithm is given for the model to solve for the optimal policy of the finite-stage hierarchic decision problem. The proposed model and its algorithm are illustrated with an example of a two-level hierarchical decision problem in infrastructure maintenance. The optimal policy of the example is solved, and the impacts of interactions between levels on decision making are analyzed.
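For readers unfamiliar with backward induction, the following is a minimal sketch for an ordinary single-level, finite-horizon MDP. It is not the two-level hierarchic algorithm of the paper; the data layout (`P[a][s][s2]` for transition probabilities, `R[a][s]` for immediate rewards), the zero terminal value, and the function name are assumptions chosen for illustration.

```python
def backward_induction(P, R, horizon):
    """Solve a finite-horizon MDP by stepping backwards from the last stage.

    P[a][s][s2] : probability of moving from state s to state s2 under action a
    R[a][s]     : immediate reward for taking action a in state s
    horizon     : number of decision stages
    Returns (V, policy), where V[s] is the optimal expected total reward from
    state s at stage 0 and policy[t][s] is an optimal action at stage t.
    """
    n_actions = len(P)
    n_states = len(P[0])
    V = [0.0] * n_states  # assumed terminal value: nothing is earned after the last stage
    policy = []
    for _ in range(horizon):
        new_V, decision = [], []
        for s in range(n_states):
            # Expected total reward of each action: immediate reward plus
            # the expected value of the remaining stages.
            q = [
                R[a][s] + sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            best = max(range(n_actions), key=q.__getitem__)
            decision.append(best)
            new_V.append(q[best])
        V = new_V
        policy.append(decision)
    policy.reverse()  # decisions were computed last stage first
    return V, policy


if __name__ == "__main__":
    # Hypothetical maintenance toy problem: state 0 = good, state 1 = failed;
    # action 0 = do nothing, action 1 = repair.
    P = [
        [[0.8, 0.2], [0.0, 1.0]],  # do nothing
        [[1.0, 0.0], [1.0, 0.0]],  # repair
    ]
    R = [
        [10.0, 0.0],   # do nothing
        [6.0, -4.0],   # repair
    ]
    values, plan = backward_induction(P, R, horizon=2)
    print(values, plan)
```

In the toy problem, repairing a failed asset pays off only when a stage remains in which to collect the reward, which is why the resulting policy depends on the stage.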
Abstract: An AI-aided simulation system embedded in NY-IEDSS, a model-based, aspiration-led decision support system, is reported. NY-IEDSS is designed for the mid-term development strategy study of the Nanyang Region in Henan, China, and is moving beyond its prototype stage under the orientation of the decision maker (the end user). The integration of the simulation model system, decision analysis, and expert system for decision support in the system implementation is reviewed. The intent of the paper is to provide insight into how system capability and acceptability can be enhanced by this integration. Moreover, emphasis is placed on problem orientation in applying the method.
Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (LZ23F030006), the National Natural Science Foundation of China (62173299, U23B2060), the Joint Fund of Ministry of Education for Pre-Research of Equipment (8091B022147, 8091B032234, 8091B042220), and the Fundamental Research Funds for Xi’an Jiaotong University (xtr072022001).
Abstract: Dear Editor, This letter introduces a novel approach to address the bearings-only target motion analysis (BO-TMA) problem by incorporating deep reinforcement learning (DRL) techniques. Conventional methods often exhibit biases and struggle to achieve accurate results, especially when confronted with high levels of noise. In this letter, we formulate the BO-TMA problem as a Markov decision process (MDP) and process it within a DRL framework. Simulation results demonstrate that the proposed DRL-based estimator achieves reduced bias and lower errors compared to existing estimators.
Funding: Supported by the (United States) National Science Foundation (No. 1409214).
Abstract: A double-factored decision theory for Markov decision processes with multiple scenarios of the parameters is proposed in this article. We introduce scenario belief to describe the probability distribution of scenarios in the system, and scenario expectation to formulate the expected total discounted reward of a policy. We establish a new framework named the double-factored Markov decision process (DFMDP), in which the physical state and scenario belief are shown to be the double factors serving as the sufficient statistics for the history of the decision process. Four classes of policies for finite horizon DFMDPs are studied, and it is shown that there exists a double-factored Markovian deterministic policy which is optimal among all policies. We also formulate infinite horizon DFMDPs and present their optimality equation. An exact solution method named double-factored backward induction is proposed for finite horizon DFMDPs. It is used to find the optimal policies for the numerical examples, which are then compared with policies derived from other methods in the related literature.
Funding: Supported in part by the Frontier Technology R&D Plan of Jiangsu Province (BF2024065), the Shenzhen Science and Technology Program (JCYJ20230807114609019), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX22_0236).
Abstract: Dear Editor, This letter investigates the optimal transmission scheduling problem in remote state estimation systems over an unknown wireless channel. We propose a partially observable Markov decision process (POMDP) framework to model the sensor scheduling problem. By truncating and simplifying the POMDP problem, we establish the properties of the optimal solution under the POMDP model through a fixed-point contraction method, and show that the threshold structure of the POMDP solution is not easily attainable. Subsequently, we obtain a suboptimal solution via Q-learning. Numerical simulations are used to demonstrate the efficacy of the proposed Q-learning approach.
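The Q-learning step mentioned in this abstract can be illustrated with the standard tabular update rule. This is a generic sketch, not the letter's scheduler: the table layout (`Q[state][action]`), the learning rate `alpha`, the discount factor `gamma`, and the function name are assumptions made for this example.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new value.

    Q          : list of lists, Q[state][action] is the current value estimate
    reward     : reward observed after taking `action` in `state`
    next_state : state observed after the transition
    alpha      : assumed learning rate (0 < alpha <= 1)
    gamma      : assumed discount factor (0 <= gamma < 1)
    """
    best_next = max(Q[next_state])              # greedy value of the next state
    td_target = reward + gamma * best_next      # one-step bootstrapped target
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]
```

Because the update needs only sampled transitions and rewards, it can be run without knowing the channel statistics, which is consistent with the unknown-channel setting described in the abstract.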
Funding: Supported by the National Key Research and Development Program of China, Grant No. 2020YFB0905900.
Abstract: The Virtual Power Plant (VPP), as an innovative power management architecture, achieves flexible dispatch and resource optimization of power systems by integrating distributed energy resources. However, due to significant differences in the operational costs and flexibility of various types of generation resources, as well as the volatility and uncertainty of renewable energy sources (such as wind and solar power) and the complex variability of load demand, the scheduling optimization of virtual power plants has become a critical issue that needs to be addressed. To solve this, this paper proposes an intelligent scheduling method for virtual power plants based on Deep Reinforcement Learning (DRL), utilizing Deep Q-Networks (DQN) for real-time optimization scheduling of the dynamic peaking unit (DPU) and stable baseload unit (SBU) in the virtual power plant. By modeling the scheduling problem as a Markov Decision Process (MDP) and designing an optimization objective function that integrates both performance and cost, the scheduling efficiency and economic performance of the virtual power plant are significantly improved. Simulation results show that, compared with traditional scheduling methods and other deep reinforcement learning algorithms, the proposed method demonstrates significant advantages in key performance indicators: response time is shortened by up to 34%, task success rate is increased by up to 46%, and costs are reduced by approximately 26%. Experimental results verify the efficiency and scalability of the method under complex load environments and the volatility of renewable energy, providing strong technical support for the intelligent scheduling of virtual power plants.
Funding: Supported by project RELIABLE (PTDC/EEI-AUT/3522/2020), R&D Unit SYSTEC-Base (UIDB001472020) and Programmatic (UIDP001472020) funds, and Associate Laboratory Advanced Production and Intelligent Systems ARISE-LAP01122020, funded by national funds through the FCT/MCTES (PIDDAC).
Abstract: The ability of mobile robots to plan and execute a path is foundational to various path-planning challenges, particularly Coverage Path Planning. While this task has typically been tackled with classical algorithms, these often struggle with flexibility and adaptability in unknown environments. On the other hand, recent advances in Reinforcement Learning offer promising approaches, yet a significant gap in the literature remains when it comes to generalization over a large number of parameters. This paper presents a unified, generalized framework for coverage path planning that leverages value-based deep reinforcement learning techniques. The novelty of the framework comes from the design of an observation space that accommodates different map sizes, an action masking scheme that guarantees safety and robustness while also serving as a learning-from-demonstration technique during training, and a unique reward function that yields value functions that are size-invariant. These are coupled with a curriculum learning-based training strategy and parametric environment randomization, enabling the agent to tackle complete or partial coverage path planning with perfect or incomplete knowledge while generalizing to different map sizes, configurations, sensor payloads, and sub-tasks. Our empirical results show that the algorithm can perform zero-shot learning scenarios at a near-optimal level in environments that follow a similar distribution as during training, outperforming a greedy heuristic by sixfold. Furthermore, in out-of-distribution environments, our method surpasses existing state-of-the-art algorithms in most zero-shot and all few-shot scenarios, paving the way for generalizable and adaptable path-planning algorithms.
Funding: Partially supported by the National Key Research and Development Program of the Ministry of Science and Technology of China (2022YFE0114200) and the National Natural Science Foundation of China (U20A6004).
Abstract: This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) designed to minimize the total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve the decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies offer significant improvements to the basic PPO, and that the proposed IPPO outperforms the state-of-the-art scheduling methods in both convergence and solution quality.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 42301545, 42430112) and the China Postdoctoral Science Foundation (Grant No. 2023M733606).
Abstract: Human-natural systems (HNS) are complex, adaptive systems where human activities and natural processes are deeply intertwined. Stochastic processes, nonlinear couplings, feedback loops, and emergent phenomena collectively shape the interactions between human behaviors and natural dynamics. While environmental models in Earth system science are relatively well-established (equation- or process-based), the modeling of human systems remains insufficient. Moreover, a unified framework for effectively characterizing these complex interactions is lacking, posing significant challenges for HNS modeling and decision-making. To fill these gaps, this study proposes an integrated multi-agent deep reinforcement learning (MADRL) framework that combines the Markov decision process (MDP), agent-based modeling (ABM), and deep reinforcement learning (DRL) to address modeling and decision-making challenges in HNS. The framework is structured as an MDP, defined by four core components: states of the environment (natural system), actions of agents (human system), transitions of the states (evolution of the HNS), and rewards. We introduce ABM to simulate human behaviors, decision making, and interactions among multi-hierarchical stakeholders, including individuals, groups, communities, governments, and non-governmental organizations. Additionally, DRL is employed to tackle the high-dimensional solving challenges of the MDP. Finally, a classic case study based on the “Tragedy of the Commons” is designed, featuring multiple fishermen operating under specific decision rules around a shared fishpond resource. The results demonstrate that under purely economic-driven incentives, fishermen tend to adopt high-intensity fishing strategies. This leads to fish populations rapidly declining from an initial 1,600 units to near zero within 20 time steps, reproducing the classic “Tragedy of the Commons” phenomenon. In contrast, introducing sustainability penalty mechanisms or cooperative mechanisms effectively guides fishermen to adjust their fishing strategies. These mechanisms promote more stable and moderate fishing behaviors, maintaining fish populations at approximately 500 units and 1,500 units respectively throughout the simulation period. Furthermore, by incorporating behavioral parameters (greediness factors), the model effectively captures heterogeneity in fishing propensities. High-greediness fishermen exhibit aggressive behaviors while low-greediness fishermen adopt more conservative strategies, revealing the impact of individual behavioral differences on system dynamics. These findings validate the ability of our proposed MADRL framework to capture the dynamic feedback loops between heterogeneous agents and their environment, as well as emergent non-linear phenomena. By providing an integrated framework for analyzing and understanding these core mechanisms among multiple processes, agents, and activities of HNS, this study lays the foundation for future large-scale numerical experiments that address governance and decision-making challenges across multiple scales.