In this paper, a distributed adaptive dynamic programming (ADP) framework based on value iteration is proposed for multi-player differential games. In the game setting, players have no access to information about the other players' system parameters or control laws. Each player adopts an on-policy value iteration algorithm as the basic learning framework. To deal with the incomplete information structure, players collect a period of system trajectory data to compensate for the lack of information. The policy updating step is implemented as a nonlinear optimization problem that searches for the proximal admissible policy. Theoretical analysis shows that, by adopting the proximal policy-searching rule, the approximated policies converge to a neighborhood of the equilibrium policies. The efficacy of our method is illustrated by three examples, which also demonstrate that the proposed method can accelerate the learning process compared with a centralized learning framework.
Funding: supported by the Aeronautical Science Foundation of China (20220001057001) and an Open Project of the National Key Laboratory of Air-based Information Perception and Fusion (202437).
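The abstract does not spell out the optimization, so the following is only a minimal sketch of a proximal policy-search step for one player with linear dynamics and a quadratic value function. The matrices A, B_i, Q_i, R_i, the value matrix P_i, and the trust radius delta are all assumptions introduced here for illustration, not the paper's exact formulation.

```python
# Hypothetical proximal policy update for player i: search for a new
# feedback gain K near the current one K_old, so the updated policy
# stays close to an admissible one while shrinking a Lyapunov-style
# residual of the player's current value matrix P_i.
import numpy as np
from scipy.optimize import minimize

def proximal_policy_update(K_old, P_i, A, B_i, Q_i, R_i, delta=0.1):
    n, m = B_i.shape

    def objective(k_flat):
        K = k_flat.reshape(m, n)
        Acl = A - B_i @ K                       # player's closed loop
        H = Acl.T @ P_i + P_i @ Acl + Q_i + K.T @ R_i @ K
        return np.linalg.norm(H, "fro")         # residual to shrink

    # proximal constraint: stay within delta of the previous policy
    cons = {"type": "ineq",
            "fun": lambda k: delta - np.linalg.norm(k - K_old.ravel())}
    res = minimize(objective, K_old.ravel(), constraints=cons)
    return res.x.reshape(m, n)
```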
This paper presents a novel cooperative value iteration (VI)-based adaptive dynamic programming method for multi-player differential game models, with a convergence proof. The players are divided into two groups in the learning process and adapt their policies sequentially. Our method removes the dependence on admissible initial policies, which is one of the main drawbacks of PI-based frameworks. Furthermore, the algorithm enables the players to adapt their control policies without full knowledge of the other players' system parameters or control laws. The efficacy of our method is illustrated by three examples.
Funding: supported by the Industry-University-Research Cooperation Fund Project of the Eighth Research Institute of China Aerospace Science and Technology Corporation (USCAST2022-11) and the Aeronautical Science Foundation of China (20220001057001).
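As a reading aid only, here is a skeleton of the sequential two-group schedule described above; the player container and the per-player update routine vi_policy_update are hypothetical stand-ins for the paper's VI-based learning step.

```python
# Sketch: group A adapts while group B holds its policies fixed, then
# the roles swap; repeating this alternation is the cooperative scheme.
def cooperative_vi(players, group_a, group_b, n_rounds, vi_policy_update):
    for _ in range(n_rounds):
        for i in group_a:                 # group A updates first
            players[i].policy = vi_policy_update(players, i)
        for i in group_b:                 # then group B responds
            players[i].policy = vi_policy_update(players, i)
    return [p.policy for p in players]
```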
In this paper, an accelerated value iteration (VI) algorithm is established to solve the zero-sum game problem with a convergence guarantee. First, inspired by successive over-relaxation theory, the convergence rate of the iterative value function sequence is significantly accelerated with a relaxation factor. Second, the convergence and monotonicity of the value function sequence are analyzed under different ranges of the relaxation factor. Third, two practical approaches, namely the integrated scheme and the relaxation function, are introduced into the accelerated VI algorithm to guarantee the convergence of the iterative value function sequence for zero-sum games. The integrated scheme consists of an accelerated stage and a convergence stage, and the relaxation function adjusts the value of the relaxation factor. Finally, the performance of the accelerated VI algorithm is verified through two examples with practical physical backgrounds, including an autopilot controller.
Funding: supported in part by the National Natural Science Foundation of China under Grants 62222301, 61890930-5, and 62021003; the National Science and Technology Major Project under Grants 2021ZD0112302 and 2021ZD0112301; and the Beijing Natural Science Foundation under Grant JQ19013.
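The core relaxation step is standard successive over-relaxation applied to the Bellman backup. Below is a tabular sketch for a finite, discounted minimax problem; the reward array r[s, a, d], the deterministic transition table nxt[s, a, d], and the discount factor are simplifying assumptions, and omega = 1 recovers plain VI.

```python
# Relaxation-accelerated value iteration for a finite zero-sum game:
# the new value is a weighted mix of the old value and the fresh
# minimax Bellman backup, with relaxation factor omega.
import numpy as np

def accelerated_vi(r, nxt, gamma=0.95, omega=1.2, iters=200):
    S, A, D = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * V[nxt]              # backup, shape (S, A, D)
        TV = Q.max(axis=2).min(axis=1)      # d maximizes, a minimizes
        V = (1.0 - omega) * V + omega * TV  # over-relaxation step
    return V
```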
In an unmanned aerial vehicle ad-hoc network (UANET), sparse and rapidly mobile unmanned aerial vehicles (UAVs)/nodes can dynamically change the UANET topology, which may degrade UANET service performance. In this study, for planning rapidly changing UAV swarms, we propose a dynamic value iteration network (DVIN) model, trained using the episodic Q-learning method with the connection information of UANETs, to generate a state-value spread function that enables UAVs/nodes to adapt to novel physical locations. We then evaluate the performance of the DVIN model and compare it with the non-dominated sorting genetic algorithm II and the exhaustive method. Simulation results demonstrate that the proposed model significantly reduces the decision-making time for UAV/node path planning while maintaining a high average success rate.
Funding: supported by the National Natural Science Foundation of China (No. 61501399), SAIC MOTOR (No. 1925), and the National Key R&D Program of China (No. 2018AAA0102302).
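The training signal here is episodic Q-learning, which can be summarized by the tabular sketch below; the env.reset()/env.step() interface, the table sizes, and all hyperparameters are assumptions standing in for the paper's grid abstraction of the network.

```python
# Tabular episodic Q-learning: run whole episodes and apply one-step
# temporal-difference updates with an epsilon-greedy behavior policy.
import numpy as np

def episodic_q_learning(env, n_states, n_actions,
                        episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (int(rng.integers(n_actions)) if rng.random() < eps
                 else int(Q[s].argmax()))                  # epsilon-greedy
            s2, r, done = env.step(a)                      # assumed API
            target = r + gamma * Q[s2].max() * (not done)  # TD target
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```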
To address the output feedback problem for linear discrete-time systems, this work proposes a new adaptive dynamic programming (ADP) technique based on the internal model principle (IMP). The proposed method, termed IMP-ADP, does not require complete state feedback, merely the measurement of input and output data. More specifically, based on the IMP, the output control problem is first converted into a stabilization problem. We then design an observer to reproduce the full state of the system by measuring the inputs and outputs. Moreover, the technique includes both a policy iteration algorithm and a value iteration algorithm to determine the optimal feedback gain without using a dynamic system model. Notably, with this approach one does not need to solve the regulator equation. Finally, the control method was tested on a grid-connected LCL inverter system to demonstrate that it provides the desired performance in terms of both tracking and disturbance rejection.
Funding: supported by the National Science Fund for Distinguished Young Scholars (62225303), the Fundamental Research Funds for the Central Universities (buctrc202201), the China Scholarship Council, and the High Performance Computing Platform, College of Information Science and Technology, Beijing University of Chemical Technology.
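For intuition about the VI variant, the sketch below iterates the model-based discrete-time Riccati backup that the data-driven algorithm approximates from input/output data; known (A, B, Q, R) is an assumption made purely for illustration, whereas IMP-ADP itself needs no model.

```python
# Model-based stand-in: value-iteration backup of the discrete-time
# Riccati equation, converging to the optimal feedback gain K.
import numpy as np

def riccati_vi(A, B, Q, R, iters=500):
    P = np.zeros_like(A)
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain update
        P = Q + A.T @ P @ (A - B @ K)                      # value update
    return K, P
```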
This article develops a novel data-driven safe Q-learning method to design a safe optimal controller that guarantees the constrained states of nonlinear systems always stay in the safe region while providing optimal performance. First, we design an augmented utility function consisting of an adjustable positive definite control obstacle function and a quadratic form of the next state to ensure both safety and optimality. Second, by exploiting a pre-designed admissible policy for initialization, an off-policy stabilizing value iteration Q-learning (SVIQL) algorithm is presented to seek the safe optimal policy using offline data within the safe region rather than a mathematical model. Third, the monotonicity, safety, and optimality of the SVIQL algorithm are theoretically proven. To obtain the initial admissible policy for SVIQL, an offline VIQL algorithm with zero initialization is constructed and a new admissibility criterion is established for immature iterative policies. Moreover, critic and action networks with precise approximation ability are established to support the operation of the VIQL and SVIQL algorithms. Finally, three simulation experiments are conducted to demonstrate the effectiveness and superiority of the developed safe Q-learning method.
Funding: supported in part by the National Science and Technology Major Project (2021ZD0112302) and the National Natural Science Foundation of China (62222301, 61890930-5, 62021003).
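The abstract gives only the ingredients of the augmented utility, so the reciprocal-barrier form below is a guess at one possible instantiation: a quadratic stage cost, an adjustable barrier term that grows as the state approaches the boundary of the safe set {x : c(x) > 0}, and a quadratic form of the next state. The weights Q, R, S, the margin function c, and lam are assumptions.

```python
# One possible augmented utility: quadratic stage cost + control
# obstacle (barrier) term + quadratic next-state term.
import numpy as np

def augmented_utility(x, u, x_next, Q, R, S, c, lam=1.0):
    barrier = lam / max(c(x), 1e-9)     # blows up as c(x) -> 0+ (clamped)
    return (x @ Q @ x + u @ R @ u       # standard quadratic stage cost
            + barrier                   # safety-enforcing term
            + x_next @ S @ x_next)      # quadratic form of next state
```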
The core task of tracking control is to make the controlled plant track a desired trajectory. The traditional performance index used in previous studies cannot completely eliminate the tracking error as the number of time steps increases. In this paper, a new cost function is introduced to develop a value-iteration-based adaptive critic framework for the tracking control problem. Unlike in the regulator problem, the iterative value function of the tracking control problem cannot be regarded as a Lyapunov function. A novel stability analysis method is therefore developed to guarantee that the tracking error converges to zero. The discounted iterative scheme under the new cost function is elaborated for the special case of linear systems. Finally, the tracking performance of the present scheme is demonstrated by numerical results and compared with that of traditional approaches.
Funding: supported in part by the Beijing Natural Science Foundation (JQ19013), the National Key Research and Development Program of China (2021ZD0112302), and the National Natural Science Foundation of China (61773373).
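To make the role of the performance index concrete, here is a hedged sketch of a discounted tracking cost that penalizes the tracking error and the deviation of the control from a steady-state input u_ss; the specific weights and the use of u_ss are assumptions, not the paper's exact cost function.

```python
# Illustrative discounted tracking cost over a finite trajectory.
import numpy as np

def tracking_cost(xs, us, refs, u_ss, Q, R, gamma=0.98):
    J = 0.0
    for k, (x, u, ref) in enumerate(zip(xs, us, refs)):
        e, du = x - ref, u - u_ss           # tracking and control errors
        J += gamma**k * (e @ Q @ e + du @ R @ du)
    return J
```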
Unmanned aerial vehicles (UAVs) can be employed as aerial base stations (BSs) due to their high mobility and flexible deployment. This paper focuses on a UAV-assisted wireless network, where users can be scheduled to access either an aerial BS or a terrestrial BS for uplink transmission. In contrast to state-of-the-art designs focusing on the instantaneous cost of the network, this paper aims at minimizing the long-term average transmit power consumed by the users by dynamically optimizing user association and power allocation in each time slot. Such a joint user association scheduling and power allocation problem can be formulated as a Markov decision process (MDP). Unfortunately, solving such an MDP problem with the conventional relative value iteration (RVI) can suffer from the curse of dimensionality in the presence of a large number of users. As a countermeasure, we propose a distributed RVI algorithm to reduce the dimension of the MDP problem, such that the original problem can be decoupled into multiple solvable small-scale MDP problems. Simulation results reveal that the proposed algorithm yields lower long-term average transmit power consumption than both the conventional RVI algorithm and a baseline algorithm with myopic policies.
Funding: supported in part by the National Natural Science Foundation of China under Grants 61901216, 61631020, and 61827801; the Natural Science Foundation of Jiangsu Province under Grant BK20190400; the open research fund of the National Mobile Communications Research Laboratory, Southeast University (No. 2020D08); and the Foundation of the Graduate Innovation Center in NUAA under Grant No. KFJJ20190408.
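For reference, the conventional baseline being decomposed is standard relative value iteration for an average-cost MDP; the tabular cost c[s, a] and transition tensor P[a, s, t] below are assumptions for a small illustrative problem.

```python
# Relative value iteration: subtract the value at a reference state
# each sweep so the iterates stay bounded under the average-cost
# criterion; the subtracted quantity estimates the optimal average cost.
import numpy as np

def relative_vi(c, P, iters=1000, ref_state=0):
    S, A = c.shape
    h = np.zeros(S)                          # relative value (bias)
    gain = 0.0
    for _ in range(iters):
        Th = (c + np.einsum("ast,t->sa", P, h)).min(axis=1)
        gain = Th[ref_state]
        h = Th - gain
    return h, gain
```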
In this paper, a reinforcement learning-based multi-battery energy storage system (MBESS) scheduling policy is proposed to minimize the consumers' electricity cost. The MBESS scheduling problem is modeled as a Markov decision process (MDP) with unknown transition probability. However, the optimal value function is time-dependent and difficult to obtain because of the periodicity of the electricity price and residential load. Therefore, a series of time-independent action-value functions is proposed to describe every period of a day. To approximate each action-value function, a corresponding critic network is established and cascaded with the other critic networks according to the time sequence. Then, the continuous management strategy is obtained from the related action network. Moreover, a two-stage learning protocol, including offline and online learning stages, is provided for detailed implementation in real-time battery management. Numerical examples are given to demonstrate the effectiveness of the developed algorithm.
Funding: supported by the National Key R&D Program of China (2018AAA0101400), the National Natural Science Foundation of China (61921004, 62173251, U1713209, 62236002), the Fundamental Research Funds for the Central Universities, and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.
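A tiny tabular sketch of the time-indexed idea follows, with the state/action discretization and the 24-period day as assumptions (the paper uses critic networks rather than tables): one action-value table per period, each bootstrapping from the next period's table.

```python
# One Q-table per period of the day; the critic for period t is
# cascaded with the critic for period t+1 through its TD target.
import numpy as np

T = 24                                        # periods per day (assumed)
n_s, n_a = 10, 5                              # assumed discretization
Q = [np.zeros((n_s, n_a)) for _ in range(T)]

def update(t, s, a, r, s_next, alpha=0.1, gamma=1.0):
    target = r + gamma * Q[(t + 1) % T][s_next].max()  # next-period critic
    Q[t][s, a] += alpha * (target - Q[t][s, a])        # TD update
```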
By applying an iterative technique, we obtain the existence of positive solutions for a singular Riemann-Stieltjes integral boundary value problem in the case that f(t,u) is non-increasing with respect to u.
Funding: supported by the Program for Scientific Research Innovation Teams in Colleges and Universities of Shandong Province, the Doctoral Program Foundation of the Education Ministry of China (20133705110003), the Natural Science Foundation of Shandong Province of China (ZR2014AM007), and the National Natural Science Foundation of China (11571197).
In this paper, we are concerned with the symmetric positive solutions of 2n-order boundary value problems on time scales. Using the induction principle, the symmetric form of the Green's function is established. An iterative technique is then used to construct a necessary and sufficient condition for the existence result. As an application, an example is given to illustrate our main result.
Funding: supported by the NNSF of China (11201213, 11371183), the NSF of Shandong Province (ZR2010AM022, ZR2013AM004), the Project of Shandong Provincial Higher Educational Science and Technology (J15LI07), the Project of Ludong University High-Quality Curriculum (20130345), and the Teaching Reform Project of Ludong University in 2014 (20140405).
The solution of minimum-time feedback optimal control problems is generally achieved using the dynamic programming approach, in which the value function must be computed on numerical grids with a very large number of points. Classical numerical strategies, such as value iteration (VI) or policy iteration (PI) methods, become very inefficient if the number of grid points is large, which strongly limits their use in real-world applications. To address this problem, the authors present a novel multilevel framework, where classical VI and PI are embedded in a full-approximation storage (FAS) scheme. In fact, the authors show that VI and PI have excellent smoothing properties, a fact that makes them very suitable for use in multilevel frameworks. Moreover, a new smoother is developed by accelerating VI using Anderson's extrapolation technique. The effectiveness of the new scheme is demonstrated by several numerical experiments.
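The Anderson-accelerated smoother can be sketched in a few lines for a generic fixed-point map; the memory-one variant below, with a one-dimensional least-squares mixing coefficient, is an illustrative choice, since the abstract does not fix the depth.

```python
# Anderson acceleration (depth 1) of a value-iteration map T:
# extrapolate between the two most recent Bellman backups using a
# least-squares estimate of the best mixing coefficient.
import numpy as np

def anderson_vi(T, V0, iters=100):
    V_prev, V = V0, T(V0)
    TV_prev = V                                # equals T(V_prev)
    for _ in range(iters):
        TV = T(V)
        f_prev, f = TV_prev - V_prev, TV - V   # fixed-point residuals
        df = f - f_prev
        theta = f @ df / max(df @ df, 1e-12)   # least-squares mixing
        V_prev, V = V, TV - theta * (TV - TV_prev)
        TV_prev = TV
    return V
```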