This paper employs the Proximal Policy Optimization (PPO) algorithm to study the risk-hedging problem of Shanghai Stock Exchange (SSE) 50ETF options. First, the action and state spaces are designed based on the characteristics of the hedging task, and a reward function is developed from the cost function of the options. Second, drawing on the concept of curriculum learning, the agent is guided through a simulated-to-real learning approach for the dynamic hedging task, which reduces the learning difficulty and addresses the shortage of option data; on this basis, a dynamic hedging strategy for 50ETF options is constructed. Finally, numerical experiments demonstrate that the designed algorithm outperforms traditional hedging strategies in hedging effectiveness.
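The PPO algorithm used in the abstract above centres on a clipped surrogate objective. As a generic illustration (an assumed minimal NumPy sketch, not the paper's implementation), the core loss can be written as:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    ratio:     pi_new(a|s) / pi_old(a|s) for sampled actions
    advantage: advantage estimates for the same actions
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum makes the objective pessimistic, so the
    # policy gains nothing by moving far outside the clipping range.
    return np.minimum(unclipped, clipped).mean()
```

With `eps=0.2`, an action whose probability ratio grows to 1.5 is credited only as if the ratio were 1.2; this pessimism is what keeps PPO updates conservative.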
In this paper, we study the robustness of policy optimization (in particular, the Gauss-Newton gradient descent algorithm, which is equivalent to policy iteration in reinforcement learning) subject to noise at each iteration. By invoking the concept of input-to-state stability and using Lyapunov's direct method, it is shown that, if the noise is sufficiently small, the policy iteration algorithm converges to a small neighborhood of the optimal solution even in the presence of noise at each iteration. Explicit expressions for the upper bound on the noise and for the size of the neighborhood to which the policies ultimately converge are provided. Based on Willems' fundamental lemma, a learning-based policy iteration algorithm is proposed. The persistent excitation condition can be readily guaranteed by checking the rank of the Hankel matrix associated with an exploration signal. The robustness of the learning-based policy iteration to measurement noise and unknown system disturbances is established theoretically via the input-to-state stability of policy iteration. Several numerical simulations demonstrate the efficacy of the proposed method.
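The robustness phenomenon described in this abstract can be illustrated on a toy problem. Below is a hedged sketch (our own construction, not the paper's formulation): exact policy iteration for a scalar discrete-time LQR problem, with a small disturbance injected into each policy-improvement (Gauss-Newton) step; the iterates still settle into a small neighborhood of the optimal gain.

```python
import numpy as np

# Scalar LQR instance: x_{k+1} = a*x_k + b*u_k, cost sum q*x^2 + r*u^2.
a, b, q, r = 2.0, 1.0, 1.0, 1.0

def evaluate(K):
    """Solve the scalar Lyapunov equation for the cost of u = -K*x."""
    ac = a - b * K
    assert abs(ac) < 1.0, "policy must be stabilizing"
    return (q + r * K**2) / (1.0 - ac**2)

def improve(P):
    """Gauss-Newton / policy-improvement step."""
    return b * P * a / (r + b**2 * P)

rng = np.random.default_rng(0)
K = 1.5                     # stabilizing initial gain
for _ in range(50):
    P = evaluate(K)
    # Inject a small per-iteration disturbance into the update.
    K = improve(P) + 1e-3 * rng.standard_normal()

# Exact optimum from the scalar discrete Riccati equation: P* = 2 + sqrt(5).
K_star = 2.0 * (2.0 + np.sqrt(5.0)) / (3.0 + np.sqrt(5.0))
# K remains within a small neighborhood of K_star despite the noise.
```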
To achieve an intelligent, automated, self-managing network, dynamic policy configuration and selection are needed. A given policy suits only a particular network environment; when the environment changes, that policy may no longer apply. Policy-based management should therefore include a similar "natural selection" process: useful policies are retained, while policies that have lost their effectiveness are eliminated. A policy optimization method based on evolutionary learning is proposed. According to how often each policy fires, frequently fired policies have their priority raised, rarely fired policies are given lower priority, and policies that have not fired for a long time become dormant. In this way, survival of the fittest among policies is realized, and the degree of self-learning in policy management is improved.
Reinforcement learning faces formidable challenges in intricate decision-making scenarios, primarily due to expansive parameterized action spaces and the vastness of the corresponding policy landscapes. To surmount these difficulties, we devise a practical structured action graph model augmented by guiding policies that incorporate trust-region constraints. On this basis, we propose guided proximal policy optimization with structured action graph (GPPO-SAG), which demonstrates pronounced efficacy in refining policy learning and enhancing performance on sophisticated tasks with parameterized action spaces. Rigorous empirical evaluations were performed on comprehensive gaming platforms, including the full suites of StarCraft II and Hearthstone, yielding highly favorable outcomes. Our source code is available at https://github.com/sachiel321/GPPO-SAG.
Bionic gait learning for quadruped robots based on reinforcement learning has become a hot research topic. The proximal policy optimization (PPO) algorithm has a low probability of learning a successful gait from scratch, owing to problems such as reward sparsity. To address this, we propose an experience evolution proximal policy optimization (EEPPO) algorithm that integrates PPO with prior knowledge highlighted by an evolutionary strategy. Successfully trained samples are used as prior knowledge to guide the learning direction and thus increase the success probability of the learning algorithm. To verify the effectiveness of EEPPO, we conducted simulation experiments of the quadruped robot gait-learning task in PyBullet. The results show that the central pattern generator based radial basis function (CPG-RBF) network and the policy network are updated simultaneously, enabling the quadruped robot to learn a bionic diagonal trot gait using key information such as the robot's speed, posture, and joint states. Comparisons with the traditional soft actor-critic (SAC) algorithm validate the superiority of EEPPO, which learns a more stable diagonal trot gait on flat terrain.
Endowing quadruped robots with the skill of forward jumping helps them overcome barriers and traverse complex terrain. In this paper, a model-free control architecture with target-guided policy optimization and deep reinforcement learning (DRL) for quadruped robot jumping is presented. First, the jump is divided into a take-off phase and a flight-landing phase, and optimal strategies based on soft actor-critic (SAC) are constructed for each phase. Second, policy learning is designed with expectations, penalties over the whole jumping process, and extrinsic excitations; the corresponding policies and constraints support a successful take-off, an excellent flight attitude, and stable standing after landing. To avoid the low efficiency of random exploration, a curiosity module is introduced as an extrinsic reward. Additionally, the target-guided module encourages the robot to explore ever closer to the desired jumping target. Simulation results indicate that the quadruped robot achieves complete forward jumping locomotion with good horizontal and vertical distances as well as excellent motion attitudes.
This paper explores the current status and challenges of international student education in China, with a focus on cross-cultural adaptation and the optimisation of institutional policies. It comes at a time when China is attracting more international students than ever as part of the Belt and Road Initiative. However, international students also report significant cross-cultural adaptation challenges, including language issues, insufficient administrative support, and limited opportunities for social integration. Using a mixed-methods approach that combines quantitative surveys and qualitative interviews, mainly with international students and university administrators from 10 leading Chinese universities, the study found that language proficiency is the biggest barrier to academic integration (78% of respondents reported it as a major barrier) and that institutional support for cross-cultural adaptation often lags behind; for example, only 38% of international students felt that their universities provided sufficient support. The paper recommends reinforcing language support, providing cross-cultural sensitivity training for staff, and creating structured mentorship programmes to improve international students' academic and social integration in China.
This paper examines the transformation and development of the Xinhui Chenpi industry under China's rural revitalization strategy. The study highlights the industry's significant growth, with annual chenpi production reaching approximately 7,000 tons and total output value surpassing 26 billion yuan in 2024. The paper proposes strategies for fostering sustainable growth in the face of challenges such as inefficient production processes, inconsistent product quality, and a lack of policy awareness among operators. These strategies include optimizing support policies, enhancing regulatory frameworks, and leveraging digital technologies for brand building and market expansion. The research contributes to understanding the development trajectory of the Xinhui Chenpi industry and offers insights for policymakers and industry practitioners.
We use the proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy and achieve speed control of a "model-free" quadrotor. The quadrotor is controlled by four learned neural networks that map system states directly to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, speed-tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising offline and online learning is developed for practical use: a model with strong generalization ability is learned in the offline phase, and its flight policy is then continuously optimized in the online phase. Finally, the performance of the proposed algorithm is compared with that of a traditional PID controller.
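The integral-compensator idea, accumulating the tracking error over time to remove steady-state error, can be sketched in isolation. The class below is a generic illustration with assumed names, gains, and limits; it is not the network architecture described in the abstract:

```python
class IntegralCompensator:
    """Generic integral compensator for speed tracking (illustrative).

    The accumulated error can be appended to the state fed to the actor
    network, or added to its output, to shrink steady-state error.
    """

    def __init__(self, ki=0.1, limit=10.0):
        self.ki = ki          # integral gain (assumed value)
        self.limit = limit    # anti-windup clamp on the accumulator
        self.acc = 0.0

    def step(self, error, dt):
        self.acc += error * dt
        # Clamp the accumulator so a persistent error cannot wind it up.
        self.acc = max(-self.limit, min(self.limit, self.acc))
        return self.ki * self.acc
```

A constant tracking error thus produces a compensation term that grows each step until the error is driven out, while the clamp bounds the term during long transients.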
In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to it. The trust-region subproblem is constructed from a surrogate function coherent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line-search scheme that ensures each step improves the model function and stays within the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in a ratio used to update the trust-region radius and decide whether the trial point is accepted. Moreover, for the Gaussian policies commonly used in continuous action spaces, the maximization over the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game-playing tasks from OpenAI Gym.
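The noise-aware acceptance ratio this abstract describes can be illustrated with a generic trust-region radius update. The thresholds and scale factors below are conventional illustrative choices, not the paper's settings; the one paper-specific ingredient sketched here is adding the empirical reward standard deviation to the predicted increase:

```python
def update_radius(actual_inc, predicted_inc, reward_std, radius,
                  eta1=0.1, eta2=0.75, shrink=0.5, expand=2.0):
    """Generic trust-region acceptance test and radius update (sketch).

    reward_std damps the ratio under noisy return estimates, so a lucky
    sample is less likely to be mistaken for real improvement.
    """
    rho = actual_inc / (predicted_inc + reward_std)
    accept = rho >= eta1
    if rho < eta1:
        radius *= shrink      # poor agreement: reject and shrink
    elif rho > eta2:
        radius *= expand      # strong agreement: expand
    return accept, radius
```

When the return estimate is noisy (`reward_std` large), `rho` is pulled toward zero, making the scheme conservative about accepting steps and growing the radius.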
In communication networks with a policy-based Transport Control on-Demand (TCoD) function, the transport control policies have a great impact on network effectiveness. To evaluate and optimize these policies, a policy-based TCoD network model is given, and a comprehensive evaluation index system for network effectiveness is put forward from both the network application and handling mechanism perspectives. A TCoD network prototype system based on Asynchronous Transfer Mode/Multi-Protocol Label Switching (ATM/MPLS) is introduced, and experiments are performed on it. The prototype is evaluated and analyzed with the comprehensive evaluation index system. The results show that the index system can be used to judge whether the communication network meets application requirements, and can provide references for optimizing the transport policies so as to improve network effectiveness.
At the beginning of 2025, the carbon price in China's national carbon market exhibited a continuous unilateral downward trajectory, a departure from the overall steady upward trend since the market's launch in 2021. The analysis suggests that the primary reason for the recent decline is a reversal of supply and demand dynamics in the carbon market, with increased quota supply amid a sluggish economy. Downward pressure on carbon prices is expected to persist in the short term, but with more industries being included and continued policy optimization and improvement, a rise in China's medium- to long-term carbon prices is highly probable. Recommendations for enterprises engaged in carbon asset operations and management: first, refine carbon asset reserves and trading strategies; second, accelerate internal CCER project development; third, explore applications of carbon financial instruments; fourth, establish and improve internal carbon pricing mechanisms; fifth, proactively plan for the inclusion of new industries.
The Unmanned Aerial Vehicle (UAV) is a burgeoning electric transportation carrier that holds substantial promise for the logistics sector. A reinforcement learning framework, Centralized-S Proximal Policy Optimization (C-SPPO), based on a centralized decision process and accounting for policy entropy (S), is proposed. The framework aims to plan the best scheduling scheme with the objective of minimizing both the timeout of order requests and the flight impact of UAVs that may lead to conflicts. In this framework, matching intents are generated from the observations of UAV agents, and the final conflict-free matching results are output under the guidance of a centralized decision maker. Concurrently, a pre-activation operation is introduced to further enhance cooperation among UAV agents. Simulation experiments based on real-world data from New York City show that the proposed C-SPPO outperforms the baseline algorithms in Average Delay Time (ADT), Maximum Delay Time (MDT), Order Delay Rate (ODR), Average Flight Distance (AFD), and Flight Impact Ratio (FIR). Furthermore, the framework scales to scenarios of different sizes without requiring additional training.
This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) that aims to minimize total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, where the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained by the proposed IPPO to improve decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies significantly improve on the basic PPO, and that IPPO outperforms state-of-the-art scheduling methods in both convergence and solution quality.
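The vector reward with dynamic weights mentioned in this abstract amounts to scalarizing the objectives (tardiness, energy) with weights that change over training. A minimal generic sketch, with names and normalization chosen by us rather than taken from the paper:

```python
import numpy as np

def scalarize_reward(reward_vec, weights):
    """Combine a vector reward into a scalar learning signal.

    Weights are renormalized onto the probability simplex, so only
    their relative magnitudes matter as they change over time.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w @ np.asarray(reward_vec, dtype=float))
```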
Dynamic soaring, inspired by the wind-riding flight of birds such as albatrosses, is a biomimetic technique that leverages wind fields to enhance the endurance of unmanned aerial vehicles (UAVs). Achieving a precise soaring trajectory is crucial for maximizing energy efficiency during flight. Existing nonlinear programming methods depend heavily on initial values that are hard to determine. This paper therefore introduces a deep reinforcement learning method based on a differentially flat model for dynamic soaring trajectory planning and optimization. First, the gliding trajectory is parameterized using Fourier basis functions, achieving a flexible trajectory representation with a minimal number of hyperparameters. The trajectory optimization problem is then formulated as a dynamic, interactive Markov decision process, and the trajectory hyperparameters are optimized with the Proximal Policy Optimization (PPO2) algorithm from deep reinforcement learning (DRL), reducing the strong reliance on initial values. Finally, a comparison with the nonlinear programming method shows that the trajectory generated by the proposed approach is smoother while meeting the same performance requirements: the proposed method achieves a 34% reduction in maximum thrust, a 39.4% decrease in maximum thrust difference, and a 33% reduction in maximum airspeed difference.
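A truncated Fourier series gives a trajectory representation with only a handful of tunable numbers, as the abstract describes. The sketch below is an assumed minimal interface for one coordinate (the paper's actual parameterization and symbols may differ):

```python
import numpy as np

def fourier_trajectory(t, a0, coeffs, period):
    """Evaluate one trajectory coordinate as a truncated Fourier series.

    coeffs: list of (a_n, b_n) pairs; a short list keeps the number of
    hyperparameters the optimizer must tune small.
    """
    omega = 2.0 * np.pi / period
    x = a0
    for n, (an, bn) in enumerate(coeffs, start=1):
        x = x + an * np.cos(n * omega * t) + bn * np.sin(n * omega * t)
    return x
```

Because the representation is periodic and smooth by construction, an RL agent can search over the coefficient pairs instead of raw waypoints, which is what removes the dependence on good initial guesses.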
In this paper, we investigate the problem of fast spectrum sharing in vehicle-to-everything communication. To improve the spectrum efficiency of the whole system, the spectrum of vehicle-to-infrastructure links is reused by vehicle-to-vehicle links. To this end, we model the problem as a deep reinforcement learning task and tackle it with proximal policy optimization. A considerable number of interactions are often required to train an agent with good performance, so simulation-based training is commonly used in communication networks. Nevertheless, severe performance degradation may occur when the agent is deployed directly in the real world, even if it performs well in the simulator, due to the reality gap between the simulated and real environments. To address this issue, we make preliminary efforts by proposing an algorithm based on meta reinforcement learning. The algorithm enables the agent to adapt rapidly to a new task using knowledge extracted from similar tasks, leading to fewer interactions and less training time. Numerical results show that our method achieves near-optimal performance and exhibits rapid convergence.
Since last year, China's inbound tourism market has accelerated its recovery. With the introduction and optimization of various facilitation policies and the development of new products, the inbound tourism market has shown great potential for growth. According to the Data Center of the Ministry of Culture and Tourism, the number of inbound tourists reached a new high during the 2025 Spring Festival, and the UK became China's third-largest source of inbound tourists after the Republic of Korea and Japan.
This article studies an inshore-offshore fishery model with impulsive diffusion. The existence and global asymptotic stability of both the trivial periodic solution and the positive periodic solution are obtained, and the complexity of the system is analyzed. Moreover, the optimal harvesting policy is given for the inshore subpopulation, including the maximum sustainable yield and the corresponding harvesting effort.
This paper aims to improve the performance of a class of distributed parameter systems through the optimal switching of actuators and controllers based on event-driven control. It is assumed that, among the available actuators, only one can receive the control signal and be activated over a time interval of unfixed length, while the other actuators remain dormant. After incorporating a state observer into the event generator, the event-driven control loop and the minimum inter-event time are ultimately bounded. Based on event-driven state feedback control, the time intervals of unfixed length can be obtained. The optimal switching policy applies finite-horizon linear quadratic optimal control at the beginning of each time subinterval. A simulation example demonstrates the effectiveness of the proposed policy.
This paper employs a stochastic endogenous growth model, extended to a recursive utility function that can disentangle intertemporal substitution from risk aversion, to analyze productive government expenditure and optimal fiscal policy, with particular stress on the importance of factor income. First, explicit solutions of the central planner's stochastic optimization problem are derived; the growth-maximizing and welfare-maximizing government expenditure policies are obtained, and whether they conflict or coincide depends on intertemporal substitution. Second, explicit solutions of the representative individual's stochastic optimization problem, which permits taxing capital income and labor income separately, are derived; the effect of risk on growth is found to depend crucially on the degree of risk aversion, the intertemporal elasticity of substitution, and the capital income share. Finally, a flexible optimal tax policy that can be internally adjusted to a certain extent is derived, and the distribution of factor income is found to play an important role in designing the optimal tax policy.
Funding (SSE 50ETF option-hedging paper): supported by the Foundation of the Key Laboratory of System Control and Information Processing, Ministry of Education, China (Scip20240111); the Aeronautical Science Foundation of China (Grant 2024Z071108001); and the Foundation of the Key Laboratory of Traffic Information and Safety of Anhui Higher Education Institutes, Anhui Sanlian University (KLAHEI18018).
Funding (policy iteration robustness paper): supported in part by the National Science Foundation (Nos. ECCS-2210320, CNS-2148304).
Funding (network policy optimization paper): National Natural Science Foundation of China (No. 60534020); Cultivation Fund of the Key Scientific and Technical Innovation Project from the Ministry of Education of China (No. 706024); International Science Cooperation Foundation of Shanghai, China (No. 061307041).
Funding (GPPO-SAG paper): supported by the National Natural Science Foundation of China (Nos. 62073324, 6200629, 61771471, and 91748131), and in part by the InnoHK Project, China.
Funding (EEPPO quadruped gait paper): the National Natural Science Foundation of China (No. 62103009).
Funding (quadruped jumping paper): National Natural Science Foundation of China (No. 61773374); National Key Research and Development Program of China (No. 2017YFB1300104).
Abstract: This paper explores the current status and challenges of international student education in China, focusing on cross-cultural adaptation and the optimisation of institutional policies. It comes at a time when China, partly through the Belt and Road Initiative, is attracting more international students than ever. However, international students also report significant cross-cultural adaptation challenges, including language issues, insufficient administrative support, and limited opportunities for social integration. Using a mixed-method approach that combines quantitative surveys and qualitative interviews, mainly with international students and university administrators from 10 leading Chinese universities, this study found that language proficiency is the biggest barrier to academic integration (78% of respondents reported it as a major barrier) and that institutional support for cross-cultural adaptation often lags behind: only 38% of international students felt that their universities provided sufficient support. The paper recommends reinforcing language support, providing cross-cultural sensitivity training for staff, and creating structured mentorship programmes to improve international students' academic and social integration in China.
Funding: Research on the Digital Transformation of the Xinhui Dried Tangerine Peel Industry under the Rural Revitalization Strategy (2023HSQX100).
Abstract: This paper examines the transformation and development of the Xinhui Chenpi industry under China's rural revitalization strategy. The study highlights the industry's significant growth: in 2024, annual Chenpi production reached approximately 7,000 tons and total output value surpassed 26 billion yuan. The paper proposes strategies to foster sustainable growth in the face of challenges such as inefficient production processes, inconsistent product quality, and a lack of policy awareness among operators. These strategies include optimizing support policies, enhancing regulatory frameworks, and leveraging digital technologies for brand building and market expansion. The research contributes to understanding the development trajectory of the Xinhui Chenpi industry and provides insights for policymakers and industry practitioners.
Funding: Supported by the National Key R&D Program of China (No. 2018AAA0101400), the National Natural Science Foundation of China (Nos. 61973074, U1713209, 61520106009, and 61533008), the Science and Technology on Information System Engineering Laboratory (No. 05201902), and the Fundamental Research Funds for the Central Universities, China.
Abstract: We use the advanced proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy and achieve speed control of a "model-free" quadrotor. The vehicle is controlled by four learned neural networks, which directly map system states to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, speed tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising both offline and online learning is developed for practical use. A model with strong generalization ability is learned in the offline phase; the flight policy is then continuously optimized in the online phase. Finally, the performance of the proposed algorithm is compared with that of a traditional PID controller.
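One common way to realize an integral compensator in an actor-critic loop is to augment the policy's observation with the speed error and its running integral, so the network can remove steady-state error much like the I-term of a PID controller. The sketch below assumes this construction; class and parameter names are illustrative, not taken from the paper:

```python
import numpy as np

class IntegralCompensator:
    """Augments the observation with the integral of the speed error.

    Feeding the accumulated error into the actor network lets the
    learned policy drive the steady-state speed error to zero,
    analogous to the integral term of a PID controller.
    """
    def __init__(self, dt=0.01):
        self.dt = dt          # control period in seconds
        self.integral = 0.0   # accumulated speed error

    def augment(self, obs, speed, speed_ref):
        error = speed_ref - speed
        self.integral += error * self.dt
        return np.concatenate([np.asarray(obs, float),
                               [error, self.integral]])
```

The augmented vector, rather than the raw state alone, is what the actor and critic networks would consume at each step.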
Funding: The computational results were obtained on GPUs supported by the National Engineering Laboratory for Big Data Analysis and Applications and the High-performance Computing Platform of Peking University.
Abstract: In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed from a surrogate function coherent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme to ensure that each step improves the model function and stays within the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game playing tasks from OpenAI Gym.
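The noise-corrected acceptance test described above can be sketched as a classical trust-region radius update in which the empirical standard deviation is added to the predicted increase. The thresholds and names below are illustrative defaults, not the paper's exact values:

```python
def tr_ratio_and_radius(actual_increase, predicted_increase, reward_std,
                        radius, eta1=0.25, eta2=0.75,
                        shrink=0.5, expand=2.0):
    """Trust-region acceptance ratio with a sampling-noise correction.

    Adding the empirical standard deviation of the estimated return to
    the predicted increase prevents noisy estimates from shrinking the
    radius too aggressively. Returns (ratio, new_radius, accepted).
    """
    rho = actual_increase / (predicted_increase + reward_std)
    if rho < eta1:            # poor agreement: shrink, reject the step
        return rho, radius * shrink, False
    if rho > eta2:            # good agreement: expand, accept the step
        return rho, radius * expand, True
    return rho, radius, True  # acceptable step: keep the radius
```

Without the `reward_std` term this reduces to the textbook trust-region ratio test; with it, borderline steps under high sampling variance are judged more leniently.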
基金Supported by the National 863 Program (No.2007AA-701210)
Abstract: In communication networks with a policy-based Transport Control on-Demand (TCoD) function, the transport control policies have a great impact on network effectiveness. To evaluate and optimize the transport policies in a communication network, a policy-based TCoD network model is given, and a comprehensive evaluation index system for network effectiveness is put forward from both the network application and the handling mechanism perspectives. A TCoD network prototype system based on Asynchronous Transfer Mode/Multi-Protocol Label Switching (ATM/MPLS) is introduced and experiments are performed on it. The prototype system is evaluated and analyzed with the comprehensive evaluation index system. The results show that the index system can be used to judge whether the communication network meets application requirements, and can provide references for optimizing the transport policies so as to improve communication network effectiveness.
Abstract: At the beginning of 2025, the carbon price in China's national carbon market exhibited a continuous unilateral downward trajectory, a departure from the overall steady upward trend since the market launched in 2021. The analysis suggests that the primary reason for the recent decline is a reversal of supply and demand dynamics in the carbon market, with increased quota supply amid a sluggish economy. Downward pressure on carbon prices is expected to persist in the short term, but with more industries being included and continued policy optimization and improvement, a rise in China's medium- to long-term carbon prices is highly probable. Recommendations for enterprises involved in carbon asset operations and management: first, refining carbon asset reserves and trading strategies; second, accelerating internal CCER project development; third, exploring applications of carbon financial instruments; fourth, establishing and improving internal carbon pricing mechanisms; fifth, proactively planning for the inclusion of new industries.
基金the support of the Chinese Special Research Project for Civil Aircraft(No.MJZ17N22)the National Natural Science Foundation of China(Nos.U2133207,U2333214)+1 种基金the China Postdoctoral Science Foundation(No.2023M741687)the National Social Science Fund of China(No.22&ZD169)。
Abstract: The Unmanned Aerial Vehicle (UAV) stands as a burgeoning electric transportation carrier, holding substantial promise for the logistics sector. A reinforcement learning framework, Centralized-S Proximal Policy Optimization (C-SPPO), based on a centralized decision process and accounting for policy entropy (S), is proposed. The framework plans the best scheduling scheme with the objective of minimizing both the timeout of order requests and the flight impact of UAVs that may lead to conflicts. In this framework, matching intents are generated from the observations of UAV agents, and the ultimate conflict-free matching results are output under the guidance of a centralized decision maker. A pre-activation operation is also introduced to further enhance cooperation among UAV agents. Simulation experiments based on real-world data from New York City show that the proposed C-SPPO outperforms the baseline algorithms in Average Delay Time (ADT), Maximum Delay Time (MDT), Order Delay Rate (ODR), Average Flight Distance (AFD), and Flight Impact Ratio (FIR). Furthermore, the framework scales to scenarios of different sizes without requiring additional training.
基金partially supported by the National Key Research and Development Program of the Ministry of Science and Technology of China(2022YFE0114200)the National Natural Science Foundation of China(U20A6004).
Abstract: This paper investigates a distributed heterogeneous hybrid blocking flow-shop scheduling problem (DHHBFSP) designed to minimize total tardiness and total energy consumption simultaneously, and proposes an improved proximal policy optimization (IPPO) method to make real-time decisions for the DHHBFSP. A multi-objective Markov decision process is modeled for the DHHBFSP, in which the reward function is represented by a vector with dynamic weights instead of the common objective-related scalar value. A factory agent (FA) is formulated for each factory to select unscheduled jobs and is trained with the proposed IPPO to improve decision quality. Multiple FAs work asynchronously to allocate jobs that arrive randomly at the shop. A two-stage training strategy is introduced in the IPPO, which learns from both single- and dual-policy data for better data utilization. The proposed IPPO is tested on randomly generated instances and compared with variants of the basic proximal policy optimization (PPO), dispatch rules, multi-objective metaheuristics, and multi-agent reinforcement learning methods. Extensive experimental results suggest that the proposed strategies significantly improve on the basic PPO, and that IPPO outperforms state-of-the-art scheduling methods in both convergence and solution quality.
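The vector reward with dynamic weights can be illustrated by a simple weighted scalarization: a per-objective reward vector (e.g. negated tardiness and energy terms) is collapsed into a scalar under the weight vector in force at that point in training. The function below is our illustration, including the simplex normalization, not the paper's exact scheme:

```python
import numpy as np

def scalarize(reward_vec, weights):
    """Scalarize a multi-objective reward vector with dynamic weights.

    reward_vec : per-objective rewards, e.g. [-tardiness, -energy]
    weights    : current weight vector (re-drawn or adapted during
                 training rather than fixed, hence "dynamic")
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # keep the weights on the simplex
    return float(np.dot(w, np.asarray(reward_vec, dtype=float)))
```

Varying the weights over training exposes the agent to different trade-offs between tardiness and energy, instead of committing to a single fixed scalar objective.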
基金support received by the National Natural Science Foundation of China(Grant Nos.52372398&62003272).
Abstract: Dynamic soaring, inspired by the wind-riding flight of birds such as albatrosses, is a biomimetic technique that leverages wind fields to extend the endurance of unmanned aerial vehicles (UAVs). Achieving a precise soaring trajectory is crucial for maximizing energy efficiency during flight. Existing nonlinear programming methods depend heavily on the choice of initial values, which are hard to determine. This paper therefore introduces a deep reinforcement learning method based on a differentially flat model for dynamic soaring trajectory planning and optimization. First, the gliding trajectory is parameterized with Fourier basis functions, achieving a flexible trajectory representation with a minimal number of hyperparameters. The trajectory optimization problem is then formulated as a dynamic, interactive Markov decision process, and the trajectory hyperparameters are optimized using the Proximal Policy Optimization (PPO2) algorithm from deep reinforcement learning (DRL), reducing the strong reliance on initial value settings. Finally, a comparison with the nonlinear programming method shows that the trajectory generated by the proposed approach is smoother while meeting the same performance requirements: the proposed method achieves a 34% reduction in maximum thrust, a 39.4% decrease in maximum thrust difference, and a 33% reduction in maximum airspeed difference.
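The Fourier parameterization can be sketched directly: a truncated Fourier series whose handful of coefficients are the hyperparameters the PPO2 agent tunes, instead of optimizing the full state trajectory point by point. The function below is a generic illustration of one trajectory coordinate, not the paper's exact basis:

```python
import numpy as np

def fourier_trajectory(t, a0, a, b, period):
    """Evaluate a periodic trajectory from truncated Fourier coefficients.

    x(t) = a0 + sum_k [ a_k cos(2 pi k t / T) + b_k sin(2 pi k t / T) ]

    t      : array of time samples
    a0     : constant offset
    a, b   : cosine and sine coefficients, one pair per harmonic
    period : period T of the soaring cycle
    """
    t = np.asarray(t, dtype=float)
    x = np.full_like(t, a0, dtype=float)
    for k, (ak, bk) in enumerate(zip(a, b), start=1):
        w = 2.0 * np.pi * k / period
        x += ak * np.cos(w * t) + bk * np.sin(w * t)
    return x
```

Because only `(a0, a, b)` are free, the RL action space stays small and every candidate trajectory is smooth by construction.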
基金L.Liang was supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK20220810in part by the National Natural Science Foundation of China under Grant 62201145 and Grant 62231019S.Jin was supported in part by the National Natural Science Foundation of China(NSFC)under Grants 62261160576,62341107,61921004。
Abstract: In this paper, we investigate the problem of fast spectrum sharing in vehicle-to-everything communication. To improve the spectrum efficiency of the whole system, the spectrum of vehicle-to-infrastructure links is reused by vehicle-to-vehicle links. To this end, we model it as a deep reinforcement learning problem and tackle it with proximal policy optimization. A considerable number of interactions are often required to train an agent with good performance, so simulation-based training is commonly used in communication networks. Nevertheless, severe performance degradation may occur when the agent is deployed directly in the real world, even if it performs well on the simulator, because of the reality gap between the simulated and real environments. To address this issue, we make preliminary efforts by proposing an algorithm based on meta reinforcement learning. This algorithm enables the agent to adapt rapidly to a new task using knowledge extracted from similar tasks, leading to fewer interactions and less training time. Numerical results show that our method achieves near-optimal performance and exhibits rapid convergence.
Abstract: Since last year, China's inbound tourism market has accelerated its recovery. With the introduction and optimization of various facilitation policies and the development of new products, the inbound tourism market has shown unlimited potential for growth. According to data from the Data Center of the Ministry of Culture and Tourism, the number of inbound tourists reached a new high during the Spring Festival in 2025. The UK became China's third largest source of inbound tourists after the Republic of Korea and Japan.
Abstract: This article studies an inshore-offshore fishery model with impulsive diffusion. The existence and global asymptotic stability of both the trivial periodic solution and the positive periodic solution are obtained, and the complexity of the system is analyzed. Moreover, the optimal harvesting policy is given for the inshore subpopulation, including the maximum sustainable yield and the corresponding harvesting effort.
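The paper's impulsive-diffusion model is more elaborate, but the maximum sustainable yield has a classical closed form in the underlying logistic baseline: for dx/dt = r x (1 - x/K) - E x, the sustained yield Y(E) = E K (1 - E/r) is maximized at effort E* = r/2, giving MSY = rK/4 at stock level K/2. A small sketch of this textbook result:

```python
def logistic_msy(r, K):
    """Maximum sustainable yield for logistic growth with effort harvesting.

    For dx/dt = r*x*(1 - x/K) - E*x, the sustained yield
    Y(E) = E*K*(1 - E/r) is maximized at E* = r/2, giving
    Y_max = r*K/4 at the equilibrium stock level x* = K/2.
    Returns (optimal effort, MSY, equilibrium stock).
    """
    E_star = r / 2.0
    msy = r * K / 4.0
    x_star = K / 2.0
    return E_star, msy, x_star
```

For example, a stock with intrinsic growth rate r = 0.8 and carrying capacity K = 100 supports a maximum sustainable yield of 20 at effort 0.4.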
基金supported by the National Natural Science Foundation of China(Grant Nos.61174021 and 61104155)the Fundamental Research Funds for theCentral Universities,China(Grant Nos.JUDCF13037 and JUSRP51322B)+1 种基金the Programme of Introducing Talents of Discipline to Universities,China(GrantNo.B12018)the Jiangsu Innovation Program for Graduates,China(Grant No.CXZZ13-0740)
Abstract: This paper aims to improve the performance of a class of distributed parameter systems through the optimal switching of actuators and controllers based on event-driven control. It is assumed that, among the available actuators, only one can receive the control signal and be activated over an unfixed time interval, while the other actuators remain dormant. After incorporating a state observer into the event generator, the event-driven control loop and the minimum inter-event time are ultimately bounded. Based on event-driven state feedback control, time intervals of unfixed length can be obtained. The optimal switching policy is based on finite-horizon linear quadratic optimal control at the beginning of each time subinterval. A simulation example demonstrates the effectiveness of the proposed policy.
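A standard relative-threshold event generator of the kind used in event-driven control can be sketched as follows; the trigger rule and parameters here are a generic illustration, not the paper's observer-based generator:

```python
import numpy as np

def event_triggered(x, x_last_sent, sigma=0.1, eps=1e-3):
    """Relative-threshold event trigger for event-driven control.

    A new control update is generated only when the measurement error
    e = x_last_sent - x grows beyond a fraction sigma of the current
    state norm, plus a small offset eps that rules out continuous
    triggering near the origin (helping enforce a minimum
    inter-event time).
    """
    x = np.asarray(x, dtype=float)
    e = np.asarray(x_last_sent, dtype=float) - x
    return bool(np.linalg.norm(e) > sigma * np.linalg.norm(x) + eps)
```

Between triggering instants the controller holds the last transmitted state, so communication and actuation occur only when the stored value has drifted too far from the true state.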
Abstract: This paper employs a stochastic endogenous growth model, extended to a recursive utility function that disentangles intertemporal substitution from risk aversion, to analyze productive government expenditure and optimal fiscal policy, with particular stress on the importance of factor income. First, explicit solutions of the central planner's stochastic optimization problem are derived, and the growth-maximizing and welfare-maximizing government expenditure policies are obtained; whether they conflict or coincide depends on intertemporal substitution. Second, explicit solutions of the representative individual's stochastic optimization problem, which permits taxing capital income and labor income separately, are derived; the effect of risk on growth is found to depend crucially on the degree of risk aversion, the intertemporal elasticity of substitution, and the capital income share. Finally, a flexible optimal tax policy that can be internally adjusted to a certain extent is derived, and the distribution of factor income is found to play an important role in designing the optimal tax policy.