To address the common mismatch between feature design and reward mechanisms that limits the effectiveness and applicability of deep reinforcement learning in multi-agent environments, this paper proposes an Architecture-Feature-Reward co-design framework (AFRD) to systematically guide the extension of single-agent methods to multi-agent scenarios. Built on CTDE (centralized training with decentralized execution), the framework introduces key local and global information at the feature level and aligns individual objectives with the overall system objective at the reward level, yielding a transferable design methodology. Taking computation offloading in edge computing as the application scenario, AFRD-PPO is implemented by applying the AFRD framework to the PPO algorithm, and experiments are conducted under three typical offloading modes to compare the convergence performance of different feature and reward combinations and to further analyze their effect on convergence stability. Experimental results show that the AFRD framework effectively improves the convergence stability and applicability of deep reinforcement learning in multi-agent environments. The work provides a useful reference for research and applications in related fields.
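The abstract does not give AFRD's exact reward formulation; a minimal sketch of one common way to align individual and system objectives under CTDE, blending each agent's local reward with a shared global term, might look like the following (the function name, the `alpha` weight, and the mean-based global signal are illustrative assumptions, not the paper's method):

```python
def shaped_rewards(local_rewards, alpha=0.5):
    """Per-agent reward blending individual and system-level objectives.

    local_rewards: list of per-agent scalar rewards for one time step.
    Each agent receives alpha * own_reward + (1 - alpha) * mean(all rewards),
    so alpha = 1.0 recovers fully selfish agents and alpha = 0.0 a fully
    shared team reward.
    """
    g = sum(local_rewards) / len(local_rewards)  # system-level signal
    return [alpha * r + (1.0 - alpha) * g for r in local_rewards]
```

During centralized training, the global term `g` can use information that is unavailable at decentralized execution time, since only the trained policies, not the shaped rewards, are needed for deployment.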
Opportunistic mobile crowdsensing (MCS) non-intrusively exploits human mobility trajectories, and participants' smart devices acting as sensors have become a promising paradigm for various urban data acquisition tasks. In practice, however, opportunistic MCS faces several challenges from the perspectives of both the MCS participants and the data platform. On the one hand, participants face uncertainties in conducting MCS tasks, including their mobility and implicit interactions with other participants, and the economic returns given by the MCS data platform are determined not only by their own actions but also by other participants' strategic actions. On the other hand, the platform can only observe the participants' uploaded sensing data, which depends on the unknown effort/action exerted by participants, while, to optimize its overall objective, the platform needs to properly reward certain participants to incentivize them to provide high-quality data. To address the challenge of balancing individual incentives and platform objectives in MCS, this paper proposes MARCS, an online sensing policy based on multi-agent deep reinforcement learning (MADRL) with centralized training and decentralized execution (CTDE). Specifically, the interactions between MCS participants and the data platform are modeled as a partially observable Markov game in which participants, acting as agents, use DRL-based policies to make decisions based on local observations such as task trajectories and platform payments. To align individual and platform goals effectively, the platform leverages the Shapley value to estimate the contribution of each participant's sensed data, using these estimates as immediate rewards to guide agent training. Experimental results on real mobility trajectory datasets indicate that the revenue of MARCS is approximately 35%, 53%, and 100% higher than that of DDPG, Actor-Critic, and model predictive control (MPC), respectively, on the participant side, with similar results on the platform side, demonstrating superior performance over the baselines.
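The abstract states that the platform uses Shapley values of participants' sensed data as immediate rewards. Exact Shapley computation is exponential in the number of participants, so a Monte Carlo permutation estimate is a common approximation; the sketch below assumes a hypothetical `platform_value` coalition-utility function and is not the paper's implementation:

```python
import random

def shapley_rewards(participants, platform_value, num_samples=200, seed=0):
    """Monte Carlo estimate of each participant's Shapley value.

    participants: iterable of participant ids.
    platform_value: function mapping a frozenset of participant ids to the
        platform's utility from their combined sensed data (assumed given).
    Averages each participant's marginal contribution over random join orders.
    """
    rng = random.Random(seed)
    phi = {p: 0.0 for p in participants}
    for _ in range(num_samples):
        order = list(participants)
        rng.shuffle(order)               # random coalition-formation order
        coalition = set()
        prev = platform_value(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = platform_value(frozenset(coalition))
            phi[p] += cur - prev         # marginal contribution of p
            prev = cur
    return {p: v / num_samples for p, v in phi.items()}
```

For an additive utility (each participant's data contributes independently), the estimate recovers each participant's own value exactly; the sampling only matters when data from different participants is redundant or complementary. The estimated values can then be paid out as the per-step rewards that drive agent training.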
Funding: sponsored by the Qinglan Project of Jiangsu Province and the Jiangsu Provincial Key Research and Development Program (No. BE2020084-1).