Abstract
DouDiZhu is a typical imperfect-information game. Its decision-making involves multiple players, a huge action space, and the coexistence of cooperation and competition, which makes a plain Monte Carlo Tree Search (MCTS) inefficient. To improve the strategy quality and search efficiency of MCTS, a DouDiZhu game model is proposed that combines the Proximal Policy Optimization (PPO) algorithm with MCTS. First, the PPO algorithm is employed to learn card-position and strategy information in DouDiZhu and to train a policy model that outputs action probabilities for the current situation, providing strategic guidance for the selection and simulation stages of MCTS. Then, in the selection stage, the selection formula is adjusted with the action probabilities output by the PPO policy model, guiding the search toward high-quality action nodes. Finally, in the simulation stage, PPO replaces the random rollout, making simulations more consistent with the learned strategy and reducing the exploration of inefficient paths. Experimental results show that the PPO-optimized MCTS not only improves decision-making efficiency but also markedly increases the win rate, demonstrating a strong decision-making advantage in DouDiZhu.
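The abstract does not give the exact selection formula, but folding a policy network's action probabilities into MCTS selection is commonly done with a PUCT-style score (as in AlphaZero). The following is a minimal illustrative sketch, not the authors' implementation; the function names and the child-node representation are assumptions for illustration only.

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    # Exploitation term (mean value Q) plus an exploration bonus that is
    # weighted by the policy model's prior probability for this action.
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration

def select_child(children, parent_visits, c_puct=1.5):
    # Pick the child (candidate move) with the highest PUCT score.
    return max(
        children,
        key=lambda ch: puct_score(ch["q"], ch["prior"],
                                  parent_visits, ch["visits"], c_puct),
    )

# Hypothetical example: an unexplored move with a high policy prior is
# preferred over a moderately valued, already well-explored move.
children = [
    {"action": "pass",      "q": 0.1, "prior": 0.7, "visits": 0},
    {"action": "play_pair", "q": 0.3, "prior": 0.2, "visits": 10},
]
best = select_child(children, parent_visits=10)
```

Here `best["action"]` is `"pass"`: the large prior keeps its exploration bonus high while its visit count is still zero. The same priors can also drive the simulation stage, sampling rollout actions from the policy distribution instead of uniformly at random.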
Authors
WANG Shipeng; WANG Yajie; WU Yanyan; GUO Qilong; ZHAO Tianyu (School of Computer, Shenyang Aerospace University, Shenyang 110136, China; Engineering Training Center, Shenyang Aerospace University, Shenyang 110136, China)
Source
Journal of Chongqing University of Technology (Natural Science)
Peking University Core Journal (北大核心)
2025, No. 8, pp. 126-133 (8 pages)
Funding
China Association for Science and Technology Science Popularization Capability Improvement Project (KXYJS2022092).