Pipe-RLHF: A Computation Mode-Aware Parallel Framework for RLHF
Abstract Reinforcement learning with human feedback (RLHF) is the mainstream method for aligning large language models (LLMs) with human preferences, yet its core optimization algorithm, proximal policy optimization (PPO), is also its most costly part and suffers from significant efficiency problems. PPO consists of three dependent stages (generation, inference, and training), each with different computational characteristics. Existing RLHF parallel frameworks nevertheless execute all PPO stages sequentially under the same parallelization strategy, which causes two problems: first, the generation stage cannot fully utilize computational resources, which degrades overall efficiency; second, the stages run strictly serially, leaving potential inter-stage parallelism unexploited. To address these problems, we introduce Pipe-RLHF, a parallelism framework for RLHF fine-tuning that adaptively selects a parallelization strategy for each stage according to its computation mode and breaks the serial-stage paradigm by using an asynchronous PPO algorithm to exploit inter-stage parallelism. Specifically, we first investigate the characteristics of the computation modes to find the parallelization approach that best fits each, and present a novel delayed inter-batch pipeline parallelization approach designed for the PPO generation stage, which significantly improves that stage's utilization of computational resources. Second, we use asynchronous PPO to relax the dependencies between stages so that inter-stage parallelism can be applied to accelerate PPO. Finally, building on the proposed inter-batch pipeline parallelization approach, we define a hierarchical parallel plan space for distributed RLHF fine-tuning and present optimization algorithms that search this space for the plan minimizing overall time consumption. Evaluation across multiple LLMs demonstrates that Pipe-RLHF achieves up to 3.7× speedup over existing methods while exhibiting near-linear scalability.
Authors Xu Ying; Wang Mengdi; Cheng Long; Liu Lian; Zhao Shixin; Zhang Lei; Wang Ying (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100190; School of Control and Computer Engineering, North China Electric Power University, Beijing 102206)
Source Journal of Computer Research and Development (《计算机研究与发展》), Peking University Core Journal, 2025, No. 6, pp. 1513-1529 (17 pages)
Funding National Natural Science Foundation of China (92473205); National Key Research and Development Program of China (2023YFB4404400).
Keywords reinforcement learning with human feedback (RLHF); proximal policy optimization (PPO); large language model fine-tuning; distributed systems; parallel computing
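
The abstract's point that the generation stage underutilizes resources can be illustrated with a toy model. The sketch below is not the paper's delayed inter-batch pipeline method. It is a simplified simulation, under stated assumptions, of plain pipeline-parallel autoregressive decoding: every (stage, batch, token) work item takes one time unit, a batch must finish all stages for one token before starting the next token, and each stage serves one batch at a time. The function names `simulate_decode` and `utilization` are hypothetical.

```python
def simulate_decode(num_stages, num_batches, num_tokens, interleave):
    """Return the makespan (in unit time steps) of a toy decoding pipeline.

    Assumption: one work item = one stage processing one batch for one token,
    and every work item costs exactly one time unit.
    """
    stage_of = [0] * num_batches  # next pipeline stage each batch must visit
    token_of = [0] * num_batches  # tokens fully generated per batch
    ready_at = [0] * num_batches  # earliest time each batch can be served
    time = 0
    while any(done < num_tokens for done in token_of):
        unfinished = [b for b in range(num_batches) if token_of[b] < num_tokens]
        # Without interleaving, only one batch occupies the pipeline at a time.
        active = unfinished if interleave else unfinished[:1]
        for stage in range(num_stages):
            for b in active:
                if stage_of[b] == stage and ready_at[b] <= time:
                    ready_at[b] = time + 1
                    stage_of[b] += 1
                    if stage_of[b] == num_stages:
                        # Token finished; the batch re-enters at stage 0.
                        stage_of[b] = 0
                        token_of[b] += 1
                    break  # each stage serves at most one batch per step
        time += 1
    return time


def utilization(num_stages, num_batches, num_tokens, makespan):
    """Fraction of stage-time slots that did useful work."""
    total_work = num_stages * num_batches * num_tokens
    return total_work / (num_stages * makespan)


if __name__ == "__main__":
    for interleave in (False, True):
        span = simulate_decode(4, 4, 8, interleave)
        print(interleave, span, round(utilization(4, 4, 8, span), 3))
```

With 4 stages, 4 batches, and 8 tokens per batch, this toy model takes 128 time units when batches run one after another (25% utilization) and 35 time units when batches are interleaved across stages (about 91% utilization). Real systems add communication cost, variable sequence lengths, and memory limits that the sketch ignores.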